# What is Kubernetes Orchestration? Benefits & Best Practices

```text
[2024-05-22T03:15:02.482Z] ERROR: k8s-api-server: pod/checkout-service-7f8d9b6c4-x9z2q status: CrashLoopBackOff
[2024-05-22T03:15:04.112Z] INFO: kubelet: Back-off restarting failed container checkout-service in pod checkout-service-7f8d9b6c4-x9z2q_prod(8d2f…)
[2024-05-22T03:15:10.991Z] WARN: etcd: request "PUT /registry/services/endpoints/prod/checkout-service" took too long (1.4s)
[2024-05-22T03:15:12.001Z] FATAL: controller-manager: Failed to sync node status for ip-10-0-42-11.ec2.internal: connection refused
[2024-05-22T03:15:15.442Z] STACKTRACE: goroutine 4021 [running]: k8s.io/kubernetes/pkg/controller/endpoint.(*Controller).syncService…
```

It’s 3:15 AM. My third cup of lukewarm, sludge-like coffee is staring back at me with more life in it than I have left in my marrow. I’ve been staring at this terminal for 72 hours. My eyes feel like they’ve been scrubbed with industrial-grade sandpaper. The Slack "knock-brush" sound is currently triggering a fight-or-flight response that my adrenal glands are too exhausted to fulfill. 

We were told Kubernetes would be the "operating system of the cloud." They told us it would handle the heavy lifting. They lied. Kubernetes isn't an operating system; it’s a Rube Goldberg machine built out of brittle YAML and false promises, held together by the sheer, desperate willpower of SREs who just want to sleep for six consecutive hours.

## The Hubris of the v1.30.2 Migration

It started with a Jira ticket. "Upgrade cluster to v1.30.2 to leverage latest security patches and API stability." We were on 1.28. We were happy. Or as happy as you can be when you’re managing a fleet of 400 nodes. But the "thought leaders" in the architecture guild decided we needed to be on the bleeding edge. 

The migration to v1.30.2 wasn't just a version bump; it was a descent into an architectural abyss. They graduated `FlowSchema` and `PriorityLevelConfiguration` to `v1`, and suddenly, our custom admission controllers—the ones written by a guy who left the company three years ago to start a goat farm—started vomiting errors. The API server began throttling requests because the default concurrency limits in v1.30.2 are more aggressive than a debt collector.
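
What eventually stopped the throttling was giving the legacy controllers their own priority level instead of letting them fight it out in the global defaults. Roughly this shape; the API group is `flowcontrol.apiserver.k8s.io/v1` as of this release, but the name and the numbers below are ours, not gospel:

```yaml
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: PriorityLevelConfiguration
metadata:
  name: legacy-admission           # stand-in name for the goat farmer's controllers
spec:
  type: Limited
  limited:
    nominalConcurrencyShares: 50   # extra slice of API server concurrency
    limitResponse:
      type: Queue
      queuing:
        queues: 16
        queueLengthLimit: 50
        handSize: 4
```

You still need a FlowSchema to actually route those controllers' requests into it, which is its own forty lines of YAML nobody will ever read again.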

We thought we could just "orchestrate" our way out of it. We thought the control plane would protect us. But the control plane is a fickle god. When you’re running at scale, the abstraction layer doesn't hide complexity; it just buries it under ten layers of opaque logging and "helpful" automation that deletes your production workloads because a heartbeat packet was three milliseconds late.

## The YAML Indentation that Cost Us $40k

Let’s talk about the "configuration" aspect of orchestration. We’re told that declarative configuration is the path to enlightenment. In reality, it’s a path to a $40,000 AWS bill because a junior dev missed two spaces in a `resources` block.

Look at this manifest. This is what brought down the checkout service. See if you can spot the "orchestration" feature that turned into a bug.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
  namespace: prod
spec:
  replicas: 50
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
      - name: checkout-service
        image: internal-registry.io/checkout:v2.4.1
        ports:
        - containerPort: 8080
        resources:
          requests:
            memory: "256Mi"
            cpu: "100m"
            limits:
              memory: "2Gi"
              cpu: "1"
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 3
          periodSeconds: 3
```

Did you see it? `requests` and `limits` are supposed to be siblings under `resources`. Look at the indentation of `limits`: it’s nested inside `requests`, courtesy of a copy-paste from a “best practices” blog post. The parser didn’t complain. It just ignored the limits entirely.
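
For the record, this is all the block was supposed to say. Two keys, siblings, both indented under `resources`:

```yaml
        resources:
          requests:
            memory: "256Mi"
            cpu: "100m"
          limits:
            memory: "2Gi"
            cpu: "1"
```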

The result? Fifty pods spun up with no CPU or memory caps. The checkout-service, which has a memory leak we’ve been “monitoring” since 2022, proceeded to consume every available byte on the worker nodes. The OOMKiller, K8s’s resident sociopath, woke up and started murdering critical system processes—including the kubelet and the log-collector.

The “orchestrator” saw the pods dying and, in its infinite, automated wisdom, decided the best course of action was to reschedule them onto the other healthy nodes. It was a digital plague. A cascading failure of “self-healing” that wiped out three availability zones in fifteen minutes.
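
The bitter postscript is that a LimitRange in the namespace would have slapped default caps on those pods no matter how badly the manifest was indented. One lives in `prod` now; the values are ours, pick your own:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: prod
spec:
  limits:
    - type: Container
      default:                 # applied when a container specifies no limits
        memory: "1Gi"
        cpu: "500m"
      defaultRequest:          # applied when a container specifies no requests
        memory: "256Mi"
        cpu: "100m"
```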

## Etcd is Not Your Friend

If Kubernetes is the brain, etcd is the nervous system. And our nervous system is currently having a grand mal seizure.

When you scale a cluster, everyone talks about node count. Nobody talks about etcd commit latency. We’re running on NVMe drives, and we’re still seeing `request took too long` warnings. Why? Because orchestration creates a deafening amount of chatter. Every time a pod changes state, every time a Secret is updated, every time a ConfigMap is touched, etcd has to reach consensus.

```text
# Checking the health of the nervous system while the world burns
$ kubectl get componentstatuses
Warning: v1 ComponentStatus is deprecated in v1.19+
NAME                 STATUS      MESSAGE                                                      ERROR
controller-manager   Healthy     ok
scheduler            Healthy     ok
etcd-0               Unhealthy   {"health":"false","error":"etcdserver: request timed out"}
```

When etcd lags, the entire illusion of orchestration shatters. The API server starts returning 504s. The kube-scheduler stops scheduling. You’re left with a static snapshot of a dying system, and you can’t even run `kubectl delete pod` to stop the bleeding, because the deletion can’t be persisted to the state store.

We spent six hours tuning `--quota-backend-bytes` and `--heartbeat-interval`. We weren’t “innovating.” We weren’t “delivering value.” We were performing open-heart surgery on a database that thinks 10ms of disk latency is a death sentence.
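
For the morbidly curious, the knobs live in the etcd static pod manifest. This is roughly where we landed; the numbers are what worked for our disks, not a recommendation:

```yaml
# Excerpt from /etc/kubernetes/manifests/etcd.yaml (values illustrative)
spec:
  containers:
    - name: etcd
      command:
        - etcd
        - --quota-backend-bytes=8589934592   # 8 GiB; the default backend quota is 2 GiB
        - --heartbeat-interval=250           # ms; default 100, raised for jittery peers
        - --election-timeout=2500            # ms; keep it roughly 10x the heartbeat
```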

## The CNI Rabbit Hole: Flannel, Calico, and the Ghost of BGP

Orchestration implies that networking is a solved problem. “Just use a CNI,” they say. “It’s a flat network,” they say.

They don’t mention the VXLAN overhead. They don’t mention the conntrack table exhaustion that happens when your microservices start talking to each other like caffeinated teenagers. We’re using Calico with BGP peering, and yesterday, a top-of-rack switch decided it didn’t like the number of routes we were pushing.
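
The conntrack half of that problem at least has knobs you can turn. They live in the kube-proxy configuration; the fields are real, the numbers below are just where we landed. The BGP half does not.

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
conntrack:
  maxPerCore: 131072          # default 32768; chatty microservices chew through this
  min: 524288                 # floor for the table regardless of core count
  tcpEstablishedTimeout: 2h   # default 24h keeps dead flows pinned in the table
```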

The “orchestrator” didn’t know. It kept sending traffic into a black hole.

```text
# Digging through the wreckage of the node-to-node mesh
$ ip route show type unicast
default via 10.0.0.1 dev eth0 proto dhcp src 10.0.42.11 metric 100
10.244.0.0/24 via 10.244.0.0 dev flannel.1 onlink
10.244.1.0/24 via 10.244.1.0 dev flannel.1 onlink
# Error: RTNETLINK answers: No such process
```

I had to explain to a Product Manager why the “seamless” failover didn’t work. I had to explain that the “overlay network” had an MTU mismatch with the “underlay network,” causing fragmented packets to be dropped by the firewall. He asked if we could “just use a LoadBalancer.” I almost threw my mechanical keyboard through the window.
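
The actual fix was one number. If you run Calico through the Tigera operator, the knob sits on the Installation resource; the right value depends entirely on your underlay (1450 for a 1500-byte network with VXLAN, 8951 if you have jumbo frames):

```yaml
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    mtu: 8951   # underlay MTU (9001 on AWS) minus ~50 bytes of VXLAN overhead
```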

The plumbing required to make two pods talk to each other across a VPC boundary is more complex than the actual application code. We have `kube-proxy` running in IPVS mode, CoreDNS struggling with `ndots:5` search suffixes that turn every internal request into five DNS lookups, and a service mesh that adds 10ms of latency to every hop just so we can have a pretty dashboard that nobody looks at.
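
The only relief we found for the ndots circus is overriding it per workload. A few lines in the pod template, with the trade-off that short names now need their full suffix:

```yaml
# Deployment pod template excerpt: stop fanning every lookup through five search suffixes
spec:
  template:
    spec:
      dnsPolicy: ClusterFirst
      dnsConfig:
        options:
          - name: ndots
            value: "2"
```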

## CSI Drivers and the Lie of Persistent State

“Kubernetes is great for stateful workloads now!”

That’s the lie they tell you at KubeCon. The reality is the Container Storage Interface (CSI). We’re using the AWS EBS CSI driver. When a node dies—which it does, because we’re using Spot Instances to “save money”—the EBS volume stays “attached” to the dead node.

The orchestrator tries to start the pod on a new node. The new node tries to attach the volume. AWS says, “No, this volume is already attached to i-0abc123.” The pod stays in `ContainerCreating` for eternity.

```text
$ kubectl describe pod checkout-db-0
Events:
  Type     Reason              Age                From                     Message
  ----     ------              ----               ----                     -------
  Warning  FailedAttachVolume  12m                attachdetach-controller  Multi-Attach error for volume "pvc-8d2f..." Volume is already used by pod...
  Warning  FailedMount         2m (x5 over 10m)   kubelet                  MountVolume.SetUp failed for volume "pvc-8d2f..." : rpc error: code = Internal desc = Could not attach volume...
```

I spent four hours at 2:00 AM manually detaching volumes via the AWS CLI because the “orchestrator” was stuck in a retry loop that it couldn’t break out of. This is the “automation” we were promised. We’ve traded manual server configuration for manual API cleanup. It’s not progress; it’s just a different flavor of suffering.
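
For posterity, this is the object you end up hunting for at 2:00 AM. The fields are `storage.k8s.io/v1`; the names here are abbreviated stand-ins for ours:

```yaml
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  name: csi-8d2f...                        # abbreviated; yours will be a long hash
spec:
  attacher: ebs.csi.aws.com
  nodeName: ip-10-0-42-11.ec2.internal     # the node that no longer exists
  source:
    persistentVolumeName: pvc-8d2f...
status:
  attached: true                           # a claim the controller refuses to let go of
```

Deleting it, after confirming in the AWS console that the volume really is free, is what finally lets the attach/detach controller move on.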

## Admission Controllers: The Gatekeepers of My Insomnia

Then there are the Admission Controllers. In v1.30.2, `ValidatingAdmissionPolicy` is supposed to make things easier. Instead, it’s just another place for logic to go to die.

We have a Mutating Admission Webhook that injects sidecars for logging. If that webhook takes more than 30 seconds to respond—say, because the logging service is under load—the entire cluster stops accepting new pods. You can’t even scale down because the calls to the API server are blocked by the very “policy” meant to keep the cluster safe.
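
The footgun, for reference, is two fields on the webhook registration. This is the shape of ours, give or take; the API group is `admissionregistration.k8s.io/v1` and the names are stand-ins:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: logging-sidecar-injector     # stand-in name
webhooks:
  - name: inject.logging.internal
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail      # reject the pod if the webhook is down or slow
    timeoutSeconds: 30       # the API server caps this at 30; the default is 10
    clientConfig:
      service:
        name: logging-injector
        namespace: logging
        path: /mutate
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
```

`failurePolicy: Ignore` has its own failure mode, pods sailing through without the sidecar, but at least the cluster keeps scheduling while the logging service melts.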

I had to disable the webhook manually by editing the API server manifest on the master nodes while the CTO was breathing down my neck on a Zoom call.

“Why is it taking so long?” he asked.
“Because I’m navigating a distributed system’s failure modes in a text editor over a high-latency SSH connection,” I didn’t say. I just grunted and typed `crictl ps | grep kube-apiserver`.

## The Cognitive Load of the “Plumbing”

Let’s be real: we aren’t building applications anymore. We’re building infrastructure to support the infrastructure.

To get a “Hello World” app running with high availability in our v1.30.2 cluster, you need:
1. A Deployment (with correctly indented resources).
2. A Service (to provide a stable IP).
3. An Ingress (to map the URL).
4. A HorizontalPodAutoscaler (to handle the load).
5. A PodDisruptionBudget (so the cluster autoscaler doesn’t kill all pods at once).
6. NetworkPolicies (so the DB isn’t exposed to the internet).
7. ServiceAccounts, Roles, and RoleBindings (for RBAC).
8. Secrets (which are just Base64 encoded strings, not actually secret).
9. ConfigMaps (for the environment variables).
10. A sidecar for metrics.
11. A sidecar for logs.
12. A sidecar for the service mesh.

That’s twelve distinct objects for one functional line of code. The cognitive load is staggering. We have developers who have been at the company for two years and still don’t know how their code actually gets to a CPU. They just “push to git” and pray the “orchestration” works. When it doesn’t, they call me.
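
And each of those twelve has its own dialect. Number five, the PodDisruptionBudget, is the one nobody writes until the cluster autoscaler drains three nodes at once, and it’s all of ten lines (`policy/v1`; values are ours):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb
  namespace: prod
spec:
  minAvailable: 1            # keep at least one checkout pod through voluntary evictions
  selector:
    matchLabels:
      app: checkout
```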

And I’m tired. I’m tired of the abstraction layers. I’m tired of the “thought leadership” that says we should move to a multi-cluster, multi-region mesh when we can’t even get CoreDNS to resolve internal hostnames consistently in a single namespace.

```text
# The final cry of a dying kubelet
$ journalctl -u kubelet -n 100 --no-pager
May 22 03:20:11 ip-10-0-42-11 kubelet[1204]: E0522 03:20:11.112 node_container_manager_linux.go:62] Failed to create container /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8d2f...: cgroups: res_parent_not_found_error
May 22 03:20:15 ip-10-0-42-11 kubelet[1204]: E0522 03:20:15.442 kubelet.go:2855] "Error getting node" err="node \"ip-10-0-42-11.ec2.internal\" not found"
May 22 03:20:20 ip-10-0-42-11 kubelet[1204]: F0522 03:20:20.991 kubelet.go:1400] Failed to initialize runtime: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused"
```

The containerd socket is gone. The node has effectively committed suicide. The orchestrator will try to replace it, but the underlying EBS volume is still locked, the CNI is still trying to route traffic to the dead IP, and the HPA is screaming for more replicas that it can’t place.

This is the reality of Kubernetes orchestration. It’s not a “vibrant ecosystem.” It’s a graveyard of complexity where we bury our technical debt in 2,000-line YAML files. We’ve built a system so complex that no single human can understand the entire stack. We’ve automated the easy stuff and made the hard stuff impossible.

I’m going to finish this coffee. I’m going to delete the stuck `VolumeAttachment` objects manually. I’m going to restart the etcd members one by one. And then I’m going to go home, delete Slack from my phone, and sleep until the next “breaking change” in v1.31.

If you’re a junior dev reading this, thinking about how “cool” it is to manage clusters: go learn COBOL. Go work on a mainframe. At least there, when the system fails, it has the decency to tell you why in a language that doesn’t involve fifteen nested layers of abstraction and a “community-driven” CNI plugin that hasn’t been updated in eight months. Kubernetes doesn’t want to help you. It wants to be fed. And right now, it’s eating my sanity.

Stay away from the “orchestration” trap. The “control plane” is just a fancy word for a black box that eats your weekends. If you see a YAML file longer than fifty lines, run. If someone mentions “service mesh” in a meeting, quit. There is no “seamless” transition. There is only the long, slow grind of fixing things that shouldn’t be broken in the first place.

Now, if you’ll excuse me, I have to go find out why my `kube-proxy` pods are stuck in `ImagePullBackOff` because the internal registry’s certificate expired while I was writing this. The orchestration never stops. It just finds new ways to fail.
