LAST SEEN TYPE REASON OBJECT MESSAGE
12s Warning BackOff pod/api-gateway-7f8d9b-x2k Back-off restarting failed container
4s Warning Unhealthy pod/auth-svc-66c4d-99z Liveness probe failed: HTTP probe failed with statuscode: 503
1s Normal Killing pod/payment-worker-88v Stopping container payment-worker
0s Warning EvictionThreshold node/ip-10-0-42-101.ec2.internal The node was low on resource: memory.
[124892.12] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod-a1...
[124892.15] Memory cgroup out of memory: Killed process 12409 (java) total-vm:12.4GB, anon-rss:2.1GB, file-rss:0B, shmem-rss:0B
[124892.20] etcdserver: failed to send out heartbeat on time (deadline exceeded for 142.12ms)
I haven’t slept in 72 hours. My eyes feel like they’ve been scrubbed with industrial-grade sandpaper. The “DevOps Lead”—a man who thinks YAML is a programming language and Docker is a “cloud OS”—just asked if we can “reboot the internet.” I didn’t answer. I just stared at the flickering cursor on my terminal.
We migrated to Kubernetes v1.29.2 last week. The “YAML engineers” promised it would be “efficient.” They lied. They treated the kernel like a black box and the control plane like a magic wand. They ignored every warning. Now, the cluster is a graveyard of crashed pods and timed-out heartbeats. This is how it happened. This is how you avoid it.
The Resource Limit Lie
The first domino fell because of a misunderstanding of how the Linux kernel handles memory. Our “architects” decided that setting token requests far below the limits was a great way to “save money.”
# THE CRIME
apiVersion: v1
kind: Pod
metadata:
  name: memory-hog
spec:
  containers:
    - name: app
      image: our-shitty-app:latest
      resources:
        requests:
          memory: "64Mi"   # token request so the bin-packer crams more pods per node
          cpu: "50m"
        limits:
          memory: "2Gi"
          cpu: "1"
      # Requests are a fraction of the limits. The scheduler is flying blind.
When you omit requests entirely, Kubernetes at least copies the limit into the request. Put a token 64Mi request next to a 2Gi limit, though, and the scheduler takes you at your word: it bin-packs on requests, not on what the app actually uses. We’re on v1.29.2, with everything running under cgroup v2, and none of that changes the bin-packing math. The scheduler packed 40 of these pods onto a node with 64GB of RAM because, on paper, they asked for less than 3GB combined.
The math doesn’t work. The kernel OOM killer doesn’t care about your “cloud-native” dreams. When actual usage blew past the requests and the node ran out of memory, the Kubelet tried to evict pods. But these were all “Burstable” QoS class pods (requests lower than limits), memory vanished faster than the eviction manager could react, and the kernel started reaping processes by oom_score.
The Kubernetes best practice is simple: Always set requests equal to limits for production workloads. This puts the pod in the “Guaranteed” QoS class. The oom_score_adj for a Guaranteed pod is -997. It is the last thing the kernel kills. If you don’t do this, you are telling the kernel that your application is disposable.
# THE FIX
resources:
  requests:
    memory: "2Gi"
    cpu: "1"
  limits:
    memory: "2Gi"
    cpu: "1"
The Kubelet in v1.29.2 is more sensitive to memory pressure. And if requests don’t equal limits, the Topology Manager has nothing to align: the CPU and Memory Managers’ static policies only hand out exclusive, NUMA-aligned resources to Guaranteed pods with whole-number CPU requests. You get context switching. You get latency. You get fired.
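Trust, but verify. A quick sanity check, assuming the memory-hog pod above and shell access to the node it landed on:
# Which QoS class did the API server actually assign?
kubectl get pod memory-hog -o jsonpath='{.status.qosClass}{"\n"}'
# Expected: Guaranteed

# On the node, check what the kernel thinks of the container's main process.
# Get the PID from `crictl inspect <container-id>`, then:
cat /proc/<pid>/oom_score_adj
# -997 for Guaranteed, 1000 for BestEffort, somewhere in between for Burstable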
Why Your Liveness Probes are Killing the Database
While the nodes were screaming, the “YAML engineers” made it worse. They configured liveness probes that queried the database.
# THE SUICIDE NOTE
livenessProbe:
  httpGet:
    path: /health-check-that-queries-db
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
The database was struggling because of the connection churn from the OOM-looping pods. It slowed down. The liveness probe took 2 seconds instead of 200ms. The Kubelet, being a dutiful soldier, decided the pod was dead. It killed the pod. It started a new one. The new pod immediately tried to connect to the database to run its “startup logic.”
Multiply this by 500 pods. You just launched a self-inflicted DDoS attack against your own RDS instance. The database CPU hit 100%. Every single pod in the cluster failed its liveness probe simultaneously.
The best approach is to decouple your probes. A liveness probe should only check if the process is alive. It should check a local, in-memory flag. It should never, under any circumstances, cross a network boundary. Use startupProbes for long-running migrations and readinessProbes to gate traffic. If the DB is down, the pod shouldn’t be killed; it should just stop receiving traffic.
# THE STABLE CONFIG
startupProbe:
  httpGet:
    path: /ready
    port: 8080
  failureThreshold: 30   # up to 5 minutes to come up before liveness takes over
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /live # Just returns 200 OK from memory
    port: 8080
  periodSeconds: 20
readinessProbe:
  httpGet:
    path: /ready # May check dependencies; failing only removes the pod from the Service
    port: 8080
  periodSeconds: 10
In v1.29, the Kubelet has improved internal locking for probes. But if your probe logic is garbage, the Kubelet just executes garbage faster.
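If you want to know whether the Kubelet is the thing doing the killing, the restart counters and the event stream will tell you before your APM does. A rough check:
# Pods being serially executed by their own liveness probes sort to the bottom
kubectl get pods -A --sort-by='.status.containerStatuses[0].restartCount'

# Probe failures show up as Unhealthy events, usually followed by Killing
kubectl get events -A --field-selector reason=Unhealthy --sort-by='.lastTimestamp'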
Pod Disruption Budgets: The Self-Inflicted Denial of Service
During the height of the outage, I tried to drain a failing node.
kubectl drain ip-10-0-42-101.ec2.internal --ignore-daemonsets --delete-emptydir-data
The command hung.
evicting pod "auth-svc-66c4d-99z"
error when evicting pods/"auth-svc-66c4d-99z" (assigned to node "ip-10-0-42-101.ec2.internal"): Cannot evict pod as it would violate the pod's disruption budget.
Some genius had set a PodDisruptionBudget with minAvailable: 100%.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: auth-pdb
spec:
  minAvailable: "100%" # This is a hostage situation
  selector:
    matchLabels:
      app: auth-svc
The app had 3 replicas. One was already down due to the OOM issues. The PDB prevented the eviction of the remaining two. I couldn’t drain the node. I couldn’t patch the underlying kernel issue. I was locked out of my own infrastructure by a YAML file.
The best practice is maxUnavailable: 1. This ensures that at least one pod can always be evicted, allowing the cluster to self-heal and nodes to be rotated. If your app can’t handle one instance being down, your app isn’t “cloud-native”; it’s a legacy monolith in a trench coat.
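The corrected PDB is boring, which is the point. A sketch for the auth-svc above:
# THE FIX
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: auth-pdb
spec:
  maxUnavailable: 1 # a drain can always make progress
  selector:
    matchLabels:
      app: auth-svc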
The ‘Latest’ Tag Suicide Pact
As I fought to stabilize the API server, a new set of errors appeared in the logs.
ErrImagePull: rpc error: code = Unknown desc = failed to pull and unpack image "our-repo/payment-worker:latest": failed to resolve reference "our-repo/payment-worker:latest": pull access denied
The CI/CD pipeline had just pushed a broken image to the latest tag. Because the pods were crashing and restarting, they pulled the “latest” version. Half the cluster was running v1.0.4, and the other half was trying to run a broken v1.1.0 that someone had pushed five minutes ago.
Never use latest. It is non-deterministic. It is the antithesis of immutable infrastructure. containerd will happily cache layers it has already pulled, but that doesn’t help you when the tag itself is a moving target.
Use SHA-256 digests or at least specific semantic versions.
# THE ONLY WAY
image: our-repo/payment-worker:v1.0.4@sha256:7f8d9b... # the digest is what gets pulled; the tag is just for humans
If we had used immutable tags, the crashing pods would have at least restarted with the previously working code. Instead, we were debugging a production outage and a broken deployment simultaneously.
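Auditing for the tag is one line of jsonpath, and the digest you should have pinned can be read off any pod that is still running the good build. A rough sketch:
# Find every container in the cluster still riding :latest
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}' | grep ':latest'

# Read the digest a known-good pod is actually running, then pin it in the manifest
kubectl get pod payment-worker-88v -o jsonpath='{.status.containerStatuses[0].imageID}{"\n"}'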
Etcd and the Disk Latency Guillotine
The control plane finally died at 3:00 AM. kubectl commands returned Error from server (Timeout).
I checked the etcd logs. wal: sync duration of 500ms, expected less than 100ms.
The “YAML engineers” had provisioned the control plane nodes with standard GP3 EBS volumes but didn’t bother to check the IOPS. They also decided to run a logging agent on the same nodes that was writing 50GB of JSON logs to the same disk as the etcd WAL (Write Ahead Log).
Etcd is the brain of Kubernetes. It is extremely sensitive to disk latency. When the WAL write takes too long, etcd misses its heartbeat. The cluster loses its leader. It triggers an election. During the election, the API server is read-only or completely unresponsive.
In v1.29, API Priority and Fairness (APF) is GA and the API server is better at shedding load, but it can’t save you if the underlying data store is stuck in an I/O wait state.
I had to manually SSH into the master nodes, kill the logging agent, and move the etcd data directory to a dedicated NVMe drive.
# Emergency surgery
systemctl stop etcd
mv /var/lib/etcd /var/lib/etcd_old
mkdir /var/lib/etcd
mount /dev/nvme1n1 /var/lib/etcd
rsync -av /var/lib/etcd_old/ /var/lib/etcd/
systemctl start etcd
If you are running your own control plane, etcd must have its own dedicated, low-latency disk. This isn’t a suggestion. It is a requirement for survival.
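Before you trust any disk with the WAL, benchmark it the way etcd uses it: small sequential writes with an fdatasync after every one. A sketch with fio, against etcd’s documented guidance of a 99th-percentile fdatasync under roughly 10ms:
mkdir -p /var/lib/etcd/fio-test
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd/fio-test --size=22m --bs=2300 --name=etcd-wal-probe
# Check the fsync/fdatasync percentiles in the output; p99 should be well under 10ms
rm -rf /var/lib/etcd/fio-test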
Admission Webhooks: The Silent Killer
The final blow came from a “Security Admission Controller” that a third-party vendor had installed. It was a validating webhook. Every time a pod was created, the API server sent a JSON payload to this webhook to “validate” it.
The webhook was running inside the cluster.
The cluster was failing. The webhook pods were OOMKilled. The API server was configured with failurePolicy: Fail.
# THE TRAP
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: security-check # name illustrative
webhooks:
  - name: security-check.example.com
    failurePolicy: Fail # The cluster will now die if this pod dies
    rules:
      - operations: ["CREATE"]
        apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]
    # clientConfig, sideEffects, admissionReviewVersions omitted
Since the webhook was down, the API server refused to create any new pods. I couldn’t even deploy a fix because the API server would try to validate the fix against the dead webhook and fail.
I had to manually edit the ValidatingWebhookConfiguration to set the failurePolicy to Ignore just to get the cluster to breathe again.
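If you ever have to perform the same surgery, a one-line patch beats fighting kubectl edit against an API server that keeps timing out. A sketch, assuming the configuration is named security-check as in the manifest above:
kubectl patch validatingwebhookconfiguration security-check --type=json \
  -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'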
Admission webhooks must have a tight timeoutSeconds (the default is 10 seconds, which is still too long) and a sensible failure policy. If the webhook isn’t critical for life-safety, set it to Ignore. If it is critical, run it outside the cluster or in a highly available, dedicated pool that doesn’t share resources with the apps it’s validating.
The Kubelet Eviction Death Spiral
By hour 60, I was seeing nodes marked as NotReady.
kubectl describe node showed MemoryPressure was true.
The Kubelet has a set of hard eviction thresholds; the default for memory is memory.available < 100Mi. In v1.29.2, the Kubelet’s interaction with cgroup v2 and systemd-oomd can be complex. If you haven’t reserved anything for the system, the kernel will happily kill critical daemons before it gets around to the rogue Java app that’s actually eating the memory.
We hadn’t set --eviction-hard. We hadn’t set --system-reserved or --kube-reserved.
The Kubelet was fighting the OS for the last 500MB of RAM. The OS won. The Kubelet was killed. The node went NotReady. The scheduler saw the node was gone and tried to move all its pods to the other nodes, which were already at 90% capacity.
This is the “Thundering Herd” of Kubernetes. One node dies, its load kills the next node, and so on, until you have a data center full of expensive heaters that don’t do any work.
You must reserve resources for the system.
# kubelet-config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:
  cpu: "500m"
  memory: "1Gi"
kubeReserved:
  cpu: "500m"
  memory: "1Gi"
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "10%"
This ensures the Kubelet and the underlying Linux kernel have enough breathing room to actually perform the evictions and housekeeping necessary to keep the node alive.
Networking: Cilium and the BPF Map Overflow
We use Cilium for CNI. It’s powerful. It’s also complex. The “YAML engineers” had created a NetworkPolicy for every single microservice, which is good. But they used fine-grained L7 rules (HTTP path filtering) for everything.
In a cluster with 10,000+ pods, this exploded the BPF maps.
level=warning msg="BPF map is full" subsystem=bpf-map-manager
The networking stack started dropping packets. Not all packets—just some. The most frustrating kind of failure. A 5% packet loss that looks like application latency.
The “YAML engineers” spent 12 hours debugging their code, blaming “slow Python libraries.” It wasn’t the code. It was the fact that the kernel’s BPF maps were overflowing because we had too many complex policies for the allocated map sizes.
We had to tune the Cilium configuration to increase the bpf-map-dynamic-size-ratio.
# cilium-config
bpf-map-dynamic-size-ratio: "0.0055" # default is 0.0025, i.e. maps sized to 0.25% of node memory
And we had to tell the engineers that they don’t need L7 filtering for a service that only talks to one other service on a single port. Use L4 policies where possible. It’s faster, it’s simpler, and it doesn’t break the kernel.
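For the common case of one service talking to one other service on one port, a plain L4 NetworkPolicy does the job and never touches the L7 proxy. A sketch with illustrative names and port:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-payment-worker-to-db # illustrative
  namespace: payments # illustrative
spec:
  podSelector:
    matchLabels:
      app: payments-db # illustrative
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: payment-worker
      ports:
        - protocol: TCP
          port: 5432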
The Aftermath
The cluster is back. I’ve deleted the latest tags. I’ve set resource requests and limits. I’ve fixed the PDBs. I’ve moved etcd to its own disks.
The “YAML engineers” are complaining that I’ve “restricted their creativity” by implementing these “arbitrary rules.”
They don’t understand that the kernel isn’t arbitrary. The kernel is a cold, hard reality. Kubernetes is just a way to organize that reality, but if you ignore the fundamentals—memory, CPU, I/O, and networking—it will fail you.
This post-mortem isn’t just a record of a failure. It’s a warning. v1.29.2 is a powerful tool, but it requires respect. If you treat it like a toy, it will burn your house down.
I’m going to sleep now. If the pager goes off because someone changed a failurePolicy back to Fail, I’m not answering. I’m deleting their namespace. That’s my new “kubernetes best” practice.
Total uptime: 0.01% this week.
Total coffee consumed: 42 liters.
Total respect for “YAML engineers”: 0.
Go fix your manifests before I do it for you. With kubectl delete.