It’s 03:14 AM. The pager is screaming, and the cluster is a graveyard of CrashLoopBackOffs. Here is exactly how we got here and why your “standard” setup is a liability.
I’m staring at a terminal window that looks like a crime scene. The junior dev—let’s call him Dave, because it’s always a Dave—pushed a “minor” update to the checkout service. He said it was just a dependency bump. Now, the etcd cluster is gasping for air, the API server is timing out, and the load balancer is throwing 503s like it’s a sport.
This isn’t a “learning opportunity.” This is a failure of engineering discipline. You followed a Medium tutorial written by someone who has never managed a production cluster under load. You thought Kubernetes would “self-heal” your way out of bad architecture. It won’t. It will just automate the destruction of your infrastructure at a scale you can’t imagine.
$ kubectl get pods -n prod-checkout
NAME                               READY   STATUS             RESTARTS      AGE
checkout-api-7f8d9b6c5-4x2z1       0/1     CrashLoopBackOff   42 (1m ago)   3h
checkout-api-7f8d9b6c5-9p0l2       0/1     OOMKilled          12            3h
checkout-worker-5dd5756b68-v9q2w   0/1     Pending            0             5m

$ kubectl get events --sort-by='.lastTimestamp' -n prod-checkout
3m12s   Warning   FailedScheduling   pod/checkout-worker-5dd5756b68-v9q2w   0/3 nodes are available: 3 Insufficient memory.
2m45s   Warning   BackOff            pod/checkout-api-7f8d9b6c5-4x2z1       Back-off restarting failed container
1m10s   Normal    Killing            pod/checkout-api-7f8d9b6c5-9p0l2       Stopping container checkout-api

$ kubectl describe node gke-prod-pool-1-3a2b
Events:
  Type      Reason      Age   From      Message
  ----      ------      ----  ----      -------
  Warning   SystemOOM   5m    kubelet   System OOM encountered, victim process: checkout-api
The cluster is dead. The best practices you ignored are now the reason I’m on my fourth cup of lukewarm sludge. Let’s perform the autopsy.
Your Resource Limits are a Joke
What the tutorials tell you: “Just set some limits so your pods don’t run away with the node. Or don’t! Kubernetes is smart enough to balance it.”
The cold, hard reality of production: If you don’t define requests and limits with surgical precision, the kube-scheduler is flying blind. Dave’s “minor change” included a new library that pre-allocates 2GB of heap on startup. Because he didn’t update the manifest, the pod started with a request of 256MB. The scheduler saw a node with 512MB free and said, “Yeah, that fits.”
Then the pod actually tried to start. It hit the cgroup limit, the kernel OOM killer woke up, and it killed the process. But it didn’t just kill the pod; because of the way cgroups v2 handles memory pressure in Kubernetes v1.29 and v1.30, the entire node started thrashing. We ended up with a “noisy neighbor” situation where the checkout service strangled the kubelet itself.
In a real environment, you use the Guaranteed Quality of Service (QoS) class for critical workloads. That means requests == limits. No overcommitting. No “burstable” nonsense for the core API.
# This is not a suggestion. This is a survival requirement.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
  namespace: prod-checkout
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: checkout-api
spec:
  containers:
    - name: checkout-api
      image: company/checkout:v1.2.3
      resources:
        requests:
          memory: "2Gi"
          cpu: "1000m"
        limits:
          memory: "2Gi"
          cpu: "1000m" # Guaranteed QoS class
If you don’t enforce ResourceQuotas at the namespace level, you are one git push away from a total cluster blackout. Dave didn’t know the node capacity. The ResourceQuota would have rejected his deployment at the API level before it ever hit the scheduler. Instead, we got a cascading failure.
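One side effect worth knowing: once a quota covers requests and limits, the API server rejects any pod that omits them entirely. A LimitRange can backfill namespace defaults so a sloppy manifest fails safe instead of failing to schedule. This is a sketch; the numbers are assumptions you must size to your actual workloads:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: prod-checkout
spec:
  limits:
    - type: Container
      # Applied when a container omits resources.requests
      defaultRequest:
        cpu: 500m
        memory: 512Mi
      # Applied when a container omits resources.limits
      default:
        cpu: "1"
        memory: 1Gi
```

Defaults are a floor, not an excuse: critical services still get explicit, matching requests and limits in their own manifests.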
The Namespace is Not a Security Boundary
What the tutorials tell you: “Use namespaces to organize your apps. It keeps things clean and secure.”
The cold, hard reality of production: A namespace is just a prefix in the API server. By default, every pod in your cluster can talk to every other pod. Dave’s “minor change” included a debugging tool that accidentally exposed a metrics endpoint on all interfaces. Because we didn’t have NetworkPolicies enforced, a compromised pod in the dev-sandbox namespace (which some genius linked to the same cluster) could have curled the internal checkout DB.
In Kubernetes v1.30, we have better support for AdminNetworkPolicy, but most of you are still running wide-open flat networks. You’re essentially running a 1990s-style LAN and wondering why you’re getting lateral movement during a breach.
# Default Deny All - Start from zero, you cowards.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: prod-checkout
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
# Only allow the API to talk to the DB
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-db-access
  namespace: prod-checkout
spec:
  podSelector:
    matchLabels:
      app: checkout-api
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: checkout-db
      ports:
        - protocol: TCP
          port: 5432
Without this, your “blast radius” is the entire cluster. You’re one RCE away from losing the keys to the kingdom. We spent two hours of this outage just verifying that the CrashLoopBackOff wasn’t an active exfiltration attempt because our observability into internal traffic is garbage.
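One wrinkle a default-deny egress policy creates: it also blocks DNS, so every pod in the namespace starts failing name resolution before it fails anything interesting. You need an explicit allowance for the cluster DNS service. A sketch, assuming kube-dns lives in kube-system as usual:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: prod-checkout
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        # kube-system carries this well-known label on v1.21+
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

If you skip this, the first symptom of your shiny new zero-trust network is that everything times out on lookup, and you will waste an hour blaming the app.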
Your Liveness Probes are Killing Your Availability
What the tutorials tell you: “Use liveness probes to restart your app if it hangs. It’s the magic ‘fix-it’ button.”
The cold, hard reality of production: Liveness probes are a loaded gun pointed at your foot. Dave set a liveness probe to check the database connection. When the database got slow due to the increased load, the liveness probe failed. Kubernetes, being the obedient soldier it is, killed the pod.
This created a “death spiral.” The remaining pods took on the extra traffic, got even slower, failed their liveness probes, and were killed. Within 90 seconds, the entire deployment was down.
You use readinessProbes to control traffic flow. You use livenessProbes only to catch hard deadlocks that the application cannot recover from. And for the love of all that is holy, use startupProbes for apps that take a long time to initialize so you don’t kill them while they’re still loading their 500MB of Java classes.
readinessProbe:
  httpGet:
    path: /healthz/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /healthz/live # This should NOT check the DB. Only the process state.
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
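For the slow-booting services mentioned above, add a startupProbe so the liveness probe can’t fire while the app is still loading. While the startup probe is failing, liveness and readiness probes are suppressed. Path and thresholds here are assumptions; tune the budget to your actual boot time:

```yaml
startupProbe:
  httpGet:
    path: /healthz/live
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # 30 * 10s = up to 5 minutes to finish booting
```

Once the startup probe succeeds, the normal liveness and readiness cadence takes over.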
If your liveness probe depends on an external dependency, you have built a distributed suicide machine.
etcd is Not a Magic Database
What the tutorials tell you: “Kubernetes stores its state in etcd. It’s highly available and consistent.”
The cold, hard reality of production: etcd is a finicky beast that demands high-performance IOPS and low latency. Dave’s “minor change” caused a massive spike in write churn because he was using a controller that aggressively updated labels on thousands of pods every second. This flooded the kube-apiserver with write requests.
The etcd write-ahead log (WAL) couldn’t keep up because we were running on standard persistent disks instead of local SSDs. Disk latency spiked to 50ms. Raft consensus started failing. The nodes started flapping between Ready and NotReady because the kubelet couldn’t update its heartbeat in time.
# How I knew we were screwed
$ kubectl get componentstatuses
NAME     STATUS      MESSAGE                                                            ERROR
etcd-0   Unhealthy   {"health":"false","reason":"remote error: tls: internal error"}
When etcd suffers, the whole world burns. You need to monitor etcd_disk_wal_fsync_duration_seconds_bucket. If that p99 goes over 10ms, you are in the danger zone. We were at 150ms. The right way to handle this is to isolate etcd on its own dedicated nodes with NVMe drives, but no, you wanted to save money by running it on the same general-purpose instances as your Jenkins runners.
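Assuming you scrape the etcd metrics endpoint with Prometheus, the p99 fsync latency is one query away, and it belongs in an alert, not a dashboard you look at after the fire starts:

```promql
# p99 WAL fsync latency over a 5-minute window; alert when it exceeds 0.01 (10ms)
histogram_quantile(0.99,
  rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
```

Watch etcd_disk_backend_commit_duration_seconds_bucket the same way; the two together tell you whether the disk or the backend is the bottleneck.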
The Fallacy of the “Stateless” Application
What the tutorials tell you: “Kubernetes is for stateless apps! Just scale them up and down!”
The cold, hard reality of production: Nothing is stateless. Everything has state—it’s just a question of where you’re hiding it. In our case, Dave’s app was “stateless” but relied on a PersistentVolume (PV) for a legacy file-processing module.
When the pods started crashing, the ReadWriteOnce (RWO) volume was stuck in a “Multi-Attach Error.” The old pod was dead but the cloud provider hadn’t released the volume attachment yet. The new pod couldn’t start because it couldn’t mount the disk.
$ kubectl describe pod checkout-api-7f8d9b6c5-4x2z1
Events:
  Warning   FailedMount   3m   kubelet   Unable to attach or mount volumes: timed out waiting for the condition
We spent 45 minutes manually detaching AWS EBS volumes in the console because the attach/detach controller in the controller-manager was backed up due to the etcd latency issues. If you’re going to use volumes, you need to understand the limitations of your CSI (Container Storage Interface) driver. You can’t just wish away the laws of physics and distributed systems.
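Before you go clicking around the cloud console, check what the attach/detach controller itself thinks is stuck. The attachment state lives in VolumeAttachment objects; the name below is a placeholder, yours will be a CSI hash:

```
# List attachments the attach/detach controller still believes exist
kubectl get volumeattachments

# Inspect one; .status.detachError tells you why the detach is stuck
kubectl describe volumeattachment csi-<hash>
```

If the detachError points at the cloud provider, that is when manual intervention in the console is justified, and not before.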
SecurityContext is Not Optional
What the tutorials tell you: “Just run the container. It works on my machine!”
The cold, hard reality of production: Running as root is a death sentence. Dave’s “minor change” used a base image that defaulted to the root user. When the pod was compromised—or even just when it crashed—it had permissions it didn’t need.
In v1.29+, Pod Security Admission is the standard. If you aren’t enforcing the restricted profile, you’re failing. I found three pods running with privileged: true because someone wanted to “debug a network issue” six months ago and never changed it back. That’s how you get container escapes. That’s how a “minor change” becomes a headline in the Wall Street Journal.
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
  containers:
    - name: checkout-api
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        privileged: false
        readOnlyRootFilesystem: true
If your containers don’t have a readOnlyRootFilesystem, you’re inviting attackers to install their own toolsets. Dave’s app tried to write a temp file to /app/config and failed because I finally locked it down. That’s why it was crashing. He should have been using an emptyDir for temporary storage, but he was lazy.
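A sketch of the fix Dave should have shipped: keep the root filesystem read-only and mount an emptyDir over the one path the app insists on writing to. The mount path and size limit here are assumptions:

```yaml
spec:
  containers:
    - name: checkout-api
      securityContext:
        readOnlyRootFilesystem: true
      volumeMounts:
        - name: scratch
          mountPath: /app/config   # writable scratch, gone when the pod dies
  volumes:
    - name: scratch
      emptyDir:
        sizeLimit: 256Mi          # cap it, or a log loop fills the node disk
```

The sizeLimit matters: an unbounded emptyDir is just node-disk exhaustion with extra steps.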
The “Kubernetes Best” Practices are Written in Blood
You don’t implement these things because they’re “best.” You implement them because the alternative is what I’m doing right now: sitting in a cold server room, smelling like old coffee and regret, manually deleting finalizers from stuck ConfigMaps.
Following best practices isn’t about using the newest features in v1.30; it’s about respecting the complexity of the system. It’s about realizing that Kubernetes is a platform for building platforms, not a place to dump your unoptimized Docker images and hope for the best.
We’ve spent the last three days dealing with “toil”—manual, repetitive tasks that could have been avoided with proper automation and policy enforcement. We had “drift” between our staging and production environments because someone manually edited a deployment using kubectl edit instead of updating the Helm chart.
# Finding the drift that killed us
$ kubectl get deployment checkout-api -o yaml > live.yaml
$ helm get manifest checkout-release > helm.yaml
$ diff live.yaml helm.yaml
< image: company/checkout:v1.2.3-debug-DONT-PUSH
---
> image: company/checkout:v1.2.2
There it is. Dave pushed a debug image directly to the cluster. No CI/CD pipeline check. No admission controller to stop images with “DONT-PUSH” in the tag. Just raw, unadulterated incompetence.
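Since ValidatingAdmissionPolicy went GA in v1.30, this check doesn’t even need a webhook: a CEL expression can reject the tag at admission time. Names and match scope below are assumptions; treat it as a sketch, not a drop-in policy:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: block-debug-images
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments"]
  validations:
    - expression: >-
        object.spec.template.spec.containers.all(c,
          !c.image.contains('DONT-PUSH'))
      message: "Debug images are not allowed in this cluster."
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: block-debug-images-binding
spec:
  policyName: block-debug-images
  validationActions: ["Deny"]
```

Ten lines of CEL would have turned Dave’s push into a 403 instead of a 3 AM page.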
Checklist for the Uninitiated
If you want to avoid being the reason I’m awake at 3 AM, you will follow this checklist. This isn’t a suggestion. It’s a mandate.
- QoS or GTFO: Every production pod must have requests and limits defined. Critical services must use the Guaranteed class (requests == limits).
- Default Deny: Implement a NetworkPolicy that denies all traffic by default. Explicitly whitelist every single connection. If you don’t know what your app talks to, you don’t know your app.
- Probes are for Health, Not Dependencies: Liveness probes check the process. Readiness probes check the ability to serve traffic. Never, ever point a liveness probe at a database.
- No Root, Ever: Use PodSecurityContext to run as a non-root user. Set allowPrivilegeEscalation: false. If your app “needs” root, your app is broken.
- Monitor etcd Like Your Life Depends On It: Because it does. Watch your fsync latency. Use dedicated, fast storage.
- Immutable Infrastructure: If you use kubectl edit on a production resource, I will find you. Everything goes through Git. Use a tool like ArgoCD or Flux to detect and remediate drift automatically.
- Pod Disruption Budgets (PDBs): If you’re running more than one replica (and you should be), you need a PDB. This prevents the cluster autoscaler or a node upgrade from taking down all your pods at once.
- TerminationGracePeriodSeconds: Give your app enough time to shut down gracefully. If your app takes 45 seconds to drain connections, don’t leave the default at 30.
- Use Admission Controllers: Implement ValidatingAdmissionWebhooks to reject any manifest that doesn’t meet these standards. Don’t trust humans. Humans are the problem.
- Version Pinning: Pin your images to a digest (SHA), not a tag. v1.2.3 can be overwritten. A SHA is forever.
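For the PDB item on that list, a minimal example for the checkout service (labels and counts are assumptions) looks like this:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-api-pdb
  namespace: prod-checkout
spec:
  minAvailable: 2          # voluntary disruptions may never drop below 2 pods
  selector:
    matchLabels:
      app: checkout-api
```

Note the word voluntary: a PDB gates drains, upgrades, and autoscaler evictions. It does nothing against an OOM kill, which is exactly why the resource limits section above comes first.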
The sun is coming up. The cluster is stable, mostly because I’ve manually scaled the checkout service to zero to let etcd recover. Dave is going to have a very long meeting with me in two hours.
Kubernetes is a powerful tool, but in the hands of the “standard” user, it’s just a very expensive way to fail. Go back to basics. Fix your manifests. Stop bikeshedding about which service mesh to use and start worrying about your cgroup limits.
Now, if you’ll excuse me, I need to find a place to sleep that doesn’t vibrate at the frequency of a server rack.