It is 4:12 AM. My eyes feel like someone rubbed them with industrial-grade sandpaper. The third pot of coffee tastes like battery acid and failed dreams. I’ve been staring at a Grafana dashboard that looks like a heart monitor of a patient in mid-arrest for the last three days.
Why? Because a “senior” developer decided that Friday at 4:45 PM was the perfect time to push a Helm chart update using Helm v3.14 to our Kubernetes 1.30 production cluster. They called it a “minor tweak.” I call it a digital pipe bomb.
If you are reading this, you are likely either an SRE looking for a reason to quit or a developer who just realized their “simple” YAML change is currently costing the company $15,000 a minute in lost revenue. Sit down. Shut up. Read this. I’m going to walk you through the anatomy of this cluster collapse so you never have to feel the vibration of a PagerDuty alert at 3:14 AM ever again.
The Incident Context: The Friday Afternoon Pipe Bomb
The disaster started with a single helm upgrade --install. We’re running Kubernetes 1.30, the latest stable, thinking the new features would save us from ourselves. We were wrong. The deployment was for a “critical” microservice—let’s call it payment-gateway-v2.
The developer didn’t check the existing state. They didn’t run a helm diff. They just fired the command and went to happy hour. Within six minutes, the API server started lagging. Within ten, the nodes began reporting NotReady.
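A helm diff dry run takes about thirty seconds and would have shown the blast radius before anything shipped. Here is a sketch using the helm-diff plugin; the release name matches this story, but the chart path and values file are assumptions:

```shell
# Install the helm-diff plugin once (assumes outbound access to GitHub)
helm plugin install https://github.com/databus23/helm-diff

# Preview exactly what the upgrade would change -- no cluster state is modified.
# Chart path and values file below are illustrative.
helm diff upgrade payment-gateway-v2 ./charts/payment-gateway \
  --namespace production \
  --values values-production.yaml
```

If the diff shows anything you didn't expect, you stop. That's the whole discipline.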
Here is what the carnage looked like on the ground. This is the raw output from kubectl get events --all-namespaces --sort-by='.lastTimestamp' shortly after the first wave of failures:
LAST SEEN   TYPE      REASON             OBJECT                                   MESSAGE
12s         Warning   Unhealthy          pod/payment-gateway-v2-7489cf8d-abcde    Readiness probe failed: HTTP probe failed with statuscode: 503
8s          Warning   BackOff            pod/payment-gateway-v2-7489cf8d-abcde    Back-off restarting failed container
5s          Normal    Scheduled          pod/payment-gateway-v2-7489cf8d-fghij    Successfully assigned payment-gateway-v2 to node-04
2s          Warning   FailedScheduling   pod/payment-gateway-v2-7489cf8d-klmno    0/12 nodes are available: 12 Insufficient cpu, 12 Insufficient memory.
1s          Warning   OOMKilling         node/node-04                             System OOM encountered, victim process: kubelet
The cluster wasn’t just failing; it was eating itself. The scheduler was trying to cram pods into nodes that were already gasping for air. This is what happens when you ignore Kubernetes best practices because you think you’re smarter than the scheduler. You aren’t.
Resource Limits and the OOMKiller: The Silent Executioner
The first domino to fall was the lack of defined resource requests and limits. In Kubernetes 1.30 the scheduler is more efficient, but it’s not psychic. If you don’t tell it how much memory a pod needs, it treats the pod as BestEffort and assumes it needs nothing at all.
The payment-gateway-v2 pod had no resources block. It started up, saw 64GB of RAM on the node, and decided to cache the entire production database in-memory. The Linux OOM (Out Of Memory) Killer didn’t just kill the pod; it got confused by the rapid memory pressure and started killing critical system processes, including the kubelet and the CNI plugin.
When the kubelet dies, the node goes NotReady. When the node goes NotReady, the control plane tries to reschedule those pods onto other nodes. Those nodes were already at 90% capacity. It was a circular firing squad.
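For contrast, here is roughly what the missing resources block should have looked like on the payment-gateway-v2 container. The numbers are illustrative; size them from real usage metrics, not guesses:

```yaml
# Illustrative values -- derive these from observed usage, not optimism
resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi
```

Setting a memory limit means the OOM killer targets the greedy container instead of taking the kubelet down with it.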
To prevent this, you need a ResourceQuota in every namespace. No exceptions. If a developer tries to deploy a pod without limits, the API server should slap their hand away.
The Fix: ResourceQuota Manifest
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
  namespace: production
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 100Gi
    limits.cpu: "40"
    limits.memory: 200Gi
    pods: "50"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - default:
      cpu: 500m
      memory: 512Mi
    defaultRequest:
      cpu: 250m
      memory: 256Mi
    type: Container
If we had this in place, the “minor tweak” would have failed at the CI/CD stage. Instead, I spent four hours manually killing zombie processes on bare-metal nodes. Use LimitRange to force a default. It’s the only way to survive the “I forgot to add limits” excuse.
Probes that Lie: The Liveness Probe Death Spiral
Once we got the nodes back online, the next wave of the nightmare began. The developer had configured a livenessProbe that pointed to the same /healthz endpoint as the readinessProbe.
Here is the kubectl describe pod output for one of the failing pods:
Containers:
  gateway:
    Image:          our-registry.io/payment-gateway:v2.1.4
    Port:           8080/TCP
    Liveness:       http-get http://:8080/healthz delay=5s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:8080/healthz delay=5s timeout=1s period=10s #success=1 #failure=3
    State:          Running
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Mon, 14 Oct 2024 02:10:15 +0000
      Finished:     Mon, 14 Oct 2024 02:12:45 +0000
    Ready:          False
    Restart Count:  14
Exit code 137. That’s a SIGKILL. The pod was being murdered by the kubelet.
Why? Because the /healthz endpoint was checking the database connection. The database was under heavy load because of the previous node failures. The response took 1.1 seconds. The timeoutSeconds was set to 1.
The readinessProbe failed, so the pod was removed from the Service load balancer. That’s fine. But the livenessProbe also failed. Kubernetes thought the pod was dead, so it killed it and started a new one. This created a “thundering herd” effect. Every time a pod started, it tried to initialize, hit the database, timed out, and got killed. We were spending more CPU cycles on container startup than on actual traffic.
The Fix: Intelligent Probes
Stop using the same endpoint for both. A liveness probe should fail only when the process is actually unrecoverable (e.g., a deadlock). Readiness is for managing traffic flow. In K8s 1.30 you should also be using a startupProbe for legacy apps that take a long time to warm up.
livenessProbe:
  httpGet:
    path: /live
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
  failureThreshold: 5    # Be gentle
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  successThreshold: 2    # Ensure it's actually stable
startupProbe:
  httpGet:
    path: /init
    port: 8080
  failureThreshold: 30
  periodSeconds: 10      # 30 x 10s = up to 5 minutes to start if needed
The RBAC Nightmare: How a ServiceAccount Nuked a Namespace
While I was fighting the OOMKiller, a junior dev tried to “help” by running a cleanup script. This script used a ServiceAccount that had been granted cluster-admin privileges six months ago because “we were in a rush and couldn’t figure out the specific permissions.”
The script had a bug. It was supposed to run kubectl delete pod -l app=old-version, but due to a shell scripting error involving an empty variable, it effectively executed something equivalent to a broad delete across the namespace.
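The failure mode is worth spelling out, because it is a classic. Here is a hypothetical reconstruction (the function and variable names are illustrative, not the real script): with an empty selector variable, `kubectl delete pod -l $SELECTOR` expands to `kubectl delete pod -l` and the delete loses its scope. The guard below refuses to build the command at all:

```shell
#!/bin/sh
# Hypothetical reconstruction of the cleanup-script bug.
# Guarded wrapper: refuse to build a delete command without a selector.
safe_delete() {
  selector="$1"
  if [ -z "$selector" ]; then
    echo "refusing to delete without a label selector" >&2
    return 1
  fi
  # Dry run: print the command instead of executing it
  echo "kubectl delete pod -l $selector"
}

safe_delete "app=old-version"       # scoped command is printed
safe_delete "" || echo "blocked"    # empty selector is rejected
```

`set -u` at the top of every script and `${VAR:?}` expansions give you the same protection for free. Quoting alone would not have saved this one; validation would have.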
Because the ServiceAccount was over-privileged, it didn’t just delete pods. It started deleting ConfigMaps, Secrets, and eventually, the PersistentVolumeClaims.
The etcd database—the brain of your cluster—started screaming. We saw etcd_server_slow_apply_total metrics spiking. When you delete thousands of objects at once, etcd has to process those deletions and replicate them across the quorum. Our etcd nodes were running on standard SSDs, not NVMe. The disk I/O wait climbed to 40%. The API server became unresponsive. The cluster was brain-dead.
The Fix: Least Privilege RBAC
You do not need cluster-admin. Your pods do not need to talk to the API server unless they are specifically designed for cluster management.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payment-processor-sa
  namespace: production
automountServiceAccountToken: false  # DO NOT MOUNT THIS BY DEFAULT
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: pod-viewer
rules:  # Role has no spec block; rules sit at the top level
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: production
subjects:
- kind: ServiceAccount
  name: payment-processor-sa
  namespace: production  # required for ServiceAccount subjects
roleRef:
  kind: Role
  name: pod-viewer
  apiGroup: rbac.authorization.k8s.io
Set automountServiceAccountToken: false. It prevents a compromised pod from having an easy path to lateral movement. If I find another cluster-admin role in our dev namespace, I’m revoking everyone’s SSH keys.
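With automounting disabled at the ServiceAccount level, the rare workload that genuinely needs API access opts back in explicitly in its pod spec. A sketch; the pod name and image are hypothetical:

```yaml
# Pod-level opt-in for the one workload that actually talks to the API server
apiVersion: v1
kind: Pod
metadata:
  name: cluster-janitor           # hypothetical name
  namespace: production
spec:
  serviceAccountName: payment-processor-sa
  automountServiceAccountToken: true   # explicit, greppable, auditable
  containers:
  - name: janitor
    image: our-registry.io/janitor:v1  # hypothetical image
```

Now a quick grep for `automountServiceAccountToken: true` tells you exactly which pods can reach the API server.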
Traffic Management and Network Policies: The Killer in the Dark
By hour 48, we had the pods running and the RBAC locked down. Then the network died.
Our cluster uses a flat network (standard CNI behavior). A developer in the staging namespace was running a load test. Because we had no NetworkPolicies, their staging traffic was able to route directly to our production Redis instance.
They flooded the Redis connection pool. Production pods started throwing ConnectionPoolTimeoutException. Because the production pods couldn’t talk to Redis, they failed their readinessProbes (see the previous section on why that’s a disaster).
The CNI plugin (Calico, in our case) started dropping packets because the conntrack table on the nodes was full: sysctl net.netfilter.nf_conntrack_count was reporting over 200,000 entries.
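Raising the conntrack ceiling buys you headroom while you fix the actual traffic problem. A sysctl fragment with illustrative values; size them against node memory, since every entry costs kernel memory:

```
# /etc/sysctl.d/90-conntrack.conf -- illustrative values, tune per node
net.netfilter.nf_conntrack_max = 1048576
# Expire established-flow entries sooner than the 5-day kernel default
net.netfilter.nf_conntrack_tcp_timeout_established = 86400
```

Apply with sysctl --system and watch nf_conntrack_count against nf_conntrack_max in your dashboards, not just during incidents.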
The Fix: Default Deny Network Policies
Kubernetes networking is open by default. That is a security and stability nightmare. You must implement a “default-deny” policy and then explicitly allow the traffic you expect. This is a Kubernetes best practice for a reason.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-redis-from-app
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: redis
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: payment-gateway
    ports:
    - protocol: TCP
      port: 6379
This policy would have blocked the staging load test from ever touching production. It also prevents a compromised frontend pod from scanning your internal network for other vulnerabilities.
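One gotcha with a default-deny that includes Egress: it also blocks DNS, so every pod in the namespace immediately fails name resolution. You need an explicit carve-out for the cluster DNS. This sketch assumes the standard k8s-app: kube-dns label that CoreDNS carries in kube-system:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: production
spec:
  podSelector: {}     # every pod in the namespace may resolve names
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector: {}
      podSelector:
        matchLabels:
          k8s-app: kube-dns   # standard CoreDNS label in kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```

Ship this in the same commit as the default-deny, or your first rollout of it will be its own incident.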
The Long Road to Recovery: Etcd and the Aftermath
The final 24 hours were spent in etcd hell. When the cluster collapsed, the etcd database became fragmented. Even after we stopped the bleeding, the API server was slow.
I had to perform an etcdctl defrag on each member of the cluster. If you’ve never done this on a live production cluster while your boss is breathing down your neck, consider yourself lucky.
Here is the sequence of commands that saved our skin, run from within the etcd pod:
export ETCDCTL_API=3
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status --write-out=table

# Defragmenting to reclaim space and improve latency
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  defrag
After the defrag, the backend_commit_duration_seconds dropped from 150ms to 5ms. The cluster finally felt snappy again.
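One caveat for anyone copying those commands: defrag only returns space that compaction has already released. If the keyspace was never compacted, compact up to the current revision first, then defrag and clear any NOSPACE alarm. A sketch using the same endpoint and cert flags as the commands above, run against each member in turn:

```shell
# Shared flags for brevity (same endpoint/certs as earlier in this post)
ETCD="etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key"

# Compact history up to the current revision, then reclaim the disk space
rev=$($ETCD endpoint status --write-out=json \
  | grep -o '"revision":[0-9]*' | cut -d: -f2 | head -1)
$ETCD compact "$rev"
$ETCD defrag
$ETCD alarm disarm   # clear a NOSPACE alarm once space is reclaimed
```

Defrag blocks the member while it runs, so never fire it at every member at once on a live cluster.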
But we shouldn’t have been there. We should have had PodDisruptionBudgets (PDBs) to ensure that even during a mass failure, a certain percentage of our pods remained available.
The Fix: PodDisruptionBudget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-gateway-pdb
  namespace: production
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: payment-gateway
This ensures that the eviction API will refuse to kill pods if it drops the count below 3. It’s a safety net for your availability.
The Hard Truth
We recovered. The cluster is stable. The developer who pushed the change is currently writing a 50-page apology in the form of documentation.
But here’s the reality: Kubernetes is a complex system of distributed state. It is not a “set it and forget it” platform. If you aren’t using tools like kube-score or Polaris to lint your manifests before they hit the cluster, you are playing Russian Roulette with five chambers loaded.
We’ve now integrated kube-score into our CI/CD pipeline. If a Helm chart doesn’t have resource limits, proper probes, and a PDB, the build fails. No human intervention required.
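The gate itself is a one-liner in the pipeline: render the chart, pipe it to the linter, fail the build on a non-zero exit. A sketch; the chart path and values file are assumptions, and kube-score reads rendered manifests from stdin when given `-`:

```shell
# Render the chart and lint the result; a non-zero exit fails the pipeline.
# Chart path and values file are illustrative.
helm template payment-gateway ./charts/payment-gateway \
  --values values-production.yaml \
  | kube-score score -
```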
The next time someone tells you that Kubernetes is “easy,” show them this report. Show them the OOMKilled logs. Show them the etcd latency graphs.
I’m going to sleep now. If my pager goes off because you decided to ignore a ResourceQuota, don’t bother calling me. Just start updating your resume.
Final Advice: Your cluster is a reflection of your discipline. If your YAML is a mess, your uptime will be too. Stop looking for “thought leadership” and start reading the damn documentation. Kubernetes 1.30 has everything you need to stay stable, but it won’t save you from your own laziness.
Stay paranoid. Check your limits. And for the love of all that is holy, never deploy on a Friday.