It is 4:12 AM. My eyes feel like someone rubbed them with industrial-grade sandpaper. The third pot of coffee tastes like battery acid and failed dreams. I’ve been staring at a Grafana dashboard that looks like a heart monitor of a patient in mid-arrest for the last three days.
Why? Because a “senior” developer decided that Friday at 4:45 PM was the perfect time to push a Helm chart update using Helm v3.14 to our Kubernetes 1.30 production cluster. They called it a “minor tweak.” I call it a digital pipe bomb.
If you are reading this, you are likely either an SRE looking for a reason to quit or a developer who just realized their “simple” YAML change is currently costing the company $15,000 a minute in lost revenue. Sit down. Shut up. Read this. I’m going to walk you through the anatomy of this cluster collapse so you never have to feel the vibration of a PagerDuty alert at 3:14 AM ever again.
The Incident Context: The Friday Afternoon Pipe Bomb
The disaster started with a single helm upgrade --install. We’re running Kubernetes 1.30, the latest stable, thinking the new features would save us from ourselves. We were wrong. The deployment was for a “critical” microservice—let’s call it payment-gateway-v2.
The developer didn’t check the existing state. They didn’t run a helm diff. They just fired the command and went to happy hour. Within six minutes, the API server started lagging. Within ten, the nodes began reporting NotReady.
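A helm diff dry run takes about thirty seconds and would have shown the blast radius before anything shipped. Here is a sketch using the helm-diff plugin; the release name matches this story, but the chart path and values file are assumptions:

```shell
# Install the helm-diff plugin once (assumes outbound access to GitHub)
helm plugin install https://github.com/databus23/helm-diff

# Preview exactly what the upgrade would change -- no cluster state is modified.
# Chart path and values file below are illustrative.
helm diff upgrade payment-gateway-v2 ./charts/payment-gateway \
  --namespace production \
  --values values-production.yaml
```

If the diff shows anything you didn't expect, you stop. That's the whole discipline.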
Here is what the carnage looked like on the ground. This is the raw output from kubectl get events --all-namespaces --sort-by='.lastTimestamp' shortly after the first wave of failures:
LAST SEEN   TYPE      REASON             OBJECT                                   MESSAGE
12s         Warning   Unhealthy          pod/payment-gateway-v2-7489cf8d-abcde    Readiness probe failed: HTTP probe failed with statuscode: 503
8s          Warning   BackOff            pod/payment-gateway-v2-7489cf8d-abcde    Back-off restarting failed container
5s          Normal    Scheduled          pod/payment-gateway-v2-7489cf8d-fghij    Successfully assigned payment-gateway-v2 to node-04
2s          Warning   FailedScheduling   pod/payment-gateway-v2-7489cf8d-klmno    0/12 nodes are available: 12 Insufficient cpu, 12 Insufficient memory.
1s          Warning   OOMKilling         node/node-04                             System OOM encountered, victim process: kubelet
The cluster wasn’t just failing; it was eating itself. The scheduler was trying to cram pods into nodes that were already gasping for air. This is what happens when you ignore Kubernetes best practices because you think you’re smarter than the scheduler. You aren’t.
Resource Limits and the OOMKiller: The Silent Executioner
The first domino to fall was the lack of defined resource requests and limits. In Kubernetes 1.30 the scheduler is more efficient, but it’s not psychic. If you don’t tell it how much memory a pod needs, it treats the pod as BestEffort and assumes it needs nothing at all.
The payment-gateway-v2 pod had no resources block. It started up, saw 64GB of RAM on the node, and decided to cache the entire production database in-memory. The Linux OOM (Out Of Memory) Killer didn’t just kill the pod; it got confused by the rapid memory pressure and started killing critical system processes, including the kubelet and the CNI plugin.
When the kubelet dies, the node goes NotReady. When the node goes NotReady, the control plane tries to reschedule those pods onto other nodes. Those nodes were already at 90% capacity. It was a circular firing squad.
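For contrast, here is roughly what the missing resources block should have looked like on the payment-gateway-v2 container. The numbers are illustrative; size them from real usage metrics, not guesses:

```yaml
# Illustrative values -- derive these from observed usage, not optimism
resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi
```

Setting a memory limit means the OOM killer targets the greedy container instead of taking the kubelet down with it.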
To prevent this, you need a ResourceQuota in every namespace. No exceptions. If a developer tries to deploy a pod without limits, the API server should slap their hand away.
The Fix: ResourceQuota Manifest
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
  namespace: production
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 100Gi
    limits.cpu: "40"
    limits.memory: 200Gi
    pods: "50"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - default:
      cpu: 500m
      memory: 512Mi
    defaultRequest:
      cpu: 250m
      memory: 256Mi
    type: Container
If we had this in place, the “minor tweak” would have failed at the CI/CD stage. Instead, I spent four hours manually killing zombie processes on bare-metal nodes. Use LimitRange to force a default. It’s the only way to survive the “I forgot to add limits” excuse.
Probes that Lie: The Liveness Probe Death Spiral
Once we got the nodes back online, the next wave of the nightmare began. The developer had configured a livenessProbe that pointed to the same /healthz endpoint as the readinessProbe.
Here is the kubectl describe pod output for one of the failing pods:
Containers:
  gateway:
    Image:          our-registry.io/payment-gateway:v2.1.4
    Port:           8080/TCP
    Liveness:       http-get http://:8080/healthz delay=5s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:8080/healthz delay=5s timeout=1s period=10s #success=1 #failure=3
    State:          Running
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Mon, 14 Oct 2024 02:10:15 +0000
      Finished:     Mon, 14 Oct 2024 02:12:45 +0000
    Ready:          False
    Restart Count:  14
Exit code 137. That’s a SIGKILL. The pod was being murdered by the kubelet.
Why? Because the /healthz endpoint was checking the database connection. The database was under heavy load because of the previous node failures. The response took 1.1 seconds. The timeoutSeconds was set to 1.
The readinessProbe failed, so the pod was removed from the Service load balancer. That’s fine. But the livenessProbe also failed. Kubernetes thought the pod was dead, so it killed it and started a new one. This created a “thundering herd” effect. Every time a pod started, it tried to initialize, hit the database, timed out, and got killed. We were spending more CPU cycles on container startup than on actual traffic.
The Fix: Intelligent Probes
Stop using the same endpoint for both. A liveness probe should fail only when the process is actually unrecoverable (e.g., a deadlock). Readiness is for managing traffic flow. In K8s 1.30 you should also be using a startupProbe for legacy apps that take a long time to warm up.
livenessProbe:
  httpGet:
    path: /live
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
  failureThreshold: 5    # Be gentle
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  successThreshold: 2    # Ensure it's actually stable
startupProbe:
  httpGet:
    path: /init
    port: 8080
  failureThreshold: 30
  periodSeconds: 10      # 30 x 10s = up to 5 minutes to start if needed
The RBAC Nightmare: How a ServiceAccount Nuked a Namespace
While I was fighting the OOMKiller, a junior dev tried to “help” by running a cleanup script. This script used a ServiceAccount that had been granted cluster-admin privileges six months ago because “we were in a rush and couldn’t figure out the specific permissions.”
The script had a bug. It was supposed to run kubectl delete pod -l app=old-version, but due to a shell scripting error involving an empty variable, it effectively executed something equivalent to a broad delete across the namespace.
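The failure mode is worth spelling out, because it is a classic. Here is a hypothetical reconstruction (the function and variable names are illustrative, not the real script): with an empty selector variable, `kubectl delete pod -l $SELECTOR` expands to `kubectl delete pod -l` and the delete loses its scope. The guard below refuses to build the command at all:

```shell
#!/bin/sh
# Hypothetical reconstruction of the cleanup-script bug.
# Guarded wrapper: refuse to build a delete command without a selector.
safe_delete() {
  selector="$1"
  if [ -z "$selector" ]; then
    echo "refusing to delete without a label selector" >&2
    return 1
  fi
  # Dry run: print the command instead of executing it
  echo "kubectl delete pod -l $selector"
}

safe_delete "app=old-version"       # scoped command is printed
safe_delete "" || echo "blocked"    # empty selector is rejected
```

`set -u` at the top of every script and `${VAR:?}` expansions give you the same protection for free. Quoting alone would not have saved this one; validation would have.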
Because the ServiceAccount was over-privileged, it didn’t just delete pods. It started deleting ConfigMaps, Secrets, and eventually, the PersistentVolumeClaims.
The etcd database—the brain of your cluster—started screaming. We saw etcd_server_slow_apply_total metrics spiking. When you delete thousands of objects at once, etcd has to process those deletions and replicate them across the quorum. Our etcd nodes were running on standard SSDs, not NVMe. The disk I/O wait climbed to 40%. The API server became unresponsive. The cluster was brain-dead.
The Fix: Least Privilege RBAC
You do not need cluster-admin. Your pods do not need to talk to the API server unless they are specifically designed for cluster management.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payment-processor-sa
  namespace: production
automountServiceAccountToken: false  # DO NOT MOUNT THIS BY DEFAULT
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: pod-viewer
rules:  # Role has no spec block; rules sit at the top level
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: production
subjects:
- kind: ServiceAccount
  name: payment-processor-sa
  namespace: production  # required for ServiceAccount subjects
roleRef:
  kind: Role
  name: pod-viewer
  apiGroup: rbac.authorization.k8s.io
Set automountServiceAccountToken: false. It prevents a compromised pod from having an easy path to lateral movement. If I find another cluster-admin role in our dev namespace, I’m revoking everyone’s SSH keys.
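With automounting disabled at the ServiceAccount level, the rare workload that genuinely needs API access opts back in explicitly in its pod spec. A sketch; the pod name and image are hypothetical:

```yaml
# Pod-level opt-in for the one workload that actually talks to the API server
apiVersion: v1
kind: Pod
metadata:
  name: cluster-janitor           # hypothetical name
  namespace: production
spec:
  serviceAccountName: payment-processor-sa
  automountServiceAccountToken: true   # explicit, greppable, auditable
  containers:
  - name: janitor
    image: our-registry.io/janitor:v1  # hypothetical image
```

Now a quick grep for `automountServiceAccountToken: true` tells you exactly which pods can reach the API server.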
Traffic Management and Network Policies: The Killer in the Dark
By hour 48, we had the pods running and the RBAC locked down. Then the network died.
Our cluster uses a flat network (standard CNI behavior). A developer in the staging namespace was running a load test. Because we had no NetworkPolicies, their staging traffic was able to route directly to our production Redis instance.
They flooded the Redis connection pool. Production pods started throwing ConnectionPoolTimeoutException. Because the production pods couldn’t talk to Redis, they failed their readinessProbes (see the previous section on why that’s a disaster).
The CNI plugin (Calico, in our case) started dropping packets because the conntrack table on the nodes was full: sysctl net.netfilter.nf_conntrack_count was reporting over 200,000 entries.
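Raising the conntrack ceiling buys you headroom while you fix the actual traffic problem. A sysctl fragment with illustrative values; size them against node memory, since every entry costs kernel memory:

```
# /etc/sysctl.d/90-conntrack.conf -- illustrative values, tune per node
net.netfilter.nf_conntrack_max = 1048576
# Expire established-flow entries sooner than the 5-day kernel default
net.netfilter.nf_conntrack_tcp_timeout_established = 86400
```

Apply with sysctl --system and watch nf_conntrack_count against nf_conntrack_max in your dashboards, not just during incidents.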
The Fix: Default Deny Network Policies
Kubernetes networking is open by default. That is a security and stability nightmare. You must implement a “default-deny” policy and then explicitly allow the traffic you expect. This is a Kubernetes best practice for a reason.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-redis-from-app
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: redis
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: payment-gateway
    ports:
    - protocol: TCP
      port: 6379
This policy would have blocked the staging load test from ever touching production. It also prevents a compromised frontend pod from scanning your internal network for other vulnerabilities.
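One gotcha with a default-deny that includes Egress: it also blocks DNS, so every pod in the namespace immediately fails name resolution. You need an explicit carve-out for the cluster DNS. This sketch assumes the standard k8s-app: kube-dns label that CoreDNS carries in kube-system:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: production
spec:
  podSelector: {}     # every pod in the namespace may resolve names
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector: {}
      podSelector:
        matchLabels:
          k8s-app: kube-dns   # standard CoreDNS label in kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```

Ship this in the same commit as the default-deny, or your first rollout of it will be its own incident.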
The Long Road to Recovery: Etcd and the Aftermath
The final 24 hours were spent in etcd hell. When the cluster collapsed, the etcd database became fragmented. Even after we stopped the bleeding, the API server was slow.
I had to perform an etcdctl defrag on each member of the cluster. If you’ve never done this on a live production cluster while your boss is breathing down your neck, consider yourself lucky.
Here is the sequence of commands that saved our skin, run from within the etcd pod:
export ETCDCTL_API=3
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status --write-out=table

# Defragmenting to reclaim space and improve latency
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  defrag
After the defrag, the backend_commit_duration_seconds dropped from 150ms to 5ms. The cluster finally felt snappy again.
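One caveat for anyone copying those commands: defrag only returns space that compaction has already released. If the keyspace was never compacted, compact up to the current revision first, then defrag and clear any NOSPACE alarm. A sketch using the same endpoint and cert flags as the commands above, run against each member in turn:

```shell
# Shared flags for brevity (same endpoint/certs as earlier in this post)
ETCD="etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key"

# Compact history up to the current revision, then reclaim the disk space
rev=$($ETCD endpoint status --write-out=json \
  | grep -o '"revision":[0-9]*' | cut -d: -f2 | head -1)
$ETCD compact "$rev"
$ETCD defrag
$ETCD alarm disarm   # clear a NOSPACE alarm once space is reclaimed
```

Defrag blocks the member while it runs, so never fire it at every member at once on a live cluster.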
But we shouldn’t have been there. We should have had PodDisruptionBudgets (PDBs) to ensure that even during a mass failure, a certain percentage of our pods remained available.
The Fix: PodDisruptionBudget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-gateway-pdb
  namespace: production
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: payment-gateway
This ensures that the eviction API will refuse to kill pods if it drops the count below 3. It’s a safety net for your availability.
The Hard Truth
We recovered. The cluster is stable. The developer who pushed the change is currently writing a 50-page apology in the form of documentation.
But here’s the reality: Kubernetes is a complex system of distributed state. It is not a “set it and forget it” platform. If you aren’t using tools like kube-score or Polaris to lint your manifests before they hit the cluster, you are playing Russian Roulette with five chambers loaded.
We’ve now integrated kube-score into our CI/CD pipeline. If a Helm chart doesn’t have resource limits, proper probes, and a PDB, the build fails. No human intervention required.
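The gate itself is a one-liner in the pipeline: render the chart, pipe it to the linter, fail the build on a non-zero exit. A sketch; the chart path and values file are assumptions, and kube-score reads rendered manifests from stdin when given `-`:

```shell
# Render the chart and lint the result; a non-zero exit fails the pipeline.
# Chart path and values file are illustrative.
helm template payment-gateway ./charts/payment-gateway \
  --values values-production.yaml \
  | kube-score score -
```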
The next time someone tells you that Kubernetes is “easy,” show them this report. Show them the OOMKilled logs. Show them the etcd latency graphs.
I’m going to sleep now. If my pager goes off because you decided to ignore a ResourceQuota, don’t bother calling me. Just start updating your resume.
Final Advice: Your cluster is a reflection of your discipline. If your YAML is a mess, your uptime will be too. Stop looking for “thought leadership” and start reading the damn documentation. Kubernetes 1.30 has everything you need to stay stable, but it won’t save you from your own laziness.
Stay paranoid. Check your limits. And for the love of all that is holy, never deploy on a Friday.