{"id":4734,"date":"2026-03-14T21:06:32","date_gmt":"2026-03-14T15:36:32","guid":{"rendered":"https:\/\/itsupportwale.com\/blog\/kubernetes-best-practices-optimize-your-clusters-today\/"},"modified":"2026-03-14T21:06:32","modified_gmt":"2026-03-14T15:36:32","slug":"kubernetes-best-practices-optimize-your-clusters-today","status":"publish","type":"post","link":"https:\/\/itsupportwale.com\/blog\/kubernetes-best-practices-optimize-your-clusters-today\/","title":{"rendered":"Kubernetes Best Practices: Optimize Your Clusters Today"},"content":{"rendered":"<p>It is 4:12 AM. My eyes feel like someone rubbed them with industrial-grade sandpaper. The third pot of coffee tastes like battery acid and failed dreams. I\u2019ve been staring at a Grafana dashboard that looks like a heart monitor of a patient in mid-arrest for the last three days. <\/p>\n<p>Why? Because a &#8220;senior&#8221; developer decided that Friday at 4:45 PM was the perfect time to push a Helm chart update using Helm v3.14 to our Kubernetes 1.30 production cluster. They called it a &#8220;minor tweak.&#8221; I call it a digital pipe bomb.<\/p>\n<p>If you are reading this, you are likely either an SRE looking for a reason to quit or a developer who just realized their &#8220;simple&#8221; YAML change is currently costing the company $15,000 a minute in lost revenue. Sit down. Shut up. Read this. 
I\u2019m going to walk you through the anatomy of this cluster collapse so you never have to feel the vibration of a PagerDuty alert at 3:14 AM ever again.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_Incident_Context_The_Friday_Afternoon_Pipe_Bomb\"><\/span>The Incident Context: The Friday Afternoon Pipe Bomb<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The disaster started with a single <code>helm upgrade --install<\/code>. We\u2019re running Kubernetes 1.30, the latest stable, thinking the new features would save us from ourselves. We were wrong. The deployment was for a &#8220;critical&#8221; microservice\u2014let\u2019s call it <code>payment-gateway-v2<\/code>. 
<\/p>\n<p>The developer didn&#8217;t check the existing state. They didn&#8217;t run a <code>helm diff<\/code>. They just fired the command and went to happy hour. Within six minutes, the API server started lagging. Within ten, the nodes began reporting <code>NotReady<\/code>. <\/p>\n<p>Here is what the carnage looked like on the ground. This is the raw output from <code>kubectl get events --all-namespaces --sort-by='.lastTimestamp'<\/code> shortly after the first wave of failures:<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">LAST SEEN   TYPE      REASON             OBJECT                                  MESSAGE\n12s         Warning   Unhealthy          pod\/payment-gateway-v2-7489cf8d-abcde   Readiness probe failed: HTTP probe failed with statuscode: 503\n8s          Warning   BackOff            pod\/payment-gateway-v2-7489cf8d-abcde   Back-off restarting failed container\n5s          Normal    Scheduled          pod\/payment-gateway-v2-7489cf8d-fghij   Successfully assigned production\/payment-gateway-v2-7489cf8d-fghij to node-04\n2s          Warning   FailedScheduling   pod\/payment-gateway-v2-7489cf8d-klmno   0\/12 nodes are available: 12 Insufficient cpu, 12 Insufficient memory.\n1s          Warning   OOMKilling         node\/node-04                            System OOM encountered, victim process: kubelet\n<\/code><\/pre>\n<p>The cluster wasn&#8217;t just failing; it was eating itself. The scheduler was trying to cram pods into nodes that were already gasping for air. This is what happens when you ignore <strong>Kubernetes best practices<\/strong> because you think you\u2019re smarter than the scheduler. You aren&#8217;t.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Resource_Limits_and_the_OOMKiller_The_Silent_Executioner\"><\/span>Resource Limits and the OOMKiller: The Silent Executioner<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The first domino to fall was the lack of defined resource requests and limits. 
In Kubernetes 1.30, the scheduler is more efficient, but it\u2019s not a psychic. If you don\u2019t tell it how much memory a pod needs, it assumes &#8220;not much.&#8221; <\/p>\n<p>The <code>payment-gateway-v2<\/code> pod had no <code>resources<\/code> block. It started up, saw 64GB of RAM on the node, and decided to cache the entire production database in memory. The Linux OOM (Out Of Memory) Killer didn&#8217;t just kill the pod; under the sudden memory pressure it started killing critical system processes, including the <code>kubelet<\/code> and the CNI plugin.<\/p>\n<p>When the <code>kubelet<\/code> dies, the node goes <code>NotReady<\/code>. When the node goes <code>NotReady<\/code>, the control plane tries to reschedule those pods onto <em>other<\/em> nodes. Those nodes were already at 90% capacity. It was a circular firing squad.<\/p>\n<p>To prevent this, you need a <code>ResourceQuota<\/code> in every namespace. No exceptions. If a developer tries to deploy a pod without limits, the API server should slap their hand away.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"The_Fix_ResourceQuota_Manifest\"><\/span>The Fix: ResourceQuota Manifest<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<pre class=\"codehilite\"><code class=\"language-yaml\">apiVersion: v1\nkind: ResourceQuota\nmetadata:\n  name: compute-resources\n  namespace: production\nspec:\n  hard:\n    requests.cpu: &quot;20&quot;\n    requests.memory: 100Gi\n    limits.cpu: &quot;40&quot;\n    limits.memory: 200Gi\n    pods: &quot;50&quot;\n---\napiVersion: v1\nkind: LimitRange\nmetadata:\n  name: default-limits\n  namespace: production\nspec:\n  limits:\n  - default:\n      cpu: 500m\n      memory: 512Mi\n    defaultRequest:\n      cpu: 250m\n      memory: 256Mi\n    type: Container\n<\/code><\/pre>\n<p>If we had this in place, the API server would have rejected the &#8220;minor tweak&#8221; at admission time, before a single pod was scheduled. Instead, I spent four hours manually killing zombie processes on bare-metal nodes. 
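<\/p>\n<p>One cheap layer of insurance on top of the quota: gate rendered manifests before they ever reach the API server. The script below is a hypothetical sketch, not our actual pipeline. It is a crude grep for a <code>resources<\/code> block that a dedicated linter would do properly, but even something this dumb would have flagged the &#8220;minor tweak.&#8221;<\/p>

```shell
#!/usr/bin/env bash
# Hypothetical pre-deploy gate: reject any rendered manifest that never
# declares a resources block. Crude by design; a real pipeline should use
# a dedicated manifest linter instead of a grep.
check_limits() {
  local manifest="$1"
  if ! grep -qE '^[[:space:]]*resources:' "$manifest"; then
    echo "REJECTED: ${manifest} declares no resources block" >&2
    return 1
  fi
  echo "OK: ${manifest}"
}
```

<p>Render the chart first (for example with <code>helm template<\/code>) and run the check on the output; a non-zero exit fails the build.<\/p>\n<p>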
Use <code>LimitRange<\/code> to force a default. It\u2019s the only way to survive the &#8220;I forgot to add limits&#8221; excuse.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Probes_that_Lie_The_Liveness_Probe_Death_Spiral\"><\/span>Probes that Lie: The Liveness Probe Death Spiral<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Once we got the nodes back online, the next wave of the nightmare began. The developer had configured a <code>livenessProbe<\/code> that pointed to the same <code>\/healthz<\/code> endpoint as the <code>readinessProbe<\/code>. <\/p>\n<p>Here is the <code>kubectl describe pod<\/code> output for one of the failing pods:<\/p>\n<pre class=\"codehilite\"><code class=\"language-text\">Containers:\n  gateway:\n    Image:       our-registry.io\/payment-gateway:v2.1.4\n    Port:        8080\/TCP\n    Liveness:    http-get http:\/\/:8080\/healthz delay=5s timeout=1s period=10s #success=1 #failure=3\n    Readiness:   http-get http:\/\/:8080\/healthz delay=5s timeout=1s period=10s #success=1 #failure=3\nState:          Running\n  Last State:   Terminated\n    Reason:     Error\n    Exit Code:  137\n    Started:    Mon, 14 Oct 2024 02:10:15 +0000\n    Finished:   Mon, 14 Oct 2024 02:12:45 +0000\nReady:          False\nRestart Count:  14\n<\/code><\/pre>\n<p>Exit code 137. That\u2019s a SIGKILL. The pod was being murdered by the <code>kubelet<\/code>. <\/p>\n<p>Why? Because the <code>\/healthz<\/code> endpoint was checking the database connection. The database was under heavy load because of the previous node failures. The response took 1.1 seconds. The <code>timeoutSeconds<\/code> was set to 1. <\/p>\n<p>The <code>readinessProbe<\/code> failed, so the pod was removed from the Service load balancer. That\u2019s fine. But the <code>livenessProbe<\/code> <em>also<\/em> failed. Kubernetes thought the pod was dead, so it killed it and started a new one. This created a &#8220;thundering herd&#8221; effect. 
Every time a pod started, it tried to initialize, hit the database, timed out, and got killed. We were spending more CPU cycles on container startup than on actual traffic.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"The_Fix_Intelligent_Probes\"><\/span>The Fix: Intelligent Probes<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Stop using the same endpoint for both. Liveness should only fail if the process is actually unrecoverable (e.g., a deadlock). Readiness should be used to manage traffic flow. In K8s 1.30, you should also be using <code>startupProbes<\/code> for legacy apps that take a long time to warm up.<\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\">livenessProbe:\n  httpGet:\n    path: \/live\n    port: 8080\n  initialDelaySeconds: 15\n  periodSeconds: 20\n  failureThreshold: 5 # Be gentle\nreadinessProbe:\n  httpGet:\n    path: \/ready\n    port: 8080\n  initialDelaySeconds: 5\n  periodSeconds: 5\n  successThreshold: 2 # Ensure it's actually stable\nstartupProbe:\n  httpGet:\n    path: \/init\n    port: 8080\n  failureThreshold: 30\n  periodSeconds: 10 # Give it 5 minutes to start if needed\n<\/code><\/pre>\n<h2><span class=\"ez-toc-section\" id=\"The_RBAC_Nightmare_How_a_ServiceAccount_Nuked_a_Namespace\"><\/span>The RBAC Nightmare: How a ServiceAccount Nuked a Namespace<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>While I was fighting the OOMKiller, a junior dev tried to &#8220;help&#8221; by running a cleanup script. This script used a ServiceAccount that had been granted <code>cluster-admin<\/code> privileges six months ago because &#8220;we were in a rush and couldn&#8217;t figure out the specific permissions.&#8221;<\/p>\n<p>The script had a bug. It was supposed to run <code>kubectl delete pod -l app=old-version<\/code>, but due to a shell scripting error involving an empty variable, it effectively executed something equivalent to a broad delete across the namespace. 
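<\/p>\n<p>The original script isn\u2019t shown here, so treat the following as a sketch of the failure class rather than the actual code: an empty shell variable silently widening a label selector. The guard it needed is trivial; the function and variable names are mine.<\/p>

```shell
#!/usr/bin/env bash
# Sketch of the guard the cleanup script was missing: refuse to build a
# delete command when the label selector is empty. Names are illustrative.
build_delete_cmd() {
  local label="${1:-}"
  if [ -z "$label" ]; then
    echo "refusing to delete: empty label selector" >&2
    return 1
  fi
  # Print instead of executing; let the caller pull the trigger explicitly.
  echo "kubectl delete pod -l app=${label}"
}
```

<p>With the guard in place, an unset variable produces a loud refusal instead of a namespace-wide delete.<\/p>\n<p>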
<\/p>\n<p>Because the ServiceAccount was over-privileged, it didn&#8217;t just delete pods. It started deleting ConfigMaps, Secrets, and eventually, the <code>PersistentVolumeClaims<\/code>. <\/p>\n<p>The etcd database\u2014the brain of your cluster\u2014started screaming. We saw <code>etcd_server_slow_apply_total<\/code> metrics spiking. When you delete thousands of objects at once, etcd has to process those deletions and replicate them across the quorum. Our etcd nodes were running on standard SSDs, not NVMe. The disk I\/O wait climbed to 40%. The API server became unresponsive. The cluster was brain-dead.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"The_Fix_Least_Privilege_RBAC\"><\/span>The Fix: Least Privilege RBAC<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>You do not need <code>cluster-admin<\/code>. Your pods do not need to talk to the API server unless they are specifically designed for cluster management. <\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\">apiVersion: v1\nkind: ServiceAccount\nmetadata:\n  name: payment-processor-sa\n  namespace: production\nautomountServiceAccountToken: false # DO NOT MOUNT THIS BY DEFAULT\n---\napiVersion: rbac.authorization.k8s.io\/v1\nkind: Role\nmetadata:\n  namespace: production\n  name: pod-viewer\nrules:\n- apiGroups: [&quot;&quot;]\n  resources: [&quot;pods&quot;]\n  verbs: [&quot;get&quot;, &quot;list&quot;, &quot;watch&quot;]\n---\napiVersion: rbac.authorization.k8s.io\/v1\nkind: RoleBinding\nmetadata:\n  name: read-pods\n  namespace: production\nsubjects:\n- kind: ServiceAccount\n  name: payment-processor-sa\n  namespace: production\nroleRef:\n  kind: Role\n  name: pod-viewer\n  apiGroup: rbac.authorization.k8s.io\n<\/code><\/pre>\n<p>Set <code>automountServiceAccountToken: false<\/code>. It prevents a compromised pod from having an easy path to lateral movement. 
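<\/p>\n<p>After tightening RBAC, verify it from the API server\u2019s point of view with <code>kubectl auth can-i<\/code> and <code>--as<\/code> impersonation. The helper below only builds the command string so it can live in a script; run its output against a live cluster. The ServiceAccount name comes from the manifests in this post.<\/p>

```shell
# Build a "kubectl auth can-i" spot check for a ServiceAccount using
# --as impersonation. Prints the command; run it against a real cluster.
can_i_cmd() {
  local ns="$1" sa="$2" verb="$3" resource="$4"
  echo "kubectl auth can-i ${verb} ${resource} --as=system:serviceaccount:${ns}:${sa} -n ${ns}"
}

# Can the payment processor still delete pods? With a read-only Role
# bound, the cluster's answer should be "no".
can_i_cmd production payment-processor-sa delete pods
```

<p>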
If I find another <code>cluster-admin<\/code> role in our dev namespace, I\u2019m revoking everyone\u2019s SSH keys.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Traffic_Management_and_Network_Policies_The_Killer_in_the_Dark\"><\/span>Traffic Management and Network Policies: The Killer in the Dark<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>By hour 48, we had the pods running and the RBAC locked down. Then the network died. <\/p>\n<p>Our cluster uses a flat network (standard CNI behavior). A developer in the <code>staging<\/code> namespace was running a load test. Because we had no <code>NetworkPolicies<\/code>, their staging traffic was able to route directly to our production Redis instance. <\/p>\n<p>They flooded the Redis connection pool. Production pods started throwing <code>ConnectionPoolTimeoutException<\/code>. Because the production pods couldn&#8217;t talk to Redis, they failed their <code>readinessProbes<\/code> (see the previous section on why that\u2019s a disaster). <\/p>\n<p>The CNI plugin (Calico in our case) started dropping packets because the <code>conntrack<\/code> table on the nodes was full. We were seeing 200,000+ entries in <code>sysctl net.netfilter.nf_conntrack_count<\/code>. <\/p>\n<h3><span class=\"ez-toc-section\" id=\"The_Fix_Default_Deny_Network_Policies\"><\/span>The Fix: Default Deny Network Policies<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Kubernetes is open by default. That is a security and stability nightmare. You must implement a &#8220;default-deny&#8221; policy and then explicitly allow the traffic you expect. 
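<\/p>\n<p>One caveat before the manifests: a default-deny policy that includes <code>Egress<\/code> also cuts off DNS lookups and every outbound call your pods make, so each client workload needs explicit egress rules too. Here is a sketch for the gateway-to-Redis path plus DNS; the labels are assumed from this incident.<\/p>

```yaml
# Hypothetical egress companion to a default-deny policy: let the gateway
# reach Redis, and let it resolve DNS. Labels are illustrative.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-gateway-egress
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: payment-gateway
  policyTypes:
  - Egress
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: redis
    ports:
    - protocol: TCP
      port: 6379
  # DNS to any destination; tighten this to kube-dns where you can.
  - ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```

<p>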
This is a <strong>Kubernetes best practice<\/strong> for a reason.<\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\">apiVersion: networking.k8s.io\/v1\nkind: NetworkPolicy\nmetadata:\n  name: default-deny-all\n  namespace: production\nspec:\n  podSelector: {}\n  policyTypes:\n  - Ingress\n  - Egress\n---\napiVersion: networking.k8s.io\/v1\nkind: NetworkPolicy\nmetadata:\n  name: allow-redis-from-app\n  namespace: production\nspec:\n  podSelector:\n    matchLabels:\n      app: redis\n  ingress:\n  - from:\n    - podSelector:\n        matchLabels:\n          app: payment-gateway\n    ports:\n    - protocol: TCP\n      port: 6379\n<\/code><\/pre>\n<p>This policy would have blocked the staging load test from ever touching production. It also prevents a compromised frontend pod from scanning your internal network for other vulnerabilities.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_Long_Road_to_Recovery_Etcd_and_the_Aftermath\"><\/span>The Long Road to Recovery: Etcd and the Aftermath<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The final 24 hours were spent in etcd hell. When the cluster collapsed, the etcd database became fragmented. Even after we stopped the bleeding, the API server was slow. <\/p>\n<p>I had to perform an <code>etcdctl defrag<\/code> on each member of the cluster. If you\u2019ve never done this on a live production cluster while your boss is breathing down your neck, consider yourself lucky. 
<\/p>\n<p>Here is the sequence of commands that saved our skin, run from within the etcd pod:<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">export ETCDCTL_API=3\netcdctl --endpoints=https:\/\/127.0.0.1:2379 \\\n  --cacert=\/etc\/kubernetes\/pki\/etcd\/ca.crt \\\n  --cert=\/etc\/kubernetes\/pki\/etcd\/server.crt \\\n  --key=\/etc\/kubernetes\/pki\/etcd\/server.key \\\n  endpoint status --write-out=table\n\n# Defragmenting to reclaim space and improve latency\netcdctl --endpoints=https:\/\/127.0.0.1:2379 \\\n  --cacert=\/etc\/kubernetes\/pki\/etcd\/ca.crt \\\n  --cert=\/etc\/kubernetes\/pki\/etcd\/server.crt \\\n  --key=\/etc\/kubernetes\/pki\/etcd\/server.key \\\n  defrag\n<\/code><\/pre>\n<p>After the defrag, the <code>backend_commit_duration_seconds<\/code> dropped from 150ms to 5ms. The cluster finally felt snappy again. <\/p>\n<p>But we shouldn&#8217;t have been there. We should have had <code>PodDisruptionBudgets<\/code> (PDBs) to ensure that even during a mass failure, a certain percentage of our pods remained available.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"The_Fix_PodDisruptionBudget\"><\/span>The Fix: PodDisruptionBudget<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<pre class=\"codehilite\"><code class=\"language-yaml\">apiVersion: policy\/v1\nkind: PodDisruptionBudget\nmetadata:\n  name: payment-gateway-pdb\n  namespace: production\nspec:\n  minAvailable: 3\n  selector:\n    matchLabels:\n      app: payment-gateway\n<\/code><\/pre>\n<p>This ensures that the eviction API will refuse to kill pods if it drops the count below 3. It\u2019s a safety net for your availability.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_Hard_Truth\"><\/span>The Hard Truth<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>We recovered. The cluster is stable. The developer who pushed the change is currently writing a 50-page apology in the form of documentation. 
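<\/p>\n<p>The CI gate itself is nothing exotic: render the chart, pipe it through a linter, and fail the build on a non-zero exit. This sketch follows kube-score\u2019s documented usage; the chart name and path are illustrative.<\/p>

```bash
# Render the chart and score the result; kube-score exits non-zero when
# checks fail, which fails the CI job. Chart path is illustrative.
helm template payment-gateway ./charts/payment-gateway \
  | kube-score score -
```

<p>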
<\/p>\n<p>But here\u2019s the reality: Kubernetes is a complex system of distributed state. It is not a &#8220;set it and forget it&#8221; platform. If you aren&#8217;t using tools like <code>kube-score<\/code> or <code>Polaris<\/code> to lint your manifests before they hit the cluster, you are playing Russian Roulette with five chambers loaded.<\/p>\n<p>We\u2019ve now integrated <code>kube-score<\/code> into our CI\/CD pipeline. If a Helm chart doesn&#8217;t have resource limits, proper probes, and a PDB, the build fails. No human intervention required.<\/p>\n<p>The next time someone tells you that Kubernetes is &#8220;easy,&#8221; show them this report. Show them the <code>OOMKilled<\/code> logs. Show them the etcd latency graphs. <\/p>\n<p>I\u2019m going to sleep now. If my pager goes off because you decided to ignore a <code>ResourceQuota<\/code>, don&#8217;t bother calling me. Just start updating your resume. <\/p>\n<p><strong>Final Advice:<\/strong> Your cluster is a reflection of your discipline. If your YAML is a mess, your uptime will be too. Stop looking for &#8220;thought leadership&#8221; and start reading the damn documentation. Kubernetes 1.30 has everything you need to stay stable, but it won&#8217;t save you from your own laziness. <\/p>\n<p>Stay paranoid. Check your limits. 
And for the love of all that is holy, never deploy on a Friday.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Related_Articles\"><\/span>Related Articles<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Explore more insights and best practices:<\/p>\n<ul>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/vim-commands\/\">Vim Commands<\/a><\/li>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/10-docker-best-practices-to-optimize-your-containers-2\/\">10 Docker Best Practices To Optimize Your Containers 2<\/a><\/li>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/fixed-nginx-showing-blank-php-pages-with-fastcgi-or-php-fpm\/\">Fixed Nginx Showing Blank Php Pages With Fastcgi Or Php Fpm<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>It is 4:12 AM. My eyes feel like someone rubbed them with industrial-grade sandpaper. The third pot of coffee tastes like battery acid and failed dreams. I\u2019ve been staring at a Grafana dashboard that looks like a heart monitor of a patient in mid-arrest for the last three days. Why? 
Because a &#8220;senior&#8221; developer decided &#8230; <a title=\"Kubernetes Best Practices: Optimize Your Clusters Today\" class=\"read-more\" href=\"https:\/\/itsupportwale.com\/blog\/kubernetes-best-practices-optimize-your-clusters-today\/\" aria-label=\"Read more  on Kubernetes Best Practices: Optimize Your Clusters Today\">Read more<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-4734","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.0 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Kubernetes Best Practices: Optimize Your Clusters Today - ITSupportWale<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/itsupportwale.com\/blog\/kubernetes-best-practices-optimize-your-clusters-today\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Kubernetes Best Practices: Optimize Your Clusters Today - ITSupportWale\" \/>\n<meta property=\"og:description\" content=\"It is 4:12 AM. My eyes feel like someone rubbed them with industrial-grade sandpaper. The third pot of coffee tastes like battery acid and failed dreams. I\u2019ve been staring at a Grafana dashboard that looks like a heart monitor of a patient in mid-arrest for the last three days. Why? Because a &#8220;senior&#8221; developer decided ... 
Read more\" \/>\n<meta property=\"og:url\" content=\"https:\/\/itsupportwale.com\/blog\/kubernetes-best-practices-optimize-your-clusters-today\/\" \/>\n<meta property=\"og:site_name\" content=\"ITSupportWale\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\" \/>\n<meta property=\"article:published_time\" content=\"2026-03-14T15:36:32+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2021\/05\/android-chrome-512x512-1.png\" \/>\n\t<meta property=\"og:image:width\" content=\"512\" \/>\n\t<meta property=\"og:image:height\" content=\"512\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Techie\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Techie\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"10 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/kubernetes-best-practices-optimize-your-clusters-today\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/kubernetes-best-practices-optimize-your-clusters-today\/\"},\"author\":{\"name\":\"Techie\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\"},\"headline\":\"Kubernetes Best Practices: Optimize Your Clusters 
Today\",\"datePublished\":\"2026-03-14T15:36:32+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/kubernetes-best-practices-optimize-your-clusters-today\/\"},\"wordCount\":1569,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/kubernetes-best-practices-optimize-your-clusters-today\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/kubernetes-best-practices-optimize-your-clusters-today\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/kubernetes-best-practices-optimize-your-clusters-today\/\",\"name\":\"Kubernetes Best Practices: Optimize Your Clusters Today - ITSupportWale\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\"},\"datePublished\":\"2026-03-14T15:36:32+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/kubernetes-best-practices-optimize-your-clusters-today\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/kubernetes-best-practices-optimize-your-clusters-today\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/kubernetes-best-practices-optimize-your-clusters-today\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/itsupportwale.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Kubernetes Best Practices: Optimize Your Clusters Today\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"name\":\"ITSupportWale\",\"description\":\"Tips, Tricks, Fixed-Errors, Tutorials &amp; 