{"id":4743,"date":"2026-03-23T21:35:17","date_gmt":"2026-03-23T16:05:17","guid":{"rendered":"https:\/\/itsupportwale.com\/blog\/top-kubernetes-best-practices-for-production-success\/"},"modified":"2026-03-23T21:35:17","modified_gmt":"2026-03-23T16:05:17","slug":"top-kubernetes-best-practices-for-production-success","status":"publish","type":"post","link":"https:\/\/itsupportwale.com\/blog\/top-kubernetes-best-practices-for-production-success\/","title":{"rendered":"Top Kubernetes Best Practices for Production Success"},"content":{"rendered":"<p>It\u2019s 03:14 AM. The pager is screaming, and the cluster is a graveyard of CrashLoopBackOffs. Here is exactly how we got here and why your &#8220;standard&#8221; setup is a liability.<\/p>\n<p>I\u2019m staring at a terminal window that looks like a crime scene. The junior dev\u2014let&#8217;s call him Dave, because it\u2019s always a Dave\u2014pushed a &#8220;minor&#8221; update to the checkout service. He said it was just a dependency bump. Now, the <code>etcd<\/code> cluster is gasping for air, the API server is timing out, and the load balancer is throwing 503s like it\u2019s a sport. <\/p>\n<p>This isn&#8217;t a &#8220;learning opportunity.&#8221; This is a failure of engineering discipline. You followed a Medium tutorial written by someone who has never managed a production cluster under load. You thought Kubernetes would &#8220;self-heal&#8221; your way out of bad architecture. It won&#8217;t. 
It will just automate the destruction of your infrastructure at a scale you can&#8217;t imagine.<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">$ kubectl get pods -n prod-checkout\nNAME                                READY   STATUS             RESTARTS      AGE\ncheckout-api-7f8d9b6c5-4x2z1        0\/1     CrashLoopBackOff   42 (1m ago)   3h\ncheckout-api-7f8d9b6c5-9p0l2        0\/1     OOMKilled          12            3h\ncheckout-worker-5dd5756b68-v9q2w    0\/1     Pending            0             5m\n\n$ kubectl get events --sort-by='.lastTimestamp' -n prod-checkout\n3m12s       Warning   FailedScheduling   pod\/checkout-worker-5dd5756b68-v9q2w   0\/3 nodes are available: 3 Insufficient memory.\n2m45s       Warning   BackOff            pod\/checkout-api-7f8d9b6c5-4x2z1       Back-off restarting failed container\n1m10s       Normal    Killing            pod\/checkout-api-7f8d9b6c5-9p0l2       Stopping container checkout-api\n\n$ kubectl describe node gke-prod-pool-1-3a2b\nEvents:\n  Type     Reason                 Age                From                Message\n  ----     ------                 ----               ----                -------\n  Warning  SystemOOM              5m                 kubelet             System OOM encountered, victim process: checkout-api\n<\/code><\/pre>\n<p>The cluster is dead. The &#8220;kubernetes best&#8221; practices you ignored are now the reason I\u2019m on my fourth cup of lukewarm sludge. 
Let\u2019s perform the autopsy.<\/p>\n<h2><span class=\"ez-toc-section\" 
id=\"Your_Resource_Limits_are_a_Joke\"><\/span>Your Resource Limits are a Joke<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><strong>What the tutorials tell you:<\/strong> &#8220;Just set some limits so your pods don&#8217;t run away with the node. Or don&#8217;t! Kubernetes is smart enough to balance it.&#8221;<\/p>\n<p><strong>The cold, hard reality of production:<\/strong> If you don&#8217;t define <code>requests<\/code> and <code>limits<\/code> with surgical precision, the <code>kube-scheduler<\/code> is flying blind. Dave\u2019s &#8220;minor change&#8221; included a new library that pre-allocates 2GB of heap on startup. Because he didn&#8217;t update the manifest, the pod started with a <code>request<\/code> of 256MB. The scheduler saw a node with 512MB free and said, &#8220;Yeah, that fits.&#8221; <\/p>\n<p>Then the pod actually tried to start. It hit the <code>cgroup<\/code> limit, the kernel OOMKiller woke up, and it executed the process. But it didn&#8217;t just kill the pod; because of the way <code>cgroups v2<\/code> handles memory pressure in Kubernetes v1.29 and v1.30, the entire node started thrashing. We ended up with a &#8220;noisy neighbor&#8221; situation where the checkout service strangled the <code>kubelet<\/code> itself.<\/p>\n<p>In a real environment, you use the <code>Guaranteed<\/code> Quality of Service (QoS) class for critical workloads. That means <code>requests<\/code> == <code>limits<\/code>. No overcommitting. No &#8220;burstable&#8221; nonsense for the core API.<\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\"># This is not a suggestion. 
This is a survival requirement.\napiVersion: v1\nkind: ResourceQuota\nmetadata:\n  name: compute-resources\n  namespace: prod-checkout\nspec:\n  hard:\n    requests.cpu: &quot;20&quot;\n    requests.memory: 40Gi\n    limits.cpu: &quot;40&quot;\n    limits.memory: 80Gi\n---\napiVersion: v1\nkind: Pod\nmetadata:\n  name: checkout-api\nspec:\n  containers:\n  - name: checkout-api\n    image: company\/checkout:v1.2.3\n    resources:\n      requests:\n        memory: &quot;2Gi&quot;\n        cpu: &quot;1000m&quot;\n      limits:\n        memory: &quot;2Gi&quot;\n        cpu: &quot;1000m&quot; # Guaranteed QoS class\n<\/code><\/pre>\n<p>If you don&#8217;t enforce <code>ResourceQuotas<\/code> at the namespace level, you are one <code>git push<\/code> away from a total cluster blackout. Dave didn&#8217;t know the node capacity. The <code>ResourceQuota<\/code> would have rejected his deployment at the API level before it ever hit the scheduler. Instead, we got a cascading failure.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_Namespace_is_Not_a_Security_Boundary\"><\/span>The Namespace is Not a Security Boundary<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><strong>What the tutorials tell you:<\/strong> &#8220;Use namespaces to organize your apps. It keeps things clean and secure.&#8221;<\/p>\n<p><strong>The cold, hard reality of production:<\/strong> A namespace is just a prefix in the API server. By default, every pod in your cluster can talk to every other pod. Dave\u2019s &#8220;minor change&#8221; included a debugging tool that accidentally exposed a metrics endpoint on all interfaces. Because we didn&#8217;t have <code>NetworkPolicies<\/code> enforced, a compromised pod in the <code>dev-sandbox<\/code> namespace (which some genius linked to the same cluster) could have curled the internal checkout DB.<\/p>\n<p>In Kubernetes v1.30, we have better support for <code>AdminNetworkPolicy<\/code>, but most of you are still running wide-open flat networks. 
You\u2019re essentially running a 1990s-style LAN and wondering why you\u2019re getting lateral movement during a breach.<\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\"># Default Deny All - Start from zero, you cowards.\napiVersion: networking.k8s.io\/v1\nkind: NetworkPolicy\nmetadata:\n  name: default-deny-all\n  namespace: prod-checkout\nspec:\n  podSelector: {}\n  policyTypes:\n  - Ingress\n  - Egress\n---\n# Only allow the API to talk to the DB\napiVersion: networking.k8s.io\/v1\nkind: NetworkPolicy\nmetadata:\n  name: allow-db-access\n  namespace: prod-checkout\nspec:\n  podSelector:\n    matchLabels:\n      app: checkout-api\n  policyTypes:\n  - Egress\n  egress:\n  - to:\n    - podSelector:\n        matchLabels:\n          app: checkout-db\n    ports:\n    - protocol: TCP\n      port: 5432\n  # Allow DNS too, or the pod cannot even resolve checkout-db under default-deny\n  - to:\n    - namespaceSelector: {}\n      podSelector:\n        matchLabels:\n          k8s-app: kube-dns\n    ports:\n    - protocol: UDP\n      port: 53\n    - protocol: TCP\n      port: 53\n<\/code><\/pre>\n<p>Without this, your &#8220;blast radius&#8221; is the entire cluster. You\u2019re one RCE away from losing the keys to the kingdom. We spent two hours of this outage just verifying that the <code>CrashLoopBackOff<\/code> wasn&#8217;t an active exfiltration attempt because our observability into internal traffic is garbage.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Your_Liveness_Probes_are_Killing_Your_Availability\"><\/span>Your Liveness Probes are Killing Your Availability<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><strong>What the tutorials tell you:<\/strong> &#8220;Use liveness probes to restart your app if it hangs. It\u2019s the magic &#8216;fix-it&#8217; button.&#8221;<\/p>\n<p><strong>The cold, hard reality of production:<\/strong> Liveness probes are a loaded gun pointed at your foot. Dave set a liveness probe to check the database connection. When the database got slow due to the increased load, the liveness probe failed. Kubernetes, being the obedient soldier it is, killed the pod. 
<\/p>\n<p>This created a &#8220;death spiral.&#8221; The remaining pods took on the extra traffic, got even slower, failed <em>their<\/em> liveness probes, and were killed. Within 90 seconds, the entire deployment was down. <\/p>\n<p>You use <code>readinessProbes<\/code> to control traffic flow. You use <code>livenessProbes<\/code> <em>only<\/em> to catch hard deadlocks that the application cannot recover from. And for the love of all that is holy, use <code>startupProbes<\/code> for apps that take a long time to initialize so you don&#8217;t kill them while they&#8217;re still loading their 500MB of Java classes.<\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\">readinessProbe:\n  httpGet:\n    path: \/healthz\/ready\n    port: 8080\n  initialDelaySeconds: 5\n  periodSeconds: 10\n  failureThreshold: 3\nlivenessProbe:\n  httpGet:\n    path: \/healthz\/live # This should NOT check the DB. Only the process state.\n    port: 8080\n  initialDelaySeconds: 15\n  periodSeconds: 20\n<\/code><\/pre>\n<p>If your liveness probe depends on an external dependency, you have built a distributed suicide machine. <\/p>\n<h2><span class=\"ez-toc-section\" id=\"etcd_is_Not_a_Magic_Database\"><\/span>etcd is Not a Magic Database<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><strong>What the tutorials tell you:<\/strong> &#8220;Kubernetes stores its state in etcd. It\u2019s highly available and consistent.&#8221;<\/p>\n<p><strong>The cold, hard reality of production:<\/strong> <code>etcd<\/code> is a finicky beast that demands high-performance IOPS and low latency. Dave\u2019s &#8220;minor change&#8221; caused a massive spike in &#8220;drift&#8221; because he was using a controller that was aggressively updating labels on thousands of pods every second. 
This flooded the <code>kube-apiserver<\/code> with write requests.<\/p>\n<p>The <code>etcd<\/code> write-ahead log (WAL) couldn&#8217;t keep up because we were running on standard persistent disks instead of local SSDs. Disk latency spiked to 50ms. Raft consensus started failing. The nodes started flapping between <code>Ready<\/code> and <code>NotReady<\/code> because the <code>kubelet<\/code> couldn&#8217;t update its heartbeat in time.<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\"># How I knew we were screwed\n$ kubectl get componentstatuses\nNAME                 STATUS      MESSAGE                                                                                                   ERROR\netcd-0               Unhealthy   {&quot;health&quot;:&quot;false&quot;,&quot;reason&quot;:&quot;remote error: tls: internal error&quot;}\n<\/code><\/pre>\n<p>When <code>etcd<\/code> suffers, the whole world burns. You need to monitor <code>etcd_disk_wal_fsync_duration_seconds_bucket<\/code>. If that p99 goes over 10ms, you are in the danger zone. We were at 150ms. The &#8220;kubernetes best&#8221; way to handle this is to isolate <code>etcd<\/code> on its own dedicated nodes with NVMe drives, but no, you wanted to save money by running it on the same general-purpose instances as your Jenkins runners.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_Fallacy_of_the_%E2%80%9CStateless%E2%80%9D_Application\"><\/span>The Fallacy of the &#8220;Stateless&#8221; Application<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><strong>What the tutorials tell you:<\/strong> &#8220;Kubernetes is for stateless apps! Just scale them up and down!&#8221;<\/p>\n<p><strong>The cold, hard reality of production:<\/strong> Nothing is stateless. Everything has state\u2014it\u2019s just a question of where you\u2019re hiding it. In our case, Dave\u2019s app was &#8220;stateless&#8221; but relied on a <code>PersistentVolume<\/code> (PV) for a legacy file-processing module. 
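<\/p>\n<p>The offending claim looked something like this (the name and size are my reconstruction; the detail that matters is <code>ReadWriteOnce<\/code>, which allows the volume to be mounted by a single node at a time):<\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\">apiVersion: v1\nkind: PersistentVolumeClaim\nmetadata:\n  name: legacy-file-cache   # illustrative name\n  namespace: prod-checkout\nspec:\n  accessModes:\n  - ReadWriteOnce   # one node at a time; a replacement pod on another node must wait for detach\n  resources:\n    requests:\n      storage: 20Gi\n<\/code><\/pre>\n<p>While the old node still holds the attachment, every replacement pod scheduled elsewhere sits in <code>ContainerCreating<\/code> until the detach completes. 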
<\/p>\n<p>When the pods started crashing, the <code>ReadWriteOnce<\/code> (RWO) volume was stuck in a &#8220;Multi-Attach Error.&#8221; The old pod was dead but the cloud provider hadn&#8217;t released the volume attachment yet. The new pod couldn&#8217;t start because it couldn&#8217;t mount the disk. <\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">$ kubectl describe pod checkout-api-7f8d9b6c5-4x2z1\nEvents:\n  Warning  FailedMount  3m  kubelet  Unable to attach or mount volumes: timed out waiting for the condition\n<\/code><\/pre>\n<p>We spent 45 minutes manually detaching AWS EBS volumes in the console because the <code>attachdetach-controller<\/code> in the <code>kube-controller-manager<\/code> was backed up due to the <code>etcd<\/code> latency issues. If you\u2019re going to use volumes, you need to understand the limitations of your CSI (Container Storage Interface) driver. You can&#8217;t just wish away the laws of physics and distributed systems.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"SecurityContext_is_Not_Optional\"><\/span>SecurityContext is Not Optional<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><strong>What the tutorials tell you:<\/strong> &#8220;Just run the container. It works on my machine!&#8221;<\/p>\n<p><strong>The cold, hard reality of production:<\/strong> Running as root is a death sentence. Dave\u2019s &#8220;minor change&#8221; used a base image that defaulted to the root user. When the pod was compromised\u2014or even just when it crashed\u2014it had permissions it didn&#8217;t need. <\/p>\n<p>In v1.29+, Pod Security Admissions are the standard. If you aren&#8217;t enforcing the <code>restricted<\/code> profile, you\u2019re failing. I found three pods running with <code>privileged: true<\/code> because someone wanted to &#8220;debug a network issue&#8221; six months ago and never changed it back. That\u2019s how you get container escapes. 
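<\/p>\n<p>Enforcing the <code>restricted<\/code> profile is one set of labels per namespace. A minimal sketch using the standard Pod Security Admission labels (the version pin is an example):<\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\">apiVersion: v1\nkind: Namespace\nmetadata:\n  name: prod-checkout\n  labels:\n    pod-security.kubernetes.io\/enforce: restricted\n    pod-security.kubernetes.io\/enforce-version: v1.30\n    # warn and audit let you stage the rollout before enforce bites\n    pod-security.kubernetes.io\/warn: restricted\n    pod-security.kubernetes.io\/audit: restricted\n<\/code><\/pre>\n<p>With that in place, a <code>privileged: true<\/code> spec is rejected at admission instead of being discovered six months later.<\/p>\n<p>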
That\u2019s how a &#8220;minor change&#8221; becomes a headline in the Wall Street Journal.<\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\">spec:\n  securityContext:\n    runAsNonRoot: true\n    runAsUser: 1000\n    fsGroup: 2000\n  containers:\n  - name: checkout-api\n    securityContext:\n      allowPrivilegeEscalation: false\n      capabilities:\n        drop:\n        - ALL\n      privileged: false\n      readOnlyRootFilesystem: true\n<\/code><\/pre>\n<p>If your containers don&#8217;t have a <code>readOnlyRootFilesystem<\/code>, you\u2019re inviting attackers to install their own toolsets. Dave\u2019s app tried to write a temp file to <code>\/app\/config<\/code> and failed because I finally locked it down. That\u2019s why it was crashing. He should have been using an <code>emptyDir<\/code> for temporary storage, but he was lazy.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_%E2%80%9CKubernetes_Best%E2%80%9D_Practices_are_Written_in_Blood\"><\/span>The &#8220;Kubernetes Best&#8221; Practices are Written in Blood<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>You don&#8217;t implement these things because they&#8217;re &#8220;best.&#8221; You implement them because the alternative is what I\u2019m doing right now: sitting in a cold server room, smelling like old coffee and regret, manually deleting <code>finalizers<\/code> from stuck <code>ConfigMaps<\/code>.<\/p>\n<p>The &#8220;kubernetes best&#8221; approach isn&#8217;t about using the newest features in v1.30; it\u2019s about respecting the complexity of the system. It\u2019s about realizing that Kubernetes is a platform for building platforms, not a place to dump your unoptimized Docker images and hope for the best.<\/p>\n<p>We\u2019ve spent the last three days dealing with &#8220;toil&#8221;\u2014manual, repetitive tasks that could have been avoided with proper automation and policy enforcement. 
We had &#8220;drift&#8221; between our staging and production environments because someone manually edited a deployment using <code>kubectl edit<\/code> instead of updating the Helm chart. <\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\"># Finding the drift that killed us\n$ kubectl get deployment checkout-api -o yaml &gt; live.yaml\n$ helm get manifest checkout-release &gt; helm.yaml\n$ diff live.yaml helm.yaml\n&lt;   image: company\/checkout:v1.2.3-debug-DONT-PUSH\n---\n&gt;   image: company\/checkout:v1.2.2\n<\/code><\/pre>\n<p>There it is. Dave pushed a debug image directly to the cluster. No CI\/CD pipeline check. No admission controller to stop images with &#8220;DONT-PUSH&#8221; in the tag. Just raw, unadulterated incompetence.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Checklist_for_the_Uninitiated\"><\/span>Checklist for the Uninitiated<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>If you want to avoid being the reason I\u2019m awake at 3 AM, you will follow this checklist. This isn&#8217;t a suggestion. It\u2019s a mandate.<\/p>\n<ol>\n<li><strong>QoS or GTFO:<\/strong> Every production pod must have <code>requests<\/code> and <code>limits<\/code> defined. Critical services must use the <code>Guaranteed<\/code> class (requests == limits).<\/li>\n<li><strong>Default Deny:<\/strong> Implement a <code>NetworkPolicy<\/code> that denies all traffic by default. Explicitly whitelist every single connection. If you don&#8217;t know what your app talks to, you don&#8217;t know your app.<\/li>\n<li><strong>Probes are for Health, Not Dependencies:<\/strong> Liveness probes check the process. Readiness probes check the ability to serve traffic. Never, ever point a liveness probe at a database.<\/li>\n<li><strong>No Root, Ever:<\/strong> Use <code>PodSecurityContext<\/code> to run as a non-root user. Set <code>allowPrivilegeEscalation: false<\/code>. 
If your app &#8220;needs&#8221; root, your app is broken.<\/li>\n<li><strong>Monitor etcd Like Your Life Depends On It:<\/strong> Because it does. Watch your <code>fsync<\/code> latency. Use dedicated, fast storage.<\/li>\n<li><strong>Immutable Infrastructure:<\/strong> If you use <code>kubectl edit<\/code> on a production resource, I will find you. Everything goes through Git. Use a tool like ArgoCD or Flux to detect and remediate drift automatically.<\/li>\n<li><strong>Pod Disruption Budgets (PDBs):<\/strong> If you\u2019re running more than one replica (and you should be), you need a PDB. This prevents the cluster autoscaler or a node upgrade from taking down all your pods at once.<\/li>\n<li><strong>TerminationGracePeriodSeconds:<\/strong> Give your app enough time to shut down gracefully. If your app takes 45 seconds to drain connections, don&#8217;t leave the default at 30.<\/li>\n<li><strong>Use Admission Controllers:<\/strong> Implement <code>ValidatingAdmissionWebhooks<\/code> to reject any manifest that doesn&#8217;t meet these standards. Don&#8217;t trust humans. Humans are the problem.<\/li>\n<li><strong>Version Pinning:<\/strong> Pin your images to a digest (SHA), not a tag. <code>v1.2.3<\/code> can be overwritten. A SHA is forever.<\/li>\n<\/ol>\n<p>The sun is coming up. The cluster is stable, mostly because I\u2019ve manually scaled the checkout service to zero to let <code>etcd<\/code> recover. Dave is going to have a very long meeting with me in two hours. <\/p>\n<p>Kubernetes is a powerful tool, but in the hands of the &#8220;standard&#8221; user, it\u2019s just a very expensive way to fail. Go back to basics. Fix your manifests. Stop bikeshedding about which service mesh to use and start worrying about your <code>cgroup<\/code> limits. 
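<\/p>\n<p>Checklist item 7, as a minimal sketch (the labels and the two-replica assumption are illustrative):<\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\">apiVersion: policy\/v1\nkind: PodDisruptionBudget\nmetadata:\n  name: checkout-api-pdb\n  namespace: prod-checkout\nspec:\n  minAvailable: 1   # with 2 replicas, a drain can evict at most one pod at a time\n  selector:\n    matchLabels:\n      app: checkout-api\n<\/code><\/pre>\n<p>A node upgrade now has to wait for the evicted pod to be rescheduled and ready before it can touch the next replica, instead of draining both at once. 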
<\/p>\n<p>Now, if you\u2019ll excuse me, I need to find a place to sleep that doesn&#8217;t vibrate at the frequency of a server rack.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Related_Articles\"><\/span>Related Articles<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Explore more insights and best practices:<\/p>\n<ul>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/fixed-nginx-showing-blank-php-pages-with-fastcgi-or-php-fpm\/\">Fixed Nginx Showing Blank Php Pages With Fastcgi Or Php Fpm<\/a><\/li>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/install-laravel-on-ubuntu-20-04-lts-with-apache2-and-php-7-4\/\">Install Laravel On Ubuntu 20 04 Lts With Apache2 And Php 7 4<\/a><\/li>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/install-php-in-ubuntu-18-04\/\">Install Php In Ubuntu 18 04<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>It\u2019s 03:14 AM. The pager is screaming, and the cluster is a graveyard of CrashLoopBackOffs. Here is exactly how we got here and why your &#8220;standard&#8221; setup is a liability. I\u2019m staring at a terminal window that looks like a crime scene. 
The junior dev\u2014let&#8217;s call him Dave, because it\u2019s always a Dave\u2014pushed a &#8220;minor&#8221; &#8230; <a title=\"Top Kubernetes Best Practices for Production Success\" class=\"read-more\" href=\"https:\/\/itsupportwale.com\/blog\/top-kubernetes-best-practices-for-production-success\/\" aria-label=\"Read more  on Top Kubernetes Best Practices for Production Success\">Read more<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-4743","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.0 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Top Kubernetes Best Practices for Production Success - ITSupportWale<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/itsupportwale.com\/blog\/top-kubernetes-best-practices-for-production-success\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Top Kubernetes Best Practices for Production Success - ITSupportWale\" \/>\n<meta property=\"og:description\" content=\"It\u2019s 03:14 AM. The pager is screaming, and the cluster is a graveyard of CrashLoopBackOffs. Here is exactly how we got here and why your &#8220;standard&#8221; setup is a liability. I\u2019m staring at a terminal window that looks like a crime scene. The junior dev\u2014let&#8217;s call him Dave, because it\u2019s always a Dave\u2014pushed a &#8220;minor&#8221; ... 
Read more\" \/>\n<meta property=\"og:url\" content=\"https:\/\/itsupportwale.com\/blog\/top-kubernetes-best-practices-for-production-success\/\" \/>\n<meta property=\"og:site_name\" content=\"ITSupportWale\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\" \/>\n<meta property=\"article:published_time\" content=\"2026-03-23T16:05:17+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2021\/05\/android-chrome-512x512-1.png\" \/>\n\t<meta property=\"og:image:width\" content=\"512\" \/>\n\t<meta property=\"og:image:height\" content=\"512\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Techie\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Techie\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"12 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/top-kubernetes-best-practices-for-production-success\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/top-kubernetes-best-practices-for-production-success\/\"},\"author\":{\"name\":\"Techie\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\"},\"headline\":\"Top Kubernetes Best Practices for Production 
Success\",\"datePublished\":\"2026-03-23T16:05:17+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/top-kubernetes-best-practices-for-production-success\/\"},\"wordCount\":1888,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/top-kubernetes-best-practices-for-production-success\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/top-kubernetes-best-practices-for-production-success\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/top-kubernetes-best-practices-for-production-success\/\",\"name\":\"Top Kubernetes Best Practices for Production Success - ITSupportWale\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\"},\"datePublished\":\"2026-03-23T16:05:17+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/top-kubernetes-best-practices-for-production-success\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/top-kubernetes-best-practices-for-production-success\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/top-kubernetes-best-practices-for-production-success\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/itsupportwale.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Top Kubernetes Best Practices for Production Success\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"name\":\"ITSupportWale\",\"description\":\"Tips, Tricks, Fixed-Errors, Tutorials &amp; 