{"id":4775,"date":"2026-04-30T21:56:46","date_gmt":"2026-04-30T16:26:46","guid":{"rendered":"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/"},"modified":"2026-04-30T21:56:46","modified_gmt":"2026-04-30T16:26:46","slug":"10-kubernetes-best-practices-for-production-success-2","status":"publish","type":"post","link":"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/","title":{"rendered":"10 Kubernetes Best Practices for Production Success"},"content":{"rendered":"<p><code>LAST SEEN   TYPE      REASON             OBJECT                                  MESSAGE<\/code><br \/>\n<code>12s         Warning   BackOff            pod\/api-gateway-7f8d9b-x2k              Back-off restarting failed container<\/code><br \/>\n<code>4s          Warning   Unhealthy          pod\/auth-svc-66c4d-99z                  Liveness probe failed: HTTP probe failed with statuscode: 503<\/code><br \/>\n<code>1s          Normal    Killing            pod\/payment-worker-88v                  Stopping container payment-worker<\/code><br \/>\n<code>0s          Warning   EvictionThreshold  node\/ip-10-0-42-101.ec2.internal        The node was low on resource: memory.<\/code><br \/>\n<code>[124892.12] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=\/,mems_allowed=0,oom_memcg=\/kubepods.slice\/kubepods-burstable.slice\/kubepods-burstable-pod-a1...<\/code><br \/>\n<code>[124892.15] Memory cgroup out of memory: Killed process 12409 (java) total-vm:12.4GB, anon-rss:2.1GB, file-rss:0B, shmem-rss:0B<\/code><br \/>\n<code>[124892.20] etcdserver: failed to send out heartbeat on time (deadline exceeded for 142.12ms)<\/code><\/p>\n<p>I haven&#8217;t slept in 72 hours. My eyes feel like they\u2019ve been scrubbed with industrial-grade sandpaper. The &#8220;DevOps Lead&#8221;\u2014a man who thinks YAML is a programming language and Docker is a &#8220;cloud OS&#8221;\u2014just asked if we can &#8220;reboot the internet.&#8221; I didn&#8217;t answer. I just stared at the flickering cursor on my terminal.<\/p>\n<p>We migrated to Kubernetes v1.29.2 last week. The &#8220;YAML engineers&#8221; promised it would be &#8220;efficient.&#8221; They lied. They treated the kernel like a black box and the control plane like a magic wand. They ignored every warning. Now, the cluster is a graveyard of crashed pods and timed-out heartbeats. This is how it happened. This is how you avoid it.<\/p>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_80 counter-hierarchy ez-toc-counter ez-toc-grey ez-toc-container-direction\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Table of Contents<\/p>\n<label for=\"ez-toc-cssicon-toggle-item-6a0e39119da24\" class=\"ez-toc-cssicon-toggle-label\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #999;color:#999\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewBox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #999;color:#999\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewBox=\"0 0 24 24\" version=\"1.2\" baseProfile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/label><input type=\"checkbox\"  id=\"ez-toc-cssicon-toggle-item-6a0e39119da24\"  aria-label=\"Toggle\" \/><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/#The_Resource_Limit_Lie\" >The Resource Limit Lie<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/#Why_Your_Liveness_Probes_are_Killing_the_Database\" >Why Your Liveness Probes are Killing the Database<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/#Pod_Disruption_Budgets_The_Self-Inflicted_Denial_of_Service\" >Pod Disruption Budgets: The Self-Inflicted Denial of Service<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/#The_%E2%80%98Latest_Tag_Suicide_Pact\" >The &#8216;Latest&#8217; Tag Suicide Pact<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/#Etcd_and_the_Disk_Latency_Guillotine\" >Etcd and the Disk Latency Guillotine<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/#Admission_Webhooks_The_Silent_Killer\" >Admission Webhooks: The Silent Killer<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/#The_Kubelet_Eviction_Death_Spiral\" >The Kubelet Eviction Death Spiral<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/#Networking_Cilium_and_the_BPF_Map_Overflow\" >Networking: Cilium and the BPF Map Overflow<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/#The_Aftermath\" >The Aftermath<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-10\" href=\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/#Related_Articles\" >Related Articles<\/a><\/li><\/ul><\/nav><\/div>\n<h2><span class=\"ez-toc-section\" id=\"The_Resource_Limit_Lie\"><\/span>The Resource Limit Lie<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The first domino fell because of a misunderstanding of how the Linux kernel handles memory. Our &#8220;architects&#8221; decided that setting <code>limits<\/code> without <code>requests<\/code> was a great way to &#8220;save money.&#8221; <\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\"># THE CRIME\napiVersion: v1\nkind: Pod\nmetadata:\n  name: memory-hog\nspec:\n  containers:\n  - name: app\n    image: our-shitty-app:latest\n    resources:\n      limits:\n        memory: &quot;2Gi&quot;\n        cpu: &quot;1&quot;\n      # No requests. The scheduler is flying blind.\n<\/code><\/pre>\n<p>When you omit <code>requests<\/code>, Kubernetes defaults them to match your <code>limits<\/code>. This sounds fine until you realize the scheduler now thinks every node has infinite capacity because it isn&#8217;t accounting for the actual baseline usage. We hit v1.29.2, which utilizes Cgroup v2 more aggressively. The Kubelet\u2019s <code>NodeAllocatable<\/code> was ignored. The scheduler packed 40 of these pods onto a node that only had 64GB of RAM. <\/p>\n<p>The math doesn&#8217;t work. The kernel OOM killer doesn&#8217;t care about your &#8220;cloud-native&#8221; dreams. When the node hit 95% utilization, the Kubelet tried to evict pods. But since these were all &#8220;Burstable&#8221; QoS class pods (because requests matched limits, but the actual usage was spiking), the kernel started reaping processes at random. <\/p>\n<p>The <strong>kubernetes best<\/strong> practice is simple: Always set <code>requests<\/code> equal to <code>limits<\/code> for production workloads. This puts the pod in the &#8220;Guaranteed&#8221; QoS class. The <code>oom_score_adj<\/code> for a Guaranteed pod is -997. It is the last thing the kernel kills. If you don&#8217;t do this, you are telling the kernel that your application is disposable. <\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\"># THE FIX\n    resources:\n      requests:\n        memory: &quot;2Gi&quot;\n        cpu: &quot;1&quot;\n      limits:\n        memory: &quot;2Gi&quot;\n        cpu: &quot;1&quot;\n<\/code><\/pre>\n<p>The Kubelet in v1.29.2 is more sensitive to memory pressure. If you don&#8217;t define your requests, the <code>topologyManager<\/code> can&#8217;t align your CPU and memory resources correctly. You get context switching. You get latency. You get fired.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Why_Your_Liveness_Probes_are_Killing_the_Database\"><\/span>Why Your Liveness Probes are Killing the Database<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>While the nodes were screaming, the &#8220;YAML engineers&#8221; made it worse. They configured liveness probes that queried the database. <\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\"># THE SUICIDE NOTE\nlivenessProbe:\n  httpGet:\n    path: \/health-check-that-queries-db\n    port: 8080\n  initialDelaySeconds: 5\n  periodSeconds: 5\n<\/code><\/pre>\n<p>The database was struggling because of the connection churn from the OOM-looping pods. It slowed down. The liveness probe took 2 seconds instead of 200ms. The Kubelet, being a dutiful soldier, decided the pod was dead. It killed the pod. It started a new one. The new pod immediately tried to connect to the database to run its &#8220;startup logic.&#8221; <\/p>\n<p>Multiply this by 500 pods. You just launched a self-inflicted DDoS attack against your own RDS instance. The database CPU hit 100%. Every single pod in the cluster failed its liveness probe simultaneously. <\/p>\n<p>The <strong>kubernetes best<\/strong> approach is to decouple your probes. A liveness probe should only check if the process is alive. It should check a local, in-memory flag. It should never, under any circumstances, cross a network boundary. Use <code>startupProbes<\/code> for long-running migrations and <code>readinessProbes<\/code> to gate traffic. If the DB is down, the pod shouldn&#8217;t be killed; it should just stop receiving traffic.<\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\"># THE STABLE CONFIG\nstartupProbe:\n  httpGet:\n    path: \/ready\n    port: 8080\n  failureThreshold: 30\n  periodSeconds: 10\nlivenessProbe:\n  httpGet:\n    path: \/live # Just returns 200 OK from memory\n    port: 8080\n  periodSeconds: 20\n<\/code><\/pre>\n<p>In v1.29, the Kubelet has improved internal locking for probes. But if your probe logic is garbage, the Kubelet just executes garbage faster.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Pod_Disruption_Budgets_The_Self-Inflicted_Denial_of_Service\"><\/span>Pod Disruption Budgets: The Self-Inflicted Denial of Service<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>During the height of the outage, I tried to drain a failing node. <\/p>\n<p><code>kubectl drain ip-10-0-42-101.ec2.internal --ignore-daemonsets --delete-emptydir-data<\/code><\/p>\n<p>The command hung. <\/p>\n<p><code>evicting pod \"auth-svc-66c4d-99z\"<\/code><br \/>\n<code>error when evicting pods\/\"auth-svc-66c4d-99z\" (assigned to node \"ip-10-0-42-101.ec2.internal\"): Cannot evict pod as it would violate the pod's disruption budget.<\/code><\/p>\n<p>Some genius had set a <code>PodDisruptionBudget<\/code> with <code>minAvailable: 100%<\/code>. <\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\">apiVersion: policy\/v1\nkind: PodDisruptionBudget\nmetadata:\n  name: auth-pdb\nspec:\n  minAvailable: 100% # This is a hostage situation\n  selector:\n    matchLabels:\n      app: auth-svc\n<\/code><\/pre>\n<p>The app had 3 replicas. One was already down due to the OOM issues. The PDB prevented the eviction of the remaining two. I couldn&#8217;t drain the node. I couldn&#8217;t patch the underlying kernel issue. I was locked out of my own infrastructure by a YAML file.<\/p>\n<p>A <strong>kubernetes best<\/strong> practice is to use <code>maxUnavailable: 1<\/code>. This ensures that at least one pod can always be evicted, allowing the cluster to self-heal and nodes to be rotated. If your app can&#8217;t handle one instance being down, your app isn&#8217;t &#8220;cloud-native&#8221;; it&#8217;s a legacy monolith in a trench coat.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_%E2%80%98Latest_Tag_Suicide_Pact\"><\/span>The &#8216;Latest&#8217; Tag Suicide Pact<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>As I fought to stabilize the API server, a new set of errors appeared in the logs. <\/p>\n<p><code>ErrImagePull: rpc error: code = Unknown desc = failed to pull and unpack image \"our-repo\/payment-worker:latest\": failed to resolve reference \"our-repo\/payment-worker:latest\": pull access denied<\/code><\/p>\n<p>The CI\/CD pipeline had just pushed a broken image to the <code>latest<\/code> tag. Because the pods were crashing and restarting, they pulled the &#8220;latest&#8221; version. Half the cluster was running v1.0.4, and the other half was trying to run a broken v1.1.0 that someone had pushed five minutes ago.<\/p>\n<p>Never use <code>latest<\/code>. It is non-deterministic. It is the antithesis of immutable infrastructure. In v1.29.2, the container runtime (containerd) is more efficient at caching, but that doesn&#8217;t help you when the tag itself is a moving target. <\/p>\n<p>Use SHA-256 digests or at least specific semantic versions. <\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\"># THE ONLY WAY\nimage: our-repo\/payment-worker:v1.0.4@sha256:7f8d9b...\n<\/code><\/pre>\n<p>If we had used immutable tags, the crashing pods would have at least restarted with the previously working code. Instead, we were debugging a production outage and a broken deployment simultaneously.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Etcd_and_the_Disk_Latency_Guillotine\"><\/span>Etcd and the Disk Latency Guillotine<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The control plane finally died at 3:00 AM. <code>kubectl<\/code> commands returned <code>Error from server (Timeout)<\/code>. <\/p>\n<p>I checked the etcd logs. <code>wal: sync duration of 500ms, expected less than 100ms<\/code>. <\/p>\n<p>The &#8220;YAML engineers&#8221; had provisioned the control plane nodes with standard GP3 EBS volumes but didn&#8217;t bother to check the IOPS. They also decided to run a logging agent on the same nodes that was writing 50GB of JSON logs to the same disk as the etcd WAL (Write Ahead Log).<\/p>\n<p>Etcd is the brain of Kubernetes. It is extremely sensitive to disk latency. When the WAL write takes too long, etcd misses its heartbeat. The cluster loses its leader. It triggers an election. During the election, the API server is read-only or completely unresponsive. <\/p>\n<p>In v1.29, the API server&#8217;s <code>Priority-and-Fairness<\/code> (APF) settings are more robust, but they can&#8217;t save you if the underlying data store is stuck in an I\/O wait state. <\/p>\n<p>I had to manually SSH into the master nodes, kill the logging agent, and move the etcd data directory to a dedicated NVMe drive. <\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\"># Emergency surgery\nsystemctl stop etcd\nmount \/dev\/nvme1n1 \/var\/lib\/etcd\nrsync -av \/var\/lib\/etcd_old\/ \/var\/lib\/etcd\/\nsystemctl start etcd\n<\/code><\/pre>\n<p>If you are running your own control plane, etcd must have its own dedicated, low-latency disk. This isn&#8217;t a suggestion. It is a requirement for survival.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Admission_Webhooks_The_Silent_Killer\"><\/span>Admission Webhooks: The Silent Killer<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The final blow came from a &#8220;Security Admission Controller&#8221; that a third-party vendor had installed. It was a mutating webhook. Every time a pod was created, the API server sent a JSON payload to this webhook to &#8220;validate&#8221; it.<\/p>\n<p>The webhook was running <em>inside<\/em> the cluster. <\/p>\n<p>The cluster was failing. The webhook pods were OOMKilled. The API server was configured with <code>failurePolicy: Fail<\/code>. <\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\"># THE TRAP\napiVersion: admissionregistration.k8s.io\/v1\nkind: ValidatingWebhookConfiguration\nwebhooks:\n  - name: security-check.example.com\n    failurePolicy: Fail # The cluster will now die if this pod dies\n    rules:\n      - operations: [&quot;CREATE&quot;]\n        apiGroups: [&quot;&quot;]\n        apiVersions: [&quot;v1&quot;]\n        resources: [&quot;pods&quot;]\n<\/code><\/pre>\n<p>Since the webhook was down, the API server refused to create <em>any<\/em> new pods. I couldn&#8217;t even deploy a fix because the API server would try to validate the fix against the dead webhook and fail. <\/p>\n<p>I had to manually edit the <code>ValidatingWebhookConfiguration<\/code> to set the <code>failurePolicy<\/code> to <code>Ignore<\/code> just to get the cluster to breathe again. <\/p>\n<p>Admission webhooks must have a timeout (v1.29 defaults to 10 seconds, which is still too long) and a sensible failure policy. If the webhook isn&#8217;t critical for life-safety, set it to <code>Ignore<\/code>. If it is critical, run it outside the cluster or in a highly available, dedicated pool that doesn&#8217;t share resources with the apps it&#8217;s validating.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_Kubelet_Eviction_Death_Spiral\"><\/span>The Kubelet Eviction Death Spiral<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>By hour 60, I was seeing nodes marked as <code>NotReady<\/code>. <\/p>\n<p><code>kubectl describe node<\/code> showed <code>MemoryPressure<\/code> was true. <\/p>\n<p>The Kubelet has a set of hard eviction thresholds. By default, it&#8217;s often <code>memory.available &lt; 100Mi<\/code>. In v1.29.2, the Kubelet&#8217;s interaction with <code>systemd-oomd<\/code> can be complex. If you haven&#8217;t configured your <code>kubelet<\/code> flags correctly, it will start killing critical system processes before it kills the rogue Java app that&#8217;s actually eating the memory.<\/p>\n<p>We hadn&#8217;t set <code>--eviction-hard<\/code>. We hadn&#8217;t set <code>--system-reserved<\/code> or <code>--kube-reserved<\/code>. <\/p>\n<p>The Kubelet was fighting the OS for the last 500MB of RAM. The OS won. The Kubelet was killed. The node went <code>NotReady<\/code>. The scheduler saw the node was gone and tried to move all its pods to the <em>other<\/em> nodes, which were already at 90% capacity. <\/p>\n<p>This is the &#8220;Thundering Herd&#8221; of Kubernetes. One node dies, its load kills the next node, and so on, until you have a data center full of expensive heaters that don&#8217;t do any work.<\/p>\n<p>You must reserve resources for the system.<\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\"># kubelet-config.yaml\napiVersion: kubelet.config.k8s.io\/v1beta1\nkind: KubeletConfiguration\nsystemReserved:\n  cpu: &quot;500m&quot;\n  memory: &quot;1Gi&quot;\nkubeReserved:\n  cpu: &quot;500m&quot;\n  memory: &quot;1Gi&quot;\nevictionHard:\n  memory.available: &quot;500Mi&quot;\n  nodefs.available: &quot;10%&quot;\n<\/code><\/pre>\n<p>This ensures the Kubelet and the underlying Linux kernel have enough breathing room to actually perform the evictions and housekeeping necessary to keep the node alive.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Networking_Cilium_and_the_BPF_Map_Overflow\"><\/span>Networking: Cilium and the BPF Map Overflow<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>We use Cilium for CNI. It&#8217;s powerful. It&#8217;s also complex. The &#8220;YAML engineers&#8221; had created a <code>NetworkPolicy<\/code> for every single microservice, which is good. But they used fine-grained L7 rules (HTTP path filtering) for everything.<\/p>\n<p>In a cluster with 10,000+ pods, this exploded the BPF maps. <\/p>\n<p><code>level=warning msg=\"BPF map is full\" subsystem=bpf-map-manager<\/code><\/p>\n<p>The networking stack started dropping packets. Not all packets\u2014just some. The most frustrating kind of failure. A 5% packet loss that looks like application latency. <\/p>\n<p>The &#8220;YAML engineers&#8221; spent 12 hours debugging their code, blaming &#8220;slow Python libraries.&#8221; It wasn&#8217;t the code. It was the fact that the kernel&#8217;s BPF maps were overflowing because we had too many complex policies for the allocated map sizes.<\/p>\n<p>We had to tune the Cilium configuration to increase the <code>bpf-map-dynamic-size-ratio<\/code>. <\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\"># cilium-config\nbpf-map-dynamic-size-ratio: 0.0055\n<\/code><\/pre>\n<p>And we had to tell the engineers that they don&#8217;t need L7 filtering for a service that only talks to one other service on a single port. Use L4 policies where possible. It&#8217;s faster, it&#8217;s simpler, and it doesn&#8217;t break the kernel.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_Aftermath\"><\/span>The Aftermath<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The cluster is back. I&#8217;ve deleted the <code>latest<\/code> tags. I&#8217;ve set resource requests and limits. I&#8217;ve fixed the PDBs. I&#8217;ve moved etcd to its own disks. <\/p>\n<p>The &#8220;YAML engineers&#8221; are complaining that I&#8217;ve &#8220;restricted their creativity&#8221; by implementing these &#8220;arbitrary rules.&#8221; <\/p>\n<p>They don&#8217;t understand that the kernel isn&#8217;t arbitrary. The kernel is a cold, hard reality. Kubernetes is just a way to organize that reality, but if you ignore the fundamentals\u2014memory, CPU, I\/O, and networking\u2014it will fail you.<\/p>\n<p>This post-mortem isn&#8217;t just a record of a failure. It&#8217;s a warning. v1.29.2 is a powerful tool, but it requires respect. If you treat it like a toy, it will burn your house down. <\/p>\n<p>I&#8217;m going to sleep now. If the pager goes off because someone changed a <code>failurePolicy<\/code> back to <code>Fail<\/code>, I&#8217;m not answering. I&#8217;m deleting their namespace. That&#8217;s my new &#8220;kubernetes best&#8221; practice. <\/p>\n<p>Total uptime: 0.01% this week.<br \/>\nTotal coffee consumed: 42 liters.<br \/>\nTotal respect for &#8220;YAML engineers&#8221;: 0.<\/p>\n<p>Go fix your manifests before I do it for you. With <code>kubectl delete<\/code>.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Related_Articles\"><\/span>Related Articles<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Explore more insights and best practices:<\/p>\n<ul>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/centos-8-installation-with-screenshots\/\">Centos 8 Installation With Screenshots<\/a><\/li>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/kali-linux-2020-1-released-new-features-and-download\/\">Kali Linux 2020 1 Released New Features And Download<\/a><\/li>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/how-to-install-asterisk-16-on-ubuntu-18-04-lts\/\">How To Install Asterisk 16 On Ubuntu 18 04 Lts<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>LAST SEEN TYPE REASON OBJECT MESSAGE 12s Warning BackOff pod\/api-gateway-7f8d9b-x2k Back-off restarting failed container 4s Warning Unhealthy pod\/auth-svc-66c4d-99z Liveness probe failed: HTTP probe failed with statuscode: 503 1s Normal Killing pod\/payment-worker-88v Stopping container payment-worker 0s Warning EvictionThreshold node\/ip-10-0-42-101.ec2.internal The node was low on resource: memory. [124892.12] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=\/,mems_allowed=0,oom_memcg=\/kubepods.slice\/kubepods-burstable.slice\/kubepods-burstable-pod-a1&#8230; [124892.15] Memory cgroup out of memory: Killed process &#8230; <a title=\"10 Kubernetes Best Practices for Production Success\" class=\"read-more\" href=\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/\" aria-label=\"Read more  on 10 Kubernetes Best Practices for Production Success\">Read more<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-4775","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.0 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>10 Kubernetes Best Practices for Production Success - ITSupportWale<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"10 Kubernetes Best Practices for Production Success - ITSupportWale\" \/>\n<meta property=\"og:description\" content=\"LAST SEEN TYPE REASON OBJECT MESSAGE 12s Warning BackOff pod\/api-gateway-7f8d9b-x2k Back-off restarting failed container 4s Warning Unhealthy pod\/auth-svc-66c4d-99z Liveness probe failed: HTTP probe failed with statuscode: 503 1s Normal Killing pod\/payment-worker-88v Stopping container payment-worker 0s Warning EvictionThreshold node\/ip-10-0-42-101.ec2.internal The node was low on resource: memory. [124892.12] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=\/,mems_allowed=0,oom_memcg=\/kubepods.slice\/kubepods-burstable.slice\/kubepods-burstable-pod-a1... [124892.15] Memory cgroup out of memory: Killed process ... Read more\" \/>\n<meta property=\"og:url\" content=\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/\" \/>\n<meta property=\"og:site_name\" content=\"ITSupportWale\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\" \/>\n<meta property=\"article:published_time\" content=\"2026-04-30T16:26:46+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2021\/05\/android-chrome-512x512-1.png\" \/>\n\t<meta property=\"og:image:width\" content=\"512\" \/>\n\t<meta property=\"og:image:height\" content=\"512\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Techie\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Techie\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"11 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/\"},\"author\":{\"name\":\"Techie\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\"},\"headline\":\"10 Kubernetes Best Practices for Production Success\",\"datePublished\":\"2026-04-30T16:26:46+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/\"},\"wordCount\":1790,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/\",\"name\":\"10 Kubernetes Best Practices for Production Success - ITSupportWale\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\"},\"datePublished\":\"2026-04-30T16:26:46+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/itsupportwale.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"10 Kubernetes Best Practices for Production Success\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"name\":\"ITSupportWale\",\"description\":\"Tips, Tricks, Fixed-Errors, Tutorials &amp; Guides\",\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/itsupportwale.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\",\"name\":\"itsupportwale\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png\",\"contentUrl\":\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png\",\"width\":1119,\"height\":144,\"caption\":\"itsupportwale\"},\"image\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\",\"name\":\"Techie\",\"sameAs\":[\"https:\/\/itsupportwale.com\",\"iswblogadmin\"],\"url\":\"https:\/\/itsupportwale.com\/blog\/author\/iswblogadmin\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"10 Kubernetes Best Practices for Production Success - ITSupportWale","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/","og_locale":"en_US","og_type":"article","og_title":"10 Kubernetes Best Practices for Production Success - ITSupportWale","og_description":"LAST SEEN TYPE REASON OBJECT MESSAGE 12s Warning BackOff pod\/api-gateway-7f8d9b-x2k Back-off restarting failed container 4s Warning Unhealthy pod\/auth-svc-66c4d-99z Liveness probe failed: HTTP probe failed with statuscode: 503 1s Normal Killing pod\/payment-worker-88v Stopping container payment-worker 0s Warning EvictionThreshold node\/ip-10-0-42-101.ec2.internal The node was low on resource: memory. [124892.12] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=\/,mems_allowed=0,oom_memcg=\/kubepods.slice\/kubepods-burstable.slice\/kubepods-burstable-pod-a1... [124892.15] Memory cgroup out of memory: Killed process ... Read more","og_url":"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/","og_site_name":"ITSupportWale","article_publisher":"https:\/\/www.facebook.com\/Itsupportwale-298547177495978","article_published_time":"2026-04-30T16:26:46+00:00","og_image":[{"width":512,"height":512,"url":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2021\/05\/android-chrome-512x512-1.png","type":"image\/png"}],"author":"Techie","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Techie","Est. reading time":"11 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/#article","isPartOf":{"@id":"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/"},"author":{"name":"Techie","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d"},"headline":"10 Kubernetes Best Practices for Production Success","datePublished":"2026-04-30T16:26:46+00:00","mainEntityOfPage":{"@id":"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/"},"wordCount":1790,"commentCount":0,"publisher":{"@id":"https:\/\/itsupportwale.com\/blog\/#organization"},"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/","url":"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/","name":"10 Kubernetes Best Practices for Production Success - ITSupportWale","isPartOf":{"@id":"https:\/\/itsupportwale.com\/blog\/#website"},"datePublished":"2026-04-30T16:26:46+00:00","breadcrumb":{"@id":"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success-2\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/itsupportwale.com\/blog\/"},{"@type":"ListItem","position":2,"name":"10 Kubernetes Best Practices for Production Success"}]},{"@type":"WebSite","@id":"https:\/\/itsupportwale.com\/blog\/#website","url":"https:\/\/itsupportwale.com\/blog\/","name":"ITSupportWale","description":"Tips, Tricks, Fixed-Errors, Tutorials &amp; Guides","publisher":{"@id":"https:\/\/itsupportwale.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/itsupportwale.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/itsupportwale.com\/blog\/#organization","name":"itsupportwale","url":"https:\/\/itsupportwale.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png","contentUrl":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png","width":1119,"height":144,"caption":"itsupportwale"},"image":{"@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/Itsupportwale-298547177495978"]},{"@type":"Person","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d","name":"Techie","sameAs":["https:\/\/itsupportwale.com","iswblogadmin"],"url":"https:\/\/itsupportwale.com\/blog\/author\/iswblogadmin\/"}]}},"_links":{"self":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts\/4775","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/comments?post=4775"}],"version-history":[{"count":0,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts\/4775\/revisions"}],"wp:attachment":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/media?parent=4775"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/categories?post=4775"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/tags?post=4775"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}