{"id":4485,"date":"2026-01-29T21:15:15","date_gmt":"2026-01-29T15:45:15","guid":{"rendered":"https:\/\/www.itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success\/"},"modified":"2026-02-17T15:55:51","modified_gmt":"2026-02-17T10:25:51","slug":"10-kubernetes-best-practices-for-production-success","status":"publish","type":"post","link":"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success\/","title":{"rendered":"10 Kubernetes Best Practices for Production Success"},"content":{"rendered":"<p><strong>INCIDENT REPORT: POST-MORTEM #882-B (THE &#8220;FRIDAY AFTERNOON ARCHITECT SPECIAL&#8221;)<\/strong><br \/>\n<strong>TIMESTAMP:<\/strong> 2024-05-17T03:04:12Z<br \/>\n<strong>INITIAL ALERT:<\/strong> <code>CRITICAL - PagerDuty - Service: checkout-api - Severity: 1 - Status: FAILED<\/code><br \/>\n<strong>FIRST LOG ENTRY:<\/strong><br \/>\n<code>kubelet[1024]: E0517 03:04:12.442103 1024 remote_runtime.go:116] \"RunPodSandbox from runtime service failed\" err=\"rpc error: code = Unknown desc = failed to setup network for sandbox: failed to allocate for range 0: no IP addresses available in range set: 10.244.2.0-10.244.2.255\"<\/code><\/p>\n<hr \/>\n<h3><span class=\"ez-toc-section\" id=\"1_The_3_AM_Meltdown_Why_Everything_Broke_at_Once\"><\/span>1. The 3 AM Meltdown: Why Everything Broke at Once<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>I was finally in a REM cycle when the pager started screaming. Not the &#8220;disk is at 80%&#8221; chirp. The &#8220;your entire regional cluster is a smoking crater&#8221; siren. I opened my laptop, eyes burning like I\u2019d stared into a solar flare, and saw the Slack channel already filled with &#8220;Architects&#8221; asking if we\u2019d &#8220;tried restarting the pods.&#8221; <\/p>\n<p>Restarting the pods. Brilliant. Why didn&#8217;t I think of that? Oh, wait, I did, and the <code>kube-scheduler<\/code> is currently having a nervous breakdown because someone decided to push a &#8220;minor optimization&#8221; to our Kubernetes 1.29 control plane at 4:45 PM on a Friday.<\/p>\n<p>The failure wasn&#8217;t a single point. It was a cascading failure of epic proportions. 
It started with the <code>checkout-api<\/code> deployment. Our &#8220;Architectural Lead&#8221;\u2014who I\u2019m convinced hasn\u2019t touched a CLI since the Obama administration\u2014decided that we needed to &#8220;maximize density.&#8221; They stripped out all the resource requests and limits because &#8220;Kubernetes is smart enough to figure it out.&#8221; <\/p>\n<p>Narrator: It was not smart enough to figure it out.<\/p>\n<p>By 03:00, the nodes were so oversubscribed that the Linux kernel started executing processes like it was the French Revolution. The OOM Killer was the only thing working at full capacity. Because there were no resource boundaries, a single memory leak in a Java sidecar (don&#8217;t ask why there&#8217;s a Java sidecar in a Go service) ballooned until it choked the <code>kubelet<\/code>. When the <code>kubelet<\/code> died, the node went <code>NotReady<\/code>. When the node went <code>NotReady<\/code>, the scheduler tried to move 400 pods to the two remaining workers. <\/p>\n<p>You can guess what happened next. It was a digital suicide pact.<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">$ kubectl get nodes\nNAME             STATUS     ROLES           AGE   VERSION\nip-10-0-1-12     NotReady   worker          45d   v1.29.2\nip-10-0-1-13     NotReady   worker          45d   v1.29.2\nip-10-0-1-14     NotReady   worker          45d   v1.29.2\nip-10-0-1-15     Ready      control-plane   45d   v1.29.2\n\n$ kubectl describe node ip-10-0-1-12 | grep -A 5 Events\nEvents:\n  Type     Reason                   Age                From     Message\n  ----     ------                   ----               ----     -------\n  Warning  EvictionThresholdMet     14m                kubelet  Attempting to reclaim memory\n  Normal   NodeHasInsufficientMemory 14m (x20 over 2h)  kubelet  Node ip-10-0-1-12 status is now: NodeHasInsufficientMemory\n<\/code><\/pre>\n<p>The &#8220;Architects&#8221; wanted density. They got it. 
They got 100% density of failure.<\/p>\n<hr \/>\n<h3><span class=\"ez-toc-section\" id=\"2_Resource_Limits_The_Lie_We_Tell_Ourselves\"><\/span>2. Resource Limits: The Lie We Tell Ourselves<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Let\u2019s talk about the YAML I found in the <code>checkout-api<\/code> manifest. It\u2019s a work of art if your medium is &#8220;pure incompetence.&#8221; In Kubernetes 1.29, we have sophisticated cgroup v2 support, better memory pressure handling, and what do we do? We leave the <code>resources<\/code> block empty.<\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\"># The &quot;Architect&quot; special\napiVersion: apps\/v1\nkind: Deployment\nmetadata:\n  name: checkout-api\nspec:\n  template:\n    spec:\n      containers:\n      - name: app\n        image: checkout:latest # Because who needs versioning?\n        resources: {} # &quot;Let the cluster decide,&quot; they said.\n<\/code><\/pre>\n<p>When you define neither <code>requests<\/code> nor <code>limits<\/code>, the pod lands in the <code>BestEffort<\/code> QoS class. In the hierarchy of &#8220;Who gets killed first when the node is sweating,&#8221; <code>BestEffort<\/code> is the first against the wall. But it\u2019s worse than that. Without <code>limits<\/code>, the container will try to consume every byte of RAM on the host. <\/p>\n<p>I spent four hours watching <code>dmesg<\/code> logs on a dying node. With no limits set, there were no CFS (Completely Fair Scheduler) quotas to contain anyone, so every container fought for the same oversubscribed cores, and raw CPU contention dragged the application\u2019s response time from 20ms to 15,000ms. 
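<\/p>\n<p>For contrast, here is roughly what that <code>resources<\/code> block should have contained. This is a sketch with illustrative numbers and a hypothetical pinned image tag; profile your own workload before copying anything:<\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\"># A sane resources block (values are illustrative, not a recommendation)\ncontainers:\n- name: app\n  image: checkout:1.4.2 # hypothetical pinned tag instead of :latest\n  resources:\n    requests:\n      cpu: &quot;500m&quot;\n      memory: &quot;512Mi&quot;\n    limits:\n      cpu: &quot;500m&quot;    # requests == limits\n      memory: &quot;512Mi&quot; # a hard ceiling the kernel will actually enforce\n<\/code><\/pre>\n<p>When every container in the pod sets <code>requests<\/code> equal to <code>limits<\/code> like this, the pod lands in the <strong>Guaranteed<\/strong> QoS class: the scheduler only places it where the capacity actually exists, and it is the last candidate for eviction when the node comes under memory pressure.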
<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\"># Checking the carnage on the node\n$ journalctl -u kubelet --since &quot;1 hour ago&quot; | grep &quot;OOM&quot;\nMay 17 03:15:22 node-1 kubelet[1024]: Task 'app' (pid 12345) killed due to OOMKill.\nMay 17 03:15:24 node-1 kubelet[1024]: Task 'sidecar' (pid 12360) killed due to OOMKill.\n<\/code><\/pre>\n<p>If you want to survive production load, you follow the <strong>Kubernetes best practice<\/strong> here: <strong>Guaranteed QoS<\/strong>. You set <code>requests<\/code> equal to <code>limits<\/code>. This tells the scheduler, &#8220;Do not put this pod on a node unless you can actually give me this memory.&#8221; It prevents the &#8220;noisy neighbor&#8221; syndrome where a dev-test pod starts eating the production database&#8217;s memory. But no, that would be &#8220;too expensive.&#8221; You know what&#8217;s expensive? Being down for 48 hours while I manually prune dead containers from the containerd shim.<\/p>\n<hr \/>\n<h3><span class=\"ez-toc-section\" id=\"3_Probes_are_Not_Optional_Liveness_vs_Readiness_vs_Reality\"><\/span>3. Probes are Not Optional: Liveness vs. Readiness vs. Reality<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>While the nodes were burning, the Load Balancer was still sending traffic to pods that were in a <code>CrashLoopBackOff<\/code>. Why? Because the &#8220;Architects&#8221; thought <code>livenessProbes<\/code> and <code>readinessProbes<\/code> were &#8220;boilerplate fluff.&#8221;<\/p>\n<p>Here is what I found in the production manifest:<\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\"># Actual snippet from the wreckage\nreadinessProbe:\n  httpGet:\n    path: \/health\n    port: 8080\n  initialDelaySeconds: 0\n  periodSeconds: 1\n<\/code><\/pre>\n<p>They set <code>initialDelaySeconds<\/code> to 0 for a Java application that takes 90 seconds to warm up its JVM and connect to the connection pool. The result? 
The <code>kubelet<\/code> started hitting the <code>\/health<\/code> endpoint before the app was even finished loading its classes. The app didn&#8217;t respond, so the <code>kubelet<\/code> marked it as unready. Then, because they used the same logic for a <code>livenessProbe<\/code>, the <code>kubelet<\/code> killed the container and restarted it. <\/p>\n<p>It was an infinite loop of death. The app would start, get poked by the <code>kubelet<\/code>, fail to respond instantly, get killed, and start over. <\/p>\n<p>I had to explain\u2014for the tenth time this year\u2014that a <code>readinessProbe<\/code> tells the Service &#8220;don&#8217;t send me traffic yet,&#8221; while a <code>livenessProbe<\/code> tells the <code>kubelet<\/code> &#8220;kill me, I&#8217;m stuck.&#8221; If you point them both at the same endpoint on a 1-second period, you are essentially DDoSing your own startup routine. <\/p>\n<p>In Kubernetes 1.29, we have <code>startupProbes<\/code>. Use them. They are designed for this exact scenario. They give the app time to breathe before the liveness probe starts swinging its axe. But that would require reading the documentation, which apparently isn&#8217;t as fun as writing &#8220;Thought Leadership&#8221; posts on LinkedIn.<\/p>\n<hr \/>\n<h3><span class=\"ez-toc-section\" id=\"4_Security_Contexts_Why_Are_You_Still_Running_as_Root\"><\/span>4. Security Contexts: Why Are You Still Running as Root?<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>As I was digging through the <code>kubectl describe pod<\/code> output to figure out why the CNI was failing, I noticed something even more horrifying. Every single pod was running with <code>privileged: true<\/code> or, at the very least, as the <code>root<\/code> user.<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">$ kubectl get pod checkout-api-6f789-abcde -o jsonpath='{.spec.containers[0].securityContext}'\n{&quot;privileged&quot;:true,&quot;runAsUser&quot;:0}\n<\/code><\/pre>\n<p>Why? 
&#8220;Because the app needs to write to a log file in <code>\/var\/log<\/code>.&#8221; <\/p>\n<p>I want to scream. It\u2019s 2024. We have <code>emptyDir<\/code> volumes. We have <code>fluentd<\/code> for log aggregation. There is zero reason\u2014NONE\u2014for a checkout service to have root access to the host&#8217;s kernel. Because these pods were privileged, the damage from the memory leak wasn&#8217;t contained within their own cgroup; they were able to interfere with host-level processes. <\/p>\n<p>One pod managed to trigger a kernel panic because it exhausted the host&#8217;s file descriptors. If we had used a proper <code>securityContext<\/code>, the container would have been capped. <\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\"># What a sane person would write\nsecurityContext:\n  runAsNonRoot: true\n  runAsUser: 1000\n  readOnlyRootFilesystem: true\n  allowPrivilegeEscalation: false\n  capabilities:\n    drop:\n      - ALL\n<\/code><\/pre>\n<p>But no, &#8220;security slows down development.&#8221; You know what else slows down development? Having the entire engineering team locked in a Zoom bridge for two days because a compromised or buggy container wiped the node&#8217;s root partition.<\/p>\n<hr \/>\n<h3><span class=\"ez-toc-section\" id=\"5_The_Networking_Rabbit_Hole_CNI_Failures_and_Latency\"><\/span>5. The Networking Rabbit Hole: CNI Failures and Latency<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>By hour 20, we got the pods to stop crashing, but the latency was astronomical. We\u2019re talking 5 seconds for a simple DNS lookup. I checked CoreDNS.<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">$ kubectl logs -n kube-system -l k8s-app=kube-dns\n[ERROR] plugin\/errors: 2 checkout-api.production.svc.cluster.local. 
A: read udp 10.244.0.15:53-&gt;10.244.0.2:45678: i\/o timeout\n<\/code><\/pre>\n<p>The &#8220;Architects&#8221; had left the application running with <code>ndots: 5<\/code>, the cluster default, in its resolver config. For those who don&#8217;t live in the trenches, this means any name with fewer than five dots gets tried against every entry in the search path before the resolver falls back to the absolute name. <\/p>\n<p><code>checkout-api<\/code> -&gt; <code>checkout-api.production.svc.cluster.local<\/code> -&gt; <code>checkout-api.svc.cluster.local<\/code> -&gt; <code>checkout-api.cluster.local<\/code>&#8230; <\/p>\n<p>Every single database lookup was fanning out into a burst of DNS queries. CoreDNS was being hammered so hard that the conntrack tables on the nodes were overflowing. <\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">$ dmesg | grep &quot;conntrack full&quot;\n[72431.123456] nf_conntrack: table full, dropping packet\n<\/code><\/pre>\n<p>I had to manually go in and patch the <code>coredns<\/code> ConfigMap to serve stale cache entries and increase the memory limit. I also had to explain that clients should at least be pointing at the FQDN with a trailing dot to bypass the <code>ndots<\/code> search hell. <\/p>\n<p>And then there was the CNI. We\u2019re using a standard overlay network, but because someone decided to set the MTU (Maximum Transmission Unit) incorrectly on the VPC, we were getting packet fragmentation. The pods could talk to each other if the payload was small, but as soon as a JSON response hit 1500 bytes, the packets were dropped silently. No error, just a timeout. 
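<\/p>\n<p>If rewriting every client to use trailing-dot FQDNs is not realistic, the resolver options can also be overridden per pod. A minimal sketch; the exact <code>ndots<\/code> value is a judgment call for your environment, not gospel:<\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\"># Pod spec fragment: tame the search-path fan-out\nspec:\n  dnsConfig:\n    options:\n    - name: ndots\n      value: &quot;2&quot; # names with two or more dots are tried as absolute names first\n<\/code><\/pre>\n<p>With <code>ndots: 2<\/code>, a lookup for <code>checkout-api.production.svc.cluster.local<\/code> resolves as an absolute name on the first try instead of walking the search domains, which takes most of the load off CoreDNS and the conntrack table.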
<\/p>\n<p>I spent six hours with <code>tcpdump<\/code> and <code>wireshark<\/code> inside a debug container just to prove that it wasn&#8217;t &#8220;the network being slow&#8221; but rather &#8220;the network being misconfigured by people who think YAML is just a suggestion.&#8221;<\/p>\n<hr \/>\n<h3><span class=\"ez-toc-section\" id=\"6_The_%E2%80%9CKubernetes_Best%E2%80%9D_Practices_We_Ignored_And_Why_Were_Fired\"><\/span>6. The &#8220;Kubernetes Best&#8221; Practices We Ignored (And Why We\u2019re Fired)<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Finally, at hour 40, we started looking at the cluster&#8217;s overall health. The reason the failure was so &#8220;catastrophic&#8221; and not just &#8220;annoying&#8221; was that we ignored every single <strong>Kubernetes best<\/strong> practice regarding high availability.<\/p>\n<p>First: <strong>Pod Topology Spread Constraints<\/strong>.<br \/>\nThe &#8220;Architects&#8221; didn&#8217;t define these. As a result, the scheduler\u2014in its infinite, unguided wisdom\u2014placed 90% of our critical pods on the same physical rack in the same availability zone. When that zone had a minor power blip, the entire &#8220;redundant&#8221; cluster went dark.<\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\"># What we should have had\ntopologySpreadConstraints:\n  - maxSkew: 1\n    topologyKey: topology.kubernetes.io\/zone\n    whenUnsatisfiable: DoNotSchedule\n    labelSelector:\n      matchLabels:\n        app: checkout-api\n<\/code><\/pre>\n<p>Second: <strong>Horizontal Pod Autoscaler (HPA) Misconfiguration<\/strong>.<br \/>\nThe HPA was configured to scale based on CPU at 50%. Sounds safe, right? Wrong. The application is IO-bound, not CPU-bound. During the outage, the CPU stayed low because the app was waiting on DNS timeouts. 
So the HPA started <em>scaling down<\/em> the number of pods because it thought the system was idle, which increased the load on the remaining pods, which caused them to OOM, which caused the HPA to scale down further. It was a &#8220;death spiral&#8221; orchestrated by a YAML file.<\/p>\n<p>Third: <strong>Taints and Tolerations<\/strong>.<br \/>\nWe have &#8220;special&#8221; nodes with NVMe drives for our databases. Someone forgot to add taints to them. So, a bunch of &#8220;Hello World&#8221; cronjobs from the marketing team&#8217;s experimental namespace got scheduled onto the high-performance database nodes, eating all the IOPS and starving the production DB.<\/p>\n<p>I\u2019m currently sitting in the office, the sun is coming up for the second time, and I\u2019m looking at a &#8220;Strategy Document&#8221; from the architects about &#8220;Moving to a Multi-Cluster Service Mesh.&#8221; <\/p>\n<p>I haven&#8217;t slept. I smell like stale coffee and failure. I have 400 tabs of <code>kubectl<\/code> logs open. And these people want to add <em>more<\/em> complexity? We can&#8217;t even get a <code>readinessProbe<\/code> right, and they want to implement Istio?<\/p>\n<p>Here is the reality: Kubernetes 1.29 is a rock-solid platform. It\u2019s the most stable, feature-rich version we\u2019ve ever had. It has the tools to prevent every single thing that happened this weekend. But Kubernetes is a mirror. If your engineering culture is a mess, your cluster will be a mess. If your architects think they are too good to understand how a cgroup works, your production environment will stay down.<\/p>\n<p>I\u2019m going home. If PagerDuty goes off again because someone changed the <code>imagePullPolicy<\/code> to <code>Always<\/code> on a 5GB image, I\u2019m throwing my phone into the river.<\/p>\n<p><strong>Resolution:<\/strong><br \/>\n1. Re-implemented <code>requests<\/code> and <code>limits<\/code> across all namespaces.<br \/>\n2. 
Added <code>startupProbes<\/code> to all JVM-based services.<br \/>\n3. Fixed the <code>ndots<\/code> issue in the global <code>dnsConfig<\/code>.<br \/>\n4. Applied <code>topologySpreadConstraints<\/code> to ensure multi-AZ resilience.<br \/>\n5. Deleted the &#8220;Architect&#8217;s&#8221; write access to the production repository.<\/p>\n<p><strong>Status:<\/strong> Cluster stable. SRE unstable.<\/p>\n<hr \/>\n<p><em>EOF &#8211; End of Report<\/em><\/p>\n<h2><span class=\"ez-toc-section\" id=\"Related_Articles\"><\/span>Related Articles<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Explore more insights and best practices:<\/p>\n<ul>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/backup-all-mysql-databases-with-a-mysql-backup-script\/\">Backup All Mysql Databases With A Mysql Backup Script<\/a><\/li>\n<li><a href=\"https:\/\/oracle.itsupportwale.com\/blog\/how-to-increase-migration-speed-in-office-365\/\">How To Increase Migration Speed In Office 365<\/a><\/li>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/creating-a-backup-user-with-read-only-permission-for-mysql-db\/\">Creating A Backup User With Read Only Permission For Mysql Db<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>INCIDENT REPORT: POST-MORTEM #882-B (THE &#8220;FRIDAY AFTERNOON ARCHITECT SPECIAL&#8221;) TIMESTAMP: 2024-05-17T03:04:12Z INITIAL ALERT: CRITICAL &#8211; PagerDuty &#8211; Service: checkout-api &#8211; Severity: 1 &#8211; Status: FAILED FIRST LOG ENTRY: kubelet[1024]: E0517 03:04:12.442103 1024 remote_runtime.go:116] &#8220;RunPodSandbox from runtime service failed&#8221; err=&#8221;rpc error: code = Unknown desc = failed to setup network for sandbox: failed to allocate for &#8230; <a title=\"10 Kubernetes Best Practices for Production Success\" class=\"read-more\" href=\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success\/\" aria-label=\"Read more  on 10 Kubernetes Best Practices for Production Success\">Read 
more<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-4485","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.0 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>10 Kubernetes Best Practices for Production Success - ITSupportWale<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"10 Kubernetes Best Practices for Production Success - ITSupportWale\" \/>\n<meta property=\"og:description\" content=\"INCIDENT REPORT: POST-MORTEM #882-B (THE &#8220;FRIDAY AFTERNOON ARCHITECT SPECIAL&#8221;) TIMESTAMP: 2024-05-17T03:04:12Z INITIAL ALERT: CRITICAL - PagerDuty - Service: checkout-api - Severity: 1 - Status: FAILED FIRST LOG ENTRY: kubelet[1024]: E0517 03:04:12.442103 1024 remote_runtime.go:116] &quot;RunPodSandbox from runtime service failed&quot; err=&quot;rpc error: code = Unknown desc = failed to setup network for sandbox: failed to allocate for ... 
Read more\" \/>\n<meta property=\"og:url\" content=\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success\/\" \/>\n<meta property=\"og:site_name\" content=\"ITSupportWale\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\" \/>\n<meta property=\"article:published_time\" content=\"2026-01-29T15:45:15+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2026-02-17T10:25:51+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2021\/05\/android-chrome-512x512-1.png\" \/>\n\t<meta property=\"og:image:width\" content=\"512\" \/>\n\t<meta property=\"og:image:height\" content=\"512\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Techie\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Techie\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"10 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success\/\"},\"author\":{\"name\":\"Techie\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\"},\"headline\":\"10 Kubernetes Best Practices for Production Success\",\"datePublished\":\"2026-01-29T15:45:15+00:00\",\"dateModified\":\"2026-02-17T10:25:51+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success\/\"},\"wordCount\":1691,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success\/\",\"name\":\"10 Kubernetes Best Practices for Production Success - 
ITSupportWale\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\"},\"datePublished\":\"2026-01-29T15:45:15+00:00\",\"dateModified\":\"2026-02-17T10:25:51+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/itsupportwale.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"10 Kubernetes Best Practices for Production Success\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"name\":\"ITSupportWale\",\"description\":\"Tips, Tricks, Fixed-Errors, Tutorials &amp; 
Guides\",\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/itsupportwale.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\",\"name\":\"itsupportwale\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png\",\"contentUrl\":\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png\",\"width\":1119,\"height\":144,\"caption\":\"itsupportwale\"},\"image\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\",\"name\":\"Techie\",\"sameAs\":[\"https:\/\/itsupportwale.com\",\"iswblogadmin\"],\"url\":\"https:\/\/itsupportwale.com\/blog\/author\/iswblogadmin\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"10 Kubernetes Best Practices for Production Success - ITSupportWale","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success\/","og_locale":"en_US","og_type":"article","og_title":"10 Kubernetes Best Practices for Production Success - ITSupportWale","og_description":"INCIDENT REPORT: POST-MORTEM #882-B (THE &#8220;FRIDAY AFTERNOON ARCHITECT SPECIAL&#8221;) TIMESTAMP: 2024-05-17T03:04:12Z INITIAL ALERT: CRITICAL - PagerDuty - Service: checkout-api - Severity: 1 - Status: FAILED FIRST LOG ENTRY: kubelet[1024]: E0517 03:04:12.442103 1024 remote_runtime.go:116] \"RunPodSandbox from runtime service failed\" err=\"rpc error: code = Unknown desc = failed to setup network for sandbox: failed to allocate for ... Read more","og_url":"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success\/","og_site_name":"ITSupportWale","article_publisher":"https:\/\/www.facebook.com\/Itsupportwale-298547177495978","article_published_time":"2026-01-29T15:45:15+00:00","article_modified_time":"2026-02-17T10:25:51+00:00","og_image":[{"width":512,"height":512,"url":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2021\/05\/android-chrome-512x512-1.png","type":"image\/png"}],"author":"Techie","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Techie","Est. 
reading time":"10 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success\/#article","isPartOf":{"@id":"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success\/"},"author":{"name":"Techie","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d"},"headline":"10 Kubernetes Best Practices for Production Success","datePublished":"2026-01-29T15:45:15+00:00","dateModified":"2026-02-17T10:25:51+00:00","mainEntityOfPage":{"@id":"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success\/"},"wordCount":1691,"commentCount":0,"publisher":{"@id":"https:\/\/itsupportwale.com\/blog\/#organization"},"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success\/","url":"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success\/","name":"10 Kubernetes Best Practices for Production Success - 
ITSupportWale","isPartOf":{"@id":"https:\/\/itsupportwale.com\/blog\/#website"},"datePublished":"2026-01-29T15:45:15+00:00","dateModified":"2026-02-17T10:25:51+00:00","breadcrumb":{"@id":"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/itsupportwale.com\/blog\/10-kubernetes-best-practices-for-production-success\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/itsupportwale.com\/blog\/"},{"@type":"ListItem","position":2,"name":"10 Kubernetes Best Practices for Production Success"}]},{"@type":"WebSite","@id":"https:\/\/itsupportwale.com\/blog\/#website","url":"https:\/\/itsupportwale.com\/blog\/","name":"ITSupportWale","description":"Tips, Tricks, Fixed-Errors, Tutorials &amp; Guides","publisher":{"@id":"https:\/\/itsupportwale.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/itsupportwale.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/itsupportwale.com\/blog\/#organization","name":"itsupportwale","url":"https:\/\/itsupportwale.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png","contentUrl":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png","width":1119,"height":144,"caption":"itsupportwale"},"image":{"@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["htt
ps:\/\/www.facebook.com\/Itsupportwale-298547177495978"]},{"@type":"Person","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d","name":"Techie","sameAs":["https:\/\/itsupportwale.com","iswblogadmin"],"url":"https:\/\/itsupportwale.com\/blog\/author\/iswblogadmin\/"}]}},"_links":{"self":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts\/4485","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/comments?post=4485"}],"version-history":[{"count":3,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts\/4485\/revisions"}],"predecessor-version":[{"id":4626,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts\/4485\/revisions\/4626"}],"wp:attachment":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/media?parent=4485"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/categories?post=4485"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/tags?post=4485"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}