{"id":4736,"date":"2026-03-16T21:42:47","date_gmt":"2026-03-16T16:12:47","guid":{"rendered":"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-simple-guide-to-container-orchestration\/"},"modified":"2026-03-16T21:42:47","modified_gmt":"2026-03-16T16:12:47","slug":"what-is-kubernetes-a-simple-guide-to-container-orchestration","status":"publish","type":"post","link":"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-simple-guide-to-container-orchestration\/","title":{"rendered":"What is Kubernetes? A Simple Guide to Container Orchestration"},"content":{"rendered":"<pre class=\"codehilite\"><code class=\"language-bash\">$ kubectl get pods -A\nNAMESPACE              NAME                                           READY   STATUS                            RESTARTS         AGE\nkube-system            coredns-7689d884b-l2v98                        0\/1     CrashLoopBackOff                  42 (3m ago)      72h\nkube-system            kube-proxy-z4m2n                               0\/1     Error                             15               72h\nproduction             api-gateway-v2-7f5d9c8d4b-9w2k1                0\/2     ImagePullBackOff                  0                14m\nproduction             order-processor-5566778899-abc12               0\/1     CreateContainerConfigError        0                12m\nproduction             payment-service-8899aabbcc-xyz34               0\/1     Terminating                       0                72h\nproduction             auth-service-66778899aa-def56                  0\/1     Pending                           0                5m\nmonitoring             prometheus-server-0                            0\/1     CrashLoopBackOff                  112              72h\ningress-nginx          ingress-nginx-controller-646d5d4d54-m9s2z      0\/1     ValidatingWebhookConfiguration    0                2m\nkube-system            etcd-ip-10-0-64-12.ec2.internal                0\/1     Error                             9                72h\n\n$ kubectl describe node ip-10-0-64-12.ec2.internal\nName:               ip-10-0-64-12.ec2.internal\nStatus:             Ready\nConditions:\n  Type             Status  LastHeartbeatTime                 Reason                       Message\n  ----             ------  -----------------                 ------                       -------\n  NetworkUnavailable False   Thu, 24 Oct 2024 03:14:22 +0000   RouteCreated                 RouteCreated.\n  MemoryPressure   True    Thu, 24 Oct 2024 04:45:10 +0000   KubeletHasInsufficientMemory kubelet has insufficient memory available\n  DiskPressure     False   Thu, 24 Oct 2024 04:45:10 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure\n  PIDPressure      False   Thu, 24 Oct 2024 04:45:10 +0000   KubeletHasNoPidPressure      kubelet has no pid pressure\n  Ready            False   Thu, 24 Oct 2024 04:45:10 +0000   KubeletNotReady              runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized\nEvents:\n  Type     Reason                  Age                From                               Message\n  ----     ------                  ----               ----                               -------\n  Warning  EvictionThresholdMet    5m                 kubelet                            Threshold observed at memory.available=482Mi, boundary=500Mi\n<\/code><\/pre>\n
<p>The sun is coming up. Or maybe it\u2019s going down. I can\u2019t tell because the blinds are drawn and the only light in this room comes from a 32-inch monitor displaying a wall of red text and a terminal window that looks like a crime scene. My hands are shaking, not from caffeine (though I\u2019ve consumed enough cold espresso to stop a horse\u2019s heart) but from the sheer, unadulterated adrenaline of watching 4,000 nodes commit collective suicide because of a single character error in a ValidatingWebhookConfiguration.<\/p>\n
<p>You might be asking <strong>what is<\/strong> the point of all this abstraction? Why do we subject ourselves to this? We took the relatively simple problem of running a binary on a server and wrapped it in fourteen layers of YAML, virtual networking, and distributed consensus algorithms until it became a sentient beast that hates us.<\/p>\n
<p>Kubernetes version 1.30 was supposed to be &#8220;stable.&#8221; They talked about &#8220;Structured Authentication&#8221; and &#8220;Node Log Query.&#8221; They didn&#8217;t talk about how, when the admission controller goes dark, the API server starts choking on its own tongue, and the entire control plane turns into a circular firing squad.<\/p>\n
<h2>The Control Plane: The Brain That Forgets<\/h2>\n
<p>The Control Plane is marketed as the &#8220;brain&#8221; of the cluster. In reality, it\u2019s a collection of anxious bureaucrats who refuse to talk to each other unless they have a signed certificate and a 200 OK response. At the center sits <code>kube-apiserver<\/code>. It is the only thing that talks to the database. Everything else (the scheduler, the controller manager, your frantic <code>kubectl<\/code> commands) is just a client.<\/p>\n
<p>When the outage hit at 3 AM three days ago, the <code>kube-apiserver<\/code> wasn&#8217;t just failing; it was screaming. A misconfigured admission controller (a piece of code meant to &#8220;validate&#8221; objects before they are persisted) was pointing to a service that didn&#8217;t exist anymore. Because the webhook was set to <code>failurePolicy: Fail<\/code>, the API server stopped accepting <em>any<\/em> pod updates.<\/p>\n
<p>[INTERNAL MONOLOGUE: Why did we let the junior dev touch the webhooks? Why did I approve the PR? I was thinking about lunch. I was thinking about a sandwich while I signed the death warrant for our production environment.]<\/p>\n
<p>In v1.30, the API server is more &#8220;efficient,&#8221; which just means it fails faster. When the webhook timed out, the request latencies spiked. The <code>kube-controller-manager<\/code> noticed the nodes weren&#8217;t reporting in because their status updates were being rejected by the same broken webhook. It did what it was programmed to do: it assumed the nodes were dead and started rescheduling 10,000 pods. But it couldn&#8217;t create the new pods because, you guessed it, the webhook was failing.<\/p>\n
<p>Here is the <code>ResourceQuota<\/code> we had in place, which did absolutely nothing to stop the cascading failure because the failure happened before the quota was even checked:<\/p>\n
<pre class=\"codehilite\"><code class=\"language-yaml\">apiVersion: v1\nkind: ResourceQuota\nmetadata:\n  name: compute-resources\n  namespace: production\nspec:\n  hard:\n    requests.cpu: &quot;200&quot;\n    requests.memory: 500Gi\n    limits.cpu: &quot;400&quot;\n    limits.memory: 800Gi\n    pods: &quot;1000&quot;\n    services: &quot;50&quot;\n    replicationcontrollers: &quot;20&quot;\n    resourcequotas: &quot;1&quot;\n<\/code><\/pre>\n
<p>The Control Plane is a lie. It\u2019s a series of loops. 
The &#8220;Reconciliation Loop&#8221; is just a fancy way of saying &#8220;I\u2019m going to keep trying to do this thing until I die or the universe ends.&#8221; When the state in <code>etcd<\/code> doesn&#8217;t match the state on the ground, the controllers panic.<\/p>\n
<h2><span class=\"ez-toc-section\" id=\"Etcd_The_Consistency_Nightmare\"><\/span>Etcd: The Consistency Nightmare<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>If the API server is the brain, <code>etcd<\/code> is the memory. And our memory is currently corrupted by a thousand &#8220;deadline exceeded&#8221; errors. <code>etcd<\/code> uses the Raft consensus algorithm. It requires a majority to agree on anything. If you have three nodes and two of them stop talking because the underlying EBS volume decided to have a &#8220;moment,&#8221; your cluster is a brick.<\/p>\n<p>During the height of the outage, <code>etcd<\/code> was reporting disk sync durations in the seconds. <\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\"># journalctl -u etcd\nOct 24 02:15:10 etcd-node-1 etcd[1234]: slow HTTP response from 10.0.64.12:2379 took 2.4512s\nOct 24 02:15:12 etcd-node-1 etcd[1234]: failed to send out heartbeat on time (exceeded 100ms)\nOct 24 02:15:12 etcd-node-1 etcd[1234]: server is likely overloaded\n<\/code><\/pre>\n<p>When <code>etcd<\/code> lags, the world stops. 
The API server can\u2019t write the &#8220;I\u2019m alive&#8221; heartbeat from the Kubelet. The Control Plane thinks the node is gone. It marks it as <code>Unknown<\/code>. It tries to move the work. But the work can&#8217;t move. You end up with &#8220;Ghost Pods&#8221;\u2014containers running on a node that the API server thinks is empty, while the scheduler tries to cram more containers onto that same exhausted hardware.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_Kubelet_The_Overworked_Janitor\"><\/span>The Kubelet: The Overworked Janitor<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>On every single node, there is a <code>kubelet<\/code>. This is the most honest piece of software in the whole stack. It doesn&#8217;t care about &#8220;service meshes&#8221; or &#8220;serverless.&#8221; It just wants to run containers. It watches the API server for pods assigned to its node, and then it talks to the container runtime to make it happen.<\/p>\n<p>But the <code>kubelet<\/code> is a snitch. It constantly reports back on the health of the node. When the network is saturated because your CNI is flapping, the <code>kubelet<\/code> can&#8217;t send its heartbeats.<\/p>\n<p>[INTERNAL MONOLOGUE: I can hear the fans in the server room from here. They sound like a jet engine taking off. That\u2019s the sound of the Kubelet trying to calculate cgroup metrics while the CPU is throttled to 10%.]<\/p>\n<p>Look at this <code>journalctl<\/code> output from one of the dying nodes. 
This is what a nervous breakdown looks like in Go:<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\"># journalctl -u kubelet -f\nOct 24 04:50:01 node-1 kubelet[998]: E1024 04:50:01.123456    998 pod_workers.go:1294] &quot;Error syncing pod, skipping&quot; err=&quot;failed to &quot;StartContainer&quot; for &quot;runtime&quot; with CrashLoopBackOff: &quot;back-off 5m0s restarting failed container=api-gateway pod=api-gateway-v2-7f5d9c8d4b-9w2k1_production&quot;&quot;\nOct 24 04:50:05 node-1 kubelet[998]: I1024 04:50:05.555555    998 status_manager.go:652] &quot;Failed to update status&quot; pod=&quot;payment-service-8899aabbcc-xyz34&quot; err=&quot;node \\&quot;ip-10-0-64-12.ec2.internal\\&quot; not found&quot;\nOct 24 04:50:10 node-1 kubelet[998]: E1024 04:50:10.888888    998 kubelet.go:2450] &quot;Error getting node&quot; err=&quot;node \\&quot;ip-10-0-64-12.ec2.internal\\&quot; not found&quot;\n<\/code><\/pre>\n<p>The <code>kubelet<\/code> is trying to update the status of a pod, but the API server is telling it that the node it\u2019s running on <em>doesn&#8217;t exist<\/em>. This is the gaslighting of the SRE. You are staring at a server, logged in via SSH, and the system is telling you the server is a figment of your imagination.<\/p>\n<p>In v1.30, the Kubelet has better handling for memory swap, but that doesn&#8217;t help when your <code>containerd<\/code> socket is unresponsive because the kernel is OOM-killing the runtime itself.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Container_Runtime_The_Shaky_Foundation\"><\/span>Container Runtime: The Shaky Foundation<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Kubernetes doesn&#8217;t actually run containers. It asks <code>containerd<\/code> or <code>CRI-O<\/code> to do it. This is the Container Runtime Interface (CRI). It\u2019s another layer of indirection. When you see <code>ImagePullBackOff<\/code>, it\u2019s usually not because the image isn&#8217;t there. 
It\u2019s because the runtime\u2019s credentials expired, or the CNI failed to set up the bridge interface, or the disk is so slow that the extraction of the layer timed out.<\/p>\n<p>We had a <code>Deployment<\/code> that looked like this. It\u2019s a standard piece of garbage, full of &#8220;best practices&#8221; that become &#8220;worst nightmares&#8221; during an outage:<\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\">apiVersion: apps\/v1\nkind: Deployment\nmetadata:\n  name: api-gateway-v2\n  namespace: production\nspec:\n  replicas: 10\n  selector:\n    matchLabels:\n      app: api-gateway\n  template:\n    metadata:\n      labels:\n        app: api-gateway\n    spec:\n      containers:\n      - name: gateway\n        image: our-priv-reg.io\/api-gateway:v2.1.4\n        ports:\n        - containerPort: 8080\n        livenessProbe:\n          httpGet:\n            path: \/healthz\n            port: 8080\n          initialDelaySeconds: 15\n          periodSeconds: 20\n          timeoutSeconds: 5\n          failureThreshold: 3\n        readinessProbe:\n          httpGet:\n            path: \/ready\n            port: 8080\n          initialDelaySeconds: 5\n          periodSeconds: 10\n        resources:\n          requests:\n            cpu: &quot;500m&quot;\n            memory: &quot;1Gi&quot;\n          limits:\n            cpu: &quot;1000m&quot;\n            memory: &quot;2Gi&quot;\n<\/code><\/pre>\n<p>During the outage, the <code>livenessProbe<\/code> was the killer. Because the network was congested, the probes timed out. The <code>kubelet<\/code> then killed the container. This triggered a restart. The restart triggered an image pull. The image pull triggered more network traffic. The network traffic caused more probe timeouts. It\u2019s a self-amplifying feedback loop of failure. <\/p>\n<p>[INTERNAL MONOLOGUE: I should have used a <code>startupProbe<\/code>. I knew it. I wrote the documentation on it. 
And yet, here I am, watching my own creation choke itself to death.]<\/p>\n<h2><span class=\"ez-toc-section\" id=\"CNI_The_Plumbing_from_Hell\"><\/span>CNI: The Plumbing from Hell<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The Container Network Interface (CNI) is where the real dark magic happens. This is what allows a pod on Node A to talk to a pod on Node B. It involves BGP, VXLAN, Geneve, or just a massive pile of <code>iptables<\/code> rules that nobody understands.<\/p>\n<p>When the CNI fails, it fails silently. You\u2019ll see &#8220;Ready&#8221; nodes, but no traffic flows. In our case, the CNI (we\u2019re using Cilium, because we like complexity) couldn&#8217;t allocate IPs because the <code>CiliumNode<\/code> custom resource couldn&#8217;t be updated in the API server. <\/p>\n<p>What is a network in Kubernetes? It\u2019s a hallucination. It\u2019s a series of virtual interfaces (<code>veth<\/code> pairs) and routing table entries that are constantly being rewritten. In v1.30, there are improvements in how the <code>NodeIPAM<\/code> handles ranges, but that doesn&#8217;t matter when your <code>iptables-restore<\/code> command is taking 30 seconds to run because you have 50,000 services.<\/p>\n<p>Every time a pod is created, the CNI has to:<\/p>\n<ol>\n<li>Create a network namespace.<\/li>\n<li>Create a <code>veth<\/code> pair.<\/li>\n<li>Attach one end to the container and the other to the host bridge or OVS.<\/li>\n<li>Assign an IP.<\/li>\n<li>Set up routes.<\/li>\n<li>Configure NAT for egress.<\/li>\n<\/ol>\n<p>If any of those steps fail (say, because the kernel is locked up trying to process a million packets), the pod stays in <code>ContainerCreating<\/code> forever.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_Admission_Controller_The_Gatekeeper_with_a_Grudge\"><\/span>The Admission Controller: The Gatekeeper with a Grudge<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>This was the source of the 72-hour hell. 
Admission controllers are plugins that govern what the API server allows. There are built-in ones (like <code>ResourceQuota<\/code>) and &#8220;Dynamic&#8221; ones (Mutating and Validating Webhooks).<\/p>\n<p>We use a Validating Webhook to ensure no one deploys a container as <code>root<\/code>. It\u2019s a noble goal. But the webhook is an external service. When that service\u2019s pod was evicted due to &#8220;Memory Pressure&#8221; (caused by the very outage it was about to exacerbate), the API server couldn&#8217;t reach it.<\/p>\n<p>The webhook was configured like this:<\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\">apiVersion: admissionregistration.k8s.io\/v1\nkind: ValidatingWebhookConfiguration\nmetadata:\n  name: security-policy-checker\nwebhooks:\n  - name: check-privileges.example.com\n    rules:\n      - operations: [&quot;CREATE&quot;, &quot;UPDATE&quot;]\n        apiGroups: [&quot;&quot;]\n        apiVersions: [&quot;v1&quot;]\n        resources: [&quot;pods&quot;]\n    clientConfig:\n      service:\n        namespace: security\n        name: policy-webhook\n    failurePolicy: Fail # &lt;--- THIS IS THE LOADED GUN\n    sideEffects: None\n    admissionReviewVersions: [&quot;v1&quot;]\n    timeoutSeconds: 30 # the maximum allowed; the default is 10\n<\/code><\/pre>\n<p>The <code>failurePolicy: Fail<\/code> meant that if the webhook didn&#8217;t respond, the Pod creation\/update was rejected. Nothing in the <code>rules<\/code> exempted the webhook&#8217;s own <code>security<\/code> namespace, so the policy applied to the webhook&#8217;s Pod too. Since the webhook itself was a Pod, and it was down, the controller that owned it couldn&#8217;t create a replacement because the API server couldn&#8217;t get the new Pod validated. It was a deadlock. A perfect, beautiful circle of nothingness.<\/p>\n<p>I had to manually edit the <code>ValidatingWebhookConfiguration<\/code> via <code>kubectl<\/code> while the API server was timing out, just to set that policy to <code>Ignore<\/code>. 
It took two hours just to get that one command to go through.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_PodDisruptionBudget_The_Final_Insult\"><\/span>The PodDisruptionBudget: The Final Insult<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>When I finally tried to drain the nodes to reset the runtime, I was stopped by the <code>PodDisruptionBudget<\/code> (PDB).<\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\">apiVersion: policy\/v1\nkind: PodDisruptionBudget\nmetadata:\n  name: api-pdb\n  namespace: production\nspec:\n  minAvailable: 80%\n  selector:\n    matchLabels:\n      app: api-gateway\n<\/code><\/pre>\n<p>The PDB said &#8220;You cannot take this node down because it would drop the availability of the API Gateway below 80%.&#8221; But the API Gateway was already at 0% availability because of the <code>CrashLoopBackOff<\/code>. A PDB counts only healthy (Ready) pods, and with none healthy the budget allowed zero disruptions, so the eviction API refused to evict even pods that were <em>already broken<\/em>. Setting <code>unhealthyPodEvictionPolicy: AlwaysAllow<\/code> on the PDB exists for this case; we hadn&#8217;t set it. So I couldn&#8217;t drain the nodes to fix the nodes because the nodes were broken.<\/p>\n<p>I had to delete the PDBs. I had to delete the webhooks. I had to delete my own pride.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Linux_Primitives_The_Real_World_Under_the_Hood\"><\/span>Linux Primitives: The Real World Under the Hood<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>At the end of the day, Kubernetes is just a very expensive wrapper around Linux primitives. 
If you strip away the YAML and the Go binaries, you are left with <code>namespaces<\/code> and <code>cgroups<\/code>.<\/p>\n<p><strong>Namespaces<\/strong> are the isolation.<\/p>\n<ul>\n<li><code>mnt<\/code>: Different filesystems.<\/li>\n<li><code>net<\/code>: Different network stacks.<\/li>\n<li><code>pid<\/code>: The container thinks it\u2019s PID 1.<\/li>\n<li><code>uts<\/code>: Different hostnames.<\/li>\n<li><code>ipc<\/code>: Isolated inter-process communication.<\/li>\n<li><code>user<\/code>: Mapping root in the container to an unprivileged user on the host.<\/li>\n<\/ul>\n<p><strong>Cgroups (Control Groups)<\/strong> are the resource limits. In v1.30, we are fully into the <code>cgroup v2<\/code> era. This is what handles the memory limits that OOM-kill your Java apps. When you set a memory limit of 2Gi, the kernel charges every page the container touches to that cgroup, page cache included, not only the resident set size (RSS). The moment usage would pass 2Gi and nothing more can be reclaimed, the kernel\u2019s OOM killer picks a process in that cgroup and sends it a <code>SIGKILL<\/code>. <\/p>\n<p>Kubernetes tries to be smart about this, but the kernel is brutal. There is no &#8220;graceful shutdown&#8221; for an OOM kill. The process is just gone. The container exits with code 137 (128 plus 9, the signal number of <code>SIGKILL<\/code>), the runtime reports the reason as <code>OOMKilled<\/code>, and the <code>kubelet<\/code> updates the pod status.<\/p>\n<p>[INTERNAL MONOLOGUE: I\u2019m staring at the <code>top<\/code> output on a node. The load average is 450. On a 64-core machine. The system is spending all its time in <code>iowait<\/code>. The disk is dying. The containers are dying. I am dying.]<\/p>\n<h2><span class=\"ez-toc-section\" id=\"What_is_the_point\"><\/span>What is the point?<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>What is the point of a system that is so complex it requires a 72-hour war room to fix a single config change? We wanted &#8220;self-healing&#8221; infrastructure. 
What we got was a system that is very good at healing itself from small, predictable failures, but spectacularly good at accelerating large, unpredictable ones.<\/p>\n<p>Kubernetes v1.30 is a marvel of engineering. It is also a testament to our hubris. We have built a platform that can scale to 5,000 nodes but can be brought to its knees by a single malformed YAML file.<\/p>\n<p>I\u2019m going to finish this coffee. It\u2019s cold and tastes like battery acid. Then I\u2019m going to delete the remaining <code>Evicted<\/code> pods, check the <code>etcd<\/code> leader election metrics one last time, and go to sleep for a week. Or at least until the next PagerDuty alert at 3 AM.<\/p>\n<p>Because the &#8220;brain&#8221; never sleeps. It just waits for you to make a mistake.<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">$ kubectl get nodes\nNAME                         STATUS   ROLES           AGE   VERSION\nip-10-0-64-12.ec2.internal   Ready    control-plane   72h   v1.30.1\nip-10-0-64-13.ec2.internal   Ready    worker          72h   v1.30.1\nip-10-0-64-14.ec2.internal   Ready    worker          72h   v1.30.1\n<\/code><\/pre>\n
<p>Finally. Everything is &#8220;Ready.&#8221;<\/p>\n<p>But we all know that &#8220;Ready&#8221; is just a temporary state between disasters.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>$ kubectl get pods -A NAMESPACE NAME READY STATUS RESTARTS AGE kube-system coredns-7689d884b-l2v98 0\/1 CrashLoopBackOff 42 (3m ago) 72h kube-system kube-proxy-z4m2n 0\/1 Error 15 72h production api-gateway-v2-7f5d9c8d4b-9w2k1 0\/2 ImagePullBackOff 0 14m production order-processor-5566778899-abc12 0\/1 CreateContainerConfigError 0 12m production payment-service-8899aabbcc-xyz34 0\/1 Terminating 0 72h production auth-service-66778899aa-def56 0\/1 Pending 0 5m monitoring prometheus-server-0 0\/1 CrashLoopBackOff 112 72h &#8230; <a title=\"What is Kubernetes? A Simple Guide to Container Orchestration\" class=\"read-more\" href=\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-simple-guide-to-container-orchestration\/\" aria-label=\"Read more  on What is Kubernetes? A Simple Guide to Container Orchestration\">Read more<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-4736","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.0 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>What is Kubernetes? 
A Simple Guide to Container Orchestration - ITSupportWale<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-simple-guide-to-container-orchestration\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"What is Kubernetes? A Simple Guide to Container Orchestration - ITSupportWale\" \/>\n<meta property=\"og:description\" content=\"$ kubectl get pods -A NAMESPACE NAME READY STATUS RESTARTS AGE kube-system coredns-7689d884b-l2v98 0\/1 CrashLoopBackOff 42 (3m ago) 72h kube-system kube-proxy-z4m2n 0\/1 Error 15 72h production api-gateway-v2-7f5d9c8d4b-9w2k1 0\/2 ImagePullBackOff 0 14m production order-processor-5566778899-abc12 0\/1 CreateContainerConfigError 0 12m production payment-service-8899aabbcc-xyz34 0\/1 Terminating 0 72h production auth-service-66778899aa-def56 0\/1 Pending 0 5m monitoring prometheus-server-0 0\/1 CrashLoopBackOff 112 72h ... 
Read more\" \/>\n<meta property=\"og:url\" content=\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-simple-guide-to-container-orchestration\/\" \/>\n<meta property=\"og:site_name\" content=\"ITSupportWale\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\" \/>\n<meta property=\"article:published_time\" content=\"2026-03-16T16:12:47+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2021\/05\/android-chrome-512x512-1.png\" \/>\n\t<meta property=\"og:image:width\" content=\"512\" \/>\n\t<meta property=\"og:image:height\" content=\"512\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Techie\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Techie\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"13 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-simple-guide-to-container-orchestration\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-simple-guide-to-container-orchestration\/\"},\"author\":{\"name\":\"Techie\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\"},\"headline\":\"What is Kubernetes? 
A Simple Guide to Container Orchestration\",\"datePublished\":\"2026-03-16T16:12:47+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-simple-guide-to-container-orchestration\/\"},\"wordCount\":1787,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-simple-guide-to-container-orchestration\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-simple-guide-to-container-orchestration\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-simple-guide-to-container-orchestration\/\",\"name\":\"What is Kubernetes? A Simple Guide to Container Orchestration - ITSupportWale\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\"},\"datePublished\":\"2026-03-16T16:12:47+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-simple-guide-to-container-orchestration\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-simple-guide-to-container-orchestration\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-simple-guide-to-container-orchestration\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/itsupportwale.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"What is Kubernetes? 
A Simple Guide to Container Orchestration\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"name\":\"ITSupportWale\",\"description\":\"Tips, Tricks, Fixed-Errors, Tutorials &amp; Guides\",\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/itsupportwale.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\",\"name\":\"itsupportwale\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png\",\"contentUrl\":\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png\",\"width\":1119,\"height\":144,\"caption\":\"itsupportwale\"},\"image\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\",\"name\":\"Techie\",\"sameAs\":[\"https:\/\/itsupportwale.com\",\"iswblogadmin\"],\"url\":\"https:\/\/itsupportwale.com\/blog\/author\/iswblogadmin\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"What is Kubernetes? 
A Simple Guide to Container Orchestration - ITSupportWale","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-simple-guide-to-container-orchestration\/","og_locale":"en_US","og_type":"article","og_title":"What is Kubernetes? A Simple Guide to Container Orchestration - ITSupportWale","og_description":"$ kubectl get pods -A NAMESPACE NAME READY STATUS RESTARTS AGE kube-system coredns-7689d884b-l2v98 0\/1 CrashLoopBackOff 42 (3m ago) 72h kube-system kube-proxy-z4m2n 0\/1 Error 15 72h production api-gateway-v2-7f5d9c8d4b-9w2k1 0\/2 ImagePullBackOff 0 14m production order-processor-5566778899-abc12 0\/1 CreateContainerConfigError 0 12m production payment-service-8899aabbcc-xyz34 0\/1 Terminating 0 72h production auth-service-66778899aa-def56 0\/1 Pending 0 5m monitoring prometheus-server-0 0\/1 CrashLoopBackOff 112 72h ... Read more","og_url":"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-simple-guide-to-container-orchestration\/","og_site_name":"ITSupportWale","article_publisher":"https:\/\/www.facebook.com\/Itsupportwale-298547177495978","article_published_time":"2026-03-16T16:12:47+00:00","og_image":[{"width":512,"height":512,"url":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2021\/05\/android-chrome-512x512-1.png","type":"image\/png"}],"author":"Techie","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Techie","Est. 
reading time":"13 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-simple-guide-to-container-orchestration\/#article","isPartOf":{"@id":"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-simple-guide-to-container-orchestration\/"},"author":{"name":"Techie","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d"},"headline":"What is Kubernetes? A Simple Guide to Container Orchestration","datePublished":"2026-03-16T16:12:47+00:00","mainEntityOfPage":{"@id":"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-simple-guide-to-container-orchestration\/"},"wordCount":1787,"commentCount":0,"publisher":{"@id":"https:\/\/itsupportwale.com\/blog\/#organization"},"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-simple-guide-to-container-orchestration\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-simple-guide-to-container-orchestration\/","url":"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-simple-guide-to-container-orchestration\/","name":"What is Kubernetes? 
A Simple Guide to Container Orchestration - ITSupportWale","isPartOf":{"@id":"https:\/\/itsupportwale.com\/blog\/#website"},"datePublished":"2026-03-16T16:12:47+00:00","breadcrumb":{"@id":"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-simple-guide-to-container-orchestration\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-simple-guide-to-container-orchestration\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-simple-guide-to-container-orchestration\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/itsupportwale.com\/blog\/"},{"@type":"ListItem","position":2,"name":"What is Kubernetes? A Simple Guide to Container Orchestration"}]},{"@type":"WebSite","@id":"https:\/\/itsupportwale.com\/blog\/#website","url":"https:\/\/itsupportwale.com\/blog\/","name":"ITSupportWale","description":"Tips, Tricks, Fixed-Errors, Tutorials &amp; 
Guides","publisher":{"@id":"https:\/\/itsupportwale.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/itsupportwale.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/itsupportwale.com\/blog\/#organization","name":"itsupportwale","url":"https:\/\/itsupportwale.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png","contentUrl":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png","width":1119,"height":144,"caption":"itsupportwale"},"image":{"@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/Itsupportwale-298547177495978"]},{"@type":"Person","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d","name":"Techie","sameAs":["https:\/\/itsupportwale.com","iswblogadmin"],"url":"https:\/\/itsupportwale.com\/blog\/author\/iswblogadmin\/"}]}},"_links":{"self":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts\/4736","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/comments?post=4736"}],"version-history":[{"count":0,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts\/4736\/revisions"}],"wp:attachment":[{"href":"https:\/\/itsupportwale.com\/blog\
/wp-json\/wp\/v2\/media?parent=4736"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/categories?post=4736"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/tags?post=4736"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}