{"id":4487,"date":"2026-01-31T21:03:15","date_gmt":"2026-01-31T15:33:15","guid":{"rendered":"https:\/\/www.itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/"},"modified":"2026-02-10T11:44:44","modified_gmt":"2026-02-10T06:14:44","slug":"what-is-kubernetes-a-complete-guide-to-orchestration","status":"publish","type":"post","link":"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/","title":{"rendered":"What is Kubernetes? A Complete Guide to Orchestration"},"content":{"rendered":"<p><strong>TIMESTAMP: 2024-05-22 04:12:08 UTC<\/strong><br \/>\n<strong>STATUS: CRITICAL \/ DEGRADED<\/strong><br \/>\n<strong>INCIDENT ID: #8829-BETA-CASCADING-FAILURE<\/strong><br \/>\n<strong>OPERATOR: SRE_042 (COFFEE_LEVEL: CRITICAL)<\/strong><\/p>\n<p>The hum of the data center fans is a sound I can hear even in my own apartment now. It\u2019s a low-frequency vibration that lives in the base of my skull. I\u2019ve been staring at a Grafana dashboard for forty-eight hours, watching the red bars of 5xx errors crawl across the screen like a bloodstain. The &#8220;cloud-native&#8221; dream is currently a charred heap of discarded pods and failed health checks. <\/p>\n<p>Management keeps asking for a &#8220;high-level summary.&#8221; They want to know why the &#8220;self-healing&#8221; infrastructure didn&#8217;t heal. They want to know why the &#8220;magic&#8221; failed. One of the VPs, who probably thinks a container is something you put leftovers in, actually had the audacity to ask me, &#8220;What is Kubernetes, really, if it can\u2019t handle a simple traffic spike?&#8221;<\/p>\n<p>I\u2019m writing this because if I don\u2019t document the binary reality of this nightmare, the marketing team will spin it as a &#8220;learning opportunity&#8221; or some other corporate garbage. This isn&#8217;t a learning opportunity. 
This is a post-mortem of a system that is too complex for its own good, written by someone who has to keep it alive.<\/p>\n<hr \/>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_80 counter-hierarchy ez-toc-counter ez-toc-grey ez-toc-container-direction\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Table of Contents<\/p>\n<label for=\"ez-toc-cssicon-toggle-item-69d83b57b6c35\" class=\"ez-toc-cssicon-toggle-label\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #999;color:#999\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewBox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #999;color:#999\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewBox=\"0 0 24 24\" version=\"1.2\" baseProfile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/label><input type=\"checkbox\"  id=\"ez-toc-cssicon-toggle-item-69d83b57b6c35\"  aria-label=\"Toggle\" \/><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/#H2_INCIDENT-8829_Initial_Triage_The_Cascading_Failure_of_the_prod-us-east-1_Cluster\" >H2: [INCIDENT-8829] Initial Triage: The Cascading Failure of the prod-us-east-1 Cluster<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" 
href=\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/#H2_INCIDENT-8829_The_Reconciliation_Loop_A_Thermostat_in_a_Burning_Building\" >H2: [INCIDENT-8829] The Reconciliation Loop: A Thermostat in a Burning Building<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/#H2_INCIDENT-8829_The_Etcd_State_and_the_Kube-Apiservers_Binary_Silence\" >H2: [INCIDENT-8829] The Etcd State and the Kube-Apiserver\u2019s Binary Silence<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/#H2_INCIDENT-8829_Networking_Purgatory_CNI_Plugins_and_the_Kube-Proxy_Bottleneck\" >H2: [INCIDENT-8829] Networking Purgatory: CNI Plugins and the Kube-Proxy Bottleneck<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/#H2_INCIDENT-8829_The_YAML_Purgatory_Deployment_Controllers_and_the_Indentation_of_Doom\" >H2: [INCIDENT-8829] The YAML Purgatory: Deployment Controllers and the Indentation of Doom<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/#H2_INCIDENT-8829_Remediation_and_the_Bitter_Reality_of_%E2%80%9CSelf-Healing%E2%80%9D_Systems\" >H2: [INCIDENT-8829] Remediation and the Bitter Reality of &#8220;Self-Healing&#8221; Systems<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/#Related_Articles\" >Related 
Articles<\/a><\/li><\/ul><\/nav><\/div>\n<h2><span class=\"ez-toc-section\" id=\"H2_INCIDENT-8829_Initial_Triage_The_Cascading_Failure_of_the_prod-us-east-1_Cluster\"><\/span>H2: [INCIDENT-8829] Initial Triage: The Cascading Failure of the <code>prod-us-east-1<\/code> Cluster<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>It started at 02:00 on Tuesday. Not with a bang, but with a slow climb in etcd commit latency. We\u2019re running Kubernetes v1.29.2 on bare metal, which means we don&#8217;t have a cloud provider to blame when the control plane starts choking. <\/p>\n<p>The first sign of trouble was the <code>kube-apiserver<\/code> becoming unresponsive. When the API server hangs, the heart of the cluster stops beating. I tried to run a basic diagnostic, and the terminal just mocked me:<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">$ kubectl get nodes\nError from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)\n<\/code><\/pre>\n<p>When I finally got a response ten minutes later, the cluster looked like a graveyard:<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">$ kubectl get nodes -o wide\nNAME             STATUS     ROLES           AGE   VERSION   INTERNAL-IP   OS-IMAGE             KERNEL-VERSION\nnode-p-01        Ready      control-plane   112d  v1.29.2   10.0.42.1     Ubuntu 22.04.3 LTS   5.15.0-101-generic\nnode-p-02        NotReady   control-plane   112d  v1.29.2   10.0.42.2     Ubuntu 22.04.3 LTS   5.15.0-101-generic\nnode-p-03        NotReady   control-plane   112d  v1.29.2   10.0.42.3     Ubuntu 22.04.3 LTS   5.15.0-101-generic\nnode-w-101       NotReady   worker          112d  v1.29.2   10.0.43.101   Ubuntu 22.04.3 LTS   5.15.0-101-generic\nnode-w-102       NotReady   worker          112d  v1.29.2   10.0.43.102   Ubuntu 22.04.3 LTS   5.15.0-101-generic\n<\/code><\/pre>\n<p>The &#8220;NotReady&#8221; status is the SRE equivalent of a 
flatline. I checked the <code>kubelet<\/code> logs on <code>node-w-101<\/code>. The PLEG (Pod Lifecycle Event Generator) was failing. The node was so overwhelmed by a sudden burst of container creations and deletions that it couldn&#8217;t even report its own health.<\/p>\n<p>To answer the VP&#8217;s question\u2014<strong>what is<\/strong> Kubernetes in this moment? It isn&#8217;t an orchestrator. It\u2019s a massive, distributed state machine that has lost its mind. It is a collection of binary components\u2014the <code>kube-apiserver<\/code>, <code>etcd<\/code>, <code>kube-scheduler<\/code>, and <code>kube-controller-manager<\/code>\u2014all desperately trying to agree on a reality that no longer exists.<\/p>\n<hr \/>\n<h2><span class=\"ez-toc-section\" id=\"H2_INCIDENT-8829_The_Reconciliation_Loop_A_Thermostat_in_a_Burning_Building\"><\/span>H2: [INCIDENT-8829] The Reconciliation Loop: A Thermostat in a Burning Building<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>To understand why we went down, you have to understand the &#8220;reconciliation loop.&#8221; This is the fundamental logic of Kubernetes. It\u2019s not a &#8220;seamless&#8221; process; it\u2019s a constant, aggressive argument between the <strong>Desired State<\/strong> and the <strong>Actual State<\/strong>.<\/p>\n<p>Think of a thermostat. You set it to 72 degrees (Desired State). The room is 80 degrees (Actual State). The thermostat sees the discrepancy and turns on the AC. That\u2019s a reconciliation loop. <\/p>\n<p>In Kubernetes, this happens for everything. You define a <code>Deployment<\/code> in a YAML file (the &#8220;Desired State&#8221;). The <code>kube-controller-manager<\/code> sees that you want 10 replicas of a pod. It looks at the cluster and sees only 2 are running. It then tells the <code>kube-apiserver<\/code> to create 8 more.<\/p>\n<p>But here\u2019s where the &#8220;magic&#8221; breaks. 
During our incident, the <code>HorizontalPodAutoscaler<\/code> (HPA) saw a spike in CPU usage. It updated the Desired State from 50 pods to 500 pods. The <code>kube-scheduler<\/code> then tried to find nodes for these 450 new pods. <\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">$ kubectl describe pod api-gateway-7f8d9b6-x4z2n\nEvents:\n  Type     Reason            Age                  From               Message\n  ----     ------            ----                 ----               -------\n  Warning  FailedScheduling  3m (x450 over 12m)   default-scheduler  0\/100 nodes are available: 100 Insufficient cpu.\n<\/code><\/pre>\n<p>The scheduler is a greedy algorithm. It goes through two phases: <strong>Filtering<\/strong> and <strong>Scoring<\/strong>. It filters out nodes that don&#8217;t have enough CPU or memory, then it scores the remaining ones to find the &#8220;best&#8221; fit. In our case, every node was already redlining, so every pod failed at the Filtering phase. The scheduler kept trying, and trying, and trying, hammering the <code>kube-apiserver<\/code> with requests. <\/p>\n<p>The API server, in turn, hammered <code>etcd<\/code>. Because <code>etcd<\/code> is a consistent and partition-tolerant (CP) system in CAP theorem terms, it prioritizes consistency over availability: a write is only acknowledged once a majority of members have persisted it to disk. When the disk I\/O couldn&#8217;t keep up with the write requests for 450 new pod objects, the <code>etcd<\/code> leader started missing its Raft heartbeats. The cluster didn&#8217;t just slow down; it entered a death spiral.<\/p>\n<hr \/>\n<h2><span class=\"ez-toc-section\" id=\"H2_INCIDENT-8829_The_Etcd_State_and_the_Kube-Apiservers_Binary_Silence\"><\/span>H2: [INCIDENT-8829] The Etcd State and the Kube-Apiserver\u2019s Binary Silence<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>If you want to know <strong>what is<\/strong> the actual source of truth in a cluster, it\u2019s <code>etcd<\/code>. It\u2019s a key-value store that uses the Raft consensus algorithm. 
It\u2019s the only place where the cluster\u2019s state is persisted. If <code>etcd<\/code> isn&#8217;t happy, nobody is happy.<\/p>\n<p>During the outage, I had to exec into the control plane nodes to check the health of the <code>etcd<\/code> members.<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">$ etcdctl endpoint status --write-out=table\n+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+\n|        ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT TERM CONFIRMED | ERRORS |\n+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+\n| https:\/\/10.0.42.1:2379  | 8e9e05b65750173e |  3.5.10 |  1.2 GB |      true |      false |        12 |    4502103 |                true |        |\n| https:\/\/10.0.42.2:2379  | d2a4e5b65750173f |  3.5.10 |  1.2 GB |     false |      false |        12 |    4502101 |                true |        |\n| https:\/\/10.0.42.3:2379  | f1b3e5b657501740 |  3.5.10 |  1.2 GB |     false |      false |        12 |    4502098 |                true |        |\n+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+\n<\/code><\/pre>\n<p>The DB size was ballooning. We had too many &#8220;Events&#8221; being stored. In Kubernetes, every time a pod fails to start, an Event object is created. When you have 450 pods failing to schedule every 10 seconds, you generate thousands of objects. <code>etcd<\/code> was spending all its time performing MVCC (Multi-Version Concurrency Control) compaction and writing to the WAL (Write-Ahead Log).<\/p>\n<p>The <code>kube-apiserver<\/code> is just a fancy CRUD interface in front of <code>etcd<\/code>. It doesn&#8217;t have a brain. 
It authenticates you, checks your permissions via RBAC (Role-Based Access Control), validates the object you sent, and then shoves the data into <code>etcd<\/code>. When <code>etcd<\/code> lagged, the API server&#8217;s handlers timed out. <\/p>\n<p>The management team thinks the &#8220;Cloud&#8221; is this ethereal thing. It\u2019s not. It\u2019s a bunch of Go binaries fighting over file descriptors and disk IOPS. When the API server stopped responding, the <code>kubelet<\/code> on each worker node, the agent responsible for actually running the containers, couldn&#8217;t get updates. It just kept running whatever garbage it had in its local cache, or worse, it couldn&#8217;t renew its node Lease, which is exactly how a node ends up marked <code>NotReady<\/code> while its containers are still running.<\/p>\n<hr \/>\n<h2><span class=\"ez-toc-section\" id=\"H2_INCIDENT-8829_Networking_Purgatory_CNI_Plugins_and_the_Kube-Proxy_Bottleneck\"><\/span>H2: [INCIDENT-8829] Networking Purgatory: CNI Plugins and the Kube-Proxy Bottleneck<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>While the control plane was melting, the data plane was already a radioactive wasteland. This is where we talk about the Container Network Interface (CNI) and <code>kube-proxy<\/code>.<\/p>\n<p>In Kubernetes, every pod gets its own IP address. This is a lie maintained by the CNI plugin (we use Calico, but it doesn&#8217;t matter, they all fail the same way when pushed). The CNI is responsible for plumbing the virtual ethernet pairs and setting up the routing table.<\/p>\n<p>When the pods started crashing and restarting, the CNI had to constantly assign and reclaim IPs. Every one of those pod changes also rewrote the EndpointSlices behind our Services, which triggered a flood of updates to <code>kube-proxy<\/code>. <\/p>\n<p>Now, let&#8217;s talk about <code>kube-proxy<\/code> modes because this is where the &#8220;cloud-native&#8221; marketing usually ignores the performance cliff. We were running in <code>iptables<\/code> mode. 
In <code>iptables<\/code> mode, <code>kube-proxy<\/code> writes a massive list of NAT rules to the kernel to handle Service routing.<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\"># Looking at the iptables mess on node-w-101\n$ iptables -t nat -L KUBE-SERVICES | wc -l\n4502\n<\/code><\/pre>\n<p>Every time a pod behind a Service comes or goes, <code>kube-proxy<\/code> has to resync those rules into the kernel, and the kernel walks the resulting chains one rule at a time. Matching a packet is an O(n) operation in the number of Services. With thousands of services and pods flailing, the kernel was spending more time processing iptables updates than actually routing packets. This is why we are moving to IPVS (IP Virtual Server) mode as part of the v1.30 upgrade; IPVS looks Services up in a hash table and scales much better. But at 3:00 AM on a Tuesday, knowing that didn&#8217;t help me.<\/p>\n<p>The <code>networking.k8s.io\/v1<\/code> API group defines the Ingress and NetworkPolicy objects, but those are just abstractions. The reality is a mess of <code>veth<\/code> pairs, bridge devices, and routing rules that make your head spin. When the CNI failed to allocate an IP because the <code>kube-apiserver<\/code> was down, the pods stayed in <code>ContainerCreating<\/code> forever.<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">$ kubectl get pods -n production\nNAME                            READY   STATUS              RESTARTS   AGE\napi-gateway-7f8d9b6-x4z2n       0\/1     ContainerCreating   0          45m\napi-gateway-7f8d9b6-y9p1q       0\/1     ContainerCreating   0          45m\n<\/code><\/pre>\n<p>The &#8220;ContainerCreating&#8221; status is a lie. It\u2019s not creating anything. 
It\u2019s waiting for a network interface that will never come.<\/p>\n<hr \/>\n<h2><span class=\"ez-toc-section\" id=\"H2_INCIDENT-8829_The_YAML_Purgatory_Deployment_Controllers_and_the_Indentation_of_Doom\"><\/span>H2: [INCIDENT-8829] The YAML Purgatory: Deployment Controllers and the Indentation of Doom<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The VP asked, &#8220;Can&#8217;t we just change a setting to fix it?&#8221; <\/p>\n<p>Sure. Let&#8217;s talk about the &#8220;setting.&#8221; To fix the HPA death spiral, I had to manually edit the Deployment manifest. Kubernetes configuration is a sea of YAML. It\u2019s a language where a single missing space can bring down a multi-million dollar infrastructure. <\/p>\n<p>The <code>Deployment<\/code> controller is an abstraction over the <code>ReplicaSet<\/code>, which is an abstraction over the <code>Pod<\/code>. When you update a Deployment, you\u2019re actually creating a new <code>ReplicaSet<\/code> and telling the controller to scale the old one down and the new one up.<\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\">apiVersion: apps\/v1\nkind: Deployment\nmetadata:\n  name: api-gateway\nspec:\n  replicas: 500 # This was the mistake\n  selector:\n    matchLabels:\n      app: api-gateway\n  template:\n    metadata:\n      labels:\n        app: api-gateway\n    spec:\n      containers:\n      - name: gateway\n        image: our-registry.io\/api-gateway:v2.4.1\n        resources:\n          limits:\n            cpu: &quot;500m&quot;\n            memory: &quot;512Mi&quot;\n          requests:\n            cpu: &quot;200m&quot;\n            memory: &quot;256Mi&quot;\n<\/code><\/pre>\n<p>The &#8220;Desired State&#8221; was 500 replicas. The &#8220;Actual State&#8221; was a cluster with zero available CPU. The <code>kube-controller-manager<\/code> was stuck in a loop trying to fulfill a request that was physically impossible. 
<\/p>\n<p>I had to manually scale the deployment back down to a sane level while the API server was barely responding.<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">$ kubectl scale deployment api-gateway --replicas=50 -n production --timeout=10m\ndeployment.apps\/api-gateway scaled\n<\/code><\/pre>\n<p>I had to wait ten minutes for that command to acknowledge. Ten minutes of watching the 5xx error rate stay at 100%. This is the &#8220;high-velocity&#8221; development environment we were promised.<\/p>\n<hr \/>\n<h2><span class=\"ez-toc-section\" id=\"H2_INCIDENT-8829_Remediation_and_the_Bitter_Reality_of_%E2%80%9CSelf-Healing%E2%80%9D_Systems\"><\/span>H2: [INCIDENT-8829] Remediation and the Bitter Reality of &#8220;Self-Healing&#8221; Systems<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>It\u2019s 4:00 AM. The cluster is &#8220;stable,&#8221; if you define stability as &#8220;not actively on fire.&#8221; We\u2019ve recovered most of the services by manually killing the <code>etcd<\/code> pods one by one to force a leader re-election and clearing out the thousands of &#8220;FailedScheduling&#8221; events that were clogging the pipe.<\/p>\n<p>What is Kubernetes? After 48 hours of this, I can tell you what it isn&#8217;t. It isn&#8217;t a &#8220;magic solution.&#8221; It isn&#8217;t a way to ignore your infrastructure. It is a highly complex, often fragile framework that requires an immense amount of cognitive overhead to manage. <\/p>\n<p>The &#8220;Self-Healing&#8221; aspect of Kubernetes only works if the underlying resources (CPU, Memory, Disk I\/O, Network Bandwidth) are available and the control plane is healthy. If you lose the control plane, you lose the ability to heal. It\u2019s like saying a human body is self-healing, but then removing the nervous system and expecting the white blood cells to know where to go.<\/p>\n<p>We\u2019re running v1.29.2. We\u2019re using <code>networking.k8s.io\/v1<\/code>. We have all the latest features. 
And yet, we were undone by a simple misconfiguration of an HPA and a slow disk on an <code>etcd<\/code> node. <\/p>\n<p>The marketing fluff says Kubernetes &#8220;simplifies&#8221; things. It doesn&#8217;t. It just moves the complexity. Instead of managing individual servers, you\u2019re now managing the complex interactions between a dozen different distributed components. You\u2019re managing the <code>kubelet<\/code>\u2019s interaction with the Container Runtime Interface (CRI), usually <code>containerd<\/code> these days, and debugging why the <code>runc<\/code> binary is hanging on a cgroup mount.<\/p>\n<p>I\u2019m going to finish this lukewarm coffee. I\u2019m going to go home, and I\u2019m going to try to sleep without seeing YAML indentation in my dreams. But I know that tomorrow, some developer will push a change with no resource limits, or a &#8220;cloud-native&#8221; consultant will suggest we add a service mesh like Istio to &#8220;simplify&#8221; our networking, and the whole cycle will start again.<\/p>\n<p>Kubernetes is a beast. You don&#8217;t &#8220;master&#8221; it. You just survive it.<\/p>\n<p><strong>Action Items for Post-Mortem:<\/strong><br \/>\n1. Move <code>etcd<\/code> to NVMe drives. If I see another disk latency spike, I\u2019m quitting.<br \/>\n2. Implement <code>LimitRanges<\/code> and <code>ResourceQuotas<\/code> in every namespace to prevent developers from requesting 500 replicas of a &#8220;Hello World&#8221; app.<br \/>\n3. Switch <code>kube-proxy<\/code> to IPVS mode. I never want to see an iptables dump again.<br \/>\n4. Set up an alert for <code>etcd_mvcc_db_total_size_in_bytes<\/code>.<br \/>\n5. Buy better coffee for the SRE room. This stuff tastes like burnt plastic and regret.<\/p>\n<p><strong>Status: Recovered (For now).<\/strong><br \/>\n<strong>End of Report.<\/strong><\/p>\n<hr \/>\n
<h2><span class=\"ez-toc-section\" id=\"Related_Articles\"><\/span>Related Articles<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Explore more insights and best practices:<\/p>\n<ul>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/top-cybersecurity-jobs-in-2024-careers-salary-and-skills\/\">Top Cybersecurity Jobs In 2024 Careers Salary And Skills<\/a><\/li>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/docker-compose-guide\/\">Docker Compose Guide<\/a><\/li>\n<li><a href=\"https:\/\/oracle.itsupportwale.com\/blog\/kali-linux-2020-1-released-new-features-and-download\/\">Kali Linux 2020 1 Released New Features And Download<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>TIMESTAMP: 2024-05-22 04:12:08 UTC STATUS: CRITICAL \/ DEGRADED INCIDENT ID: #8829-BETA-CASCADING-FAILURE OPERATOR: SRE_042 (COFFEE_LEVEL: CRITICAL) The hum of the data center fans is a sound I can hear even in my own apartment now. It\u2019s a low-frequency vibration that lives in the base of my skull. I\u2019ve been staring at a Grafana dashboard for forty-eight &#8230; <a title=\"What is Kubernetes? A Complete Guide to Orchestration\" class=\"read-more\" href=\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/\" aria-label=\"Read more  on What is Kubernetes? A Complete Guide to Orchestration\">Read more<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-4487","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.0 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>What is Kubernetes? 
A Complete Guide to Orchestration - ITSupportWale<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"What is Kubernetes? A Complete Guide to Orchestration - ITSupportWale\" \/>\n<meta property=\"og:description\" content=\"TIMESTAMP: 2024-05-22 04:12:08 UTC STATUS: CRITICAL \/ DEGRADED INCIDENT ID: #8829-BETA-CASCADING-FAILURE OPERATOR: SRE_042 (COFFEE_LEVEL: CRITICAL) The hum of the data center fans is a sound I can hear even in my own apartment now. It\u2019s a low-frequency vibration that lives in the base of my skull. I\u2019ve been staring at a Grafana dashboard for forty-eight ... Read more\" \/>\n<meta property=\"og:url\" content=\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/\" \/>\n<meta property=\"og:site_name\" content=\"ITSupportWale\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\" \/>\n<meta property=\"article:published_time\" content=\"2026-01-31T15:33:15+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2026-02-10T06:14:44+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2021\/05\/android-chrome-512x512-1.png\" \/>\n\t<meta property=\"og:image:width\" content=\"512\" \/>\n\t<meta property=\"og:image:height\" content=\"512\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Techie\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Techie\" \/>\n\t<meta 
name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"11 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/\"},\"author\":{\"name\":\"Techie\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\"},\"headline\":\"What is Kubernetes? A Complete Guide to Orchestration\",\"datePublished\":\"2026-01-31T15:33:15+00:00\",\"dateModified\":\"2026-02-10T06:14:44+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/\"},\"wordCount\":1895,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/\",\"name\":\"What is Kubernetes? 
A Complete Guide to Orchestration - ITSupportWale\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\"},\"datePublished\":\"2026-01-31T15:33:15+00:00\",\"dateModified\":\"2026-02-10T06:14:44+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/itsupportwale.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"What is Kubernetes? A Complete Guide to Orchestration\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"name\":\"ITSupportWale\",\"description\":\"Tips, Tricks, Fixed-Errors, Tutorials &amp; 
Guides\",\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/itsupportwale.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\",\"name\":\"itsupportwale\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png\",\"contentUrl\":\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png\",\"width\":1119,\"height\":144,\"caption\":\"itsupportwale\"},\"image\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\",\"name\":\"Techie\",\"sameAs\":[\"https:\/\/itsupportwale.com\",\"iswblogadmin\"],\"url\":\"https:\/\/itsupportwale.com\/blog\/author\/iswblogadmin\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"What is Kubernetes? A Complete Guide to Orchestration - ITSupportWale","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/","og_locale":"en_US","og_type":"article","og_title":"What is Kubernetes? 
A Complete Guide to Orchestration - ITSupportWale","og_description":"TIMESTAMP: 2024-05-22 04:12:08 UTC STATUS: CRITICAL \/ DEGRADED INCIDENT ID: #8829-BETA-CASCADING-FAILURE OPERATOR: SRE_042 (COFFEE_LEVEL: CRITICAL) The hum of the data center fans is a sound I can hear even in my own apartment now. It\u2019s a low-frequency vibration that lives in the base of my skull. I\u2019ve been staring at a Grafana dashboard for forty-eight ... Read more","og_url":"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/","og_site_name":"ITSupportWale","article_publisher":"https:\/\/www.facebook.com\/Itsupportwale-298547177495978","article_published_time":"2026-01-31T15:33:15+00:00","article_modified_time":"2026-02-10T06:14:44+00:00","og_image":[{"width":512,"height":512,"url":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2021\/05\/android-chrome-512x512-1.png","type":"image\/png"}],"author":"Techie","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Techie","Est. reading time":"11 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/#article","isPartOf":{"@id":"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/"},"author":{"name":"Techie","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d"},"headline":"What is Kubernetes? 
A Complete Guide to Orchestration","datePublished":"2026-01-31T15:33:15+00:00","dateModified":"2026-02-10T06:14:44+00:00","mainEntityOfPage":{"@id":"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/"},"wordCount":1895,"commentCount":0,"publisher":{"@id":"https:\/\/itsupportwale.com\/blog\/#organization"},"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/","url":"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/","name":"What is Kubernetes? A Complete Guide to Orchestration - ITSupportWale","isPartOf":{"@id":"https:\/\/itsupportwale.com\/blog\/#website"},"datePublished":"2026-01-31T15:33:15+00:00","dateModified":"2026-02-10T06:14:44+00:00","breadcrumb":{"@id":"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-complete-guide-to-orchestration\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/itsupportwale.com\/blog\/"},{"@type":"ListItem","position":2,"name":"What is Kubernetes? 
A Complete Guide to Orchestration"}]},{"@type":"WebSite","@id":"https:\/\/itsupportwale.com\/blog\/#website","url":"https:\/\/itsupportwale.com\/blog\/","name":"ITSupportWale","description":"Tips, Tricks, Fixed-Errors, Tutorials &amp; Guides","publisher":{"@id":"https:\/\/itsupportwale.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/itsupportwale.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/itsupportwale.com\/blog\/#organization","name":"itsupportwale","url":"https:\/\/itsupportwale.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png","contentUrl":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png","width":1119,"height":144,"caption":"itsupportwale"},"image":{"@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/Itsupportwale-298547177495978"]},{"@type":"Person","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d","name":"Techie","sameAs":["https:\/\/itsupportwale.com","iswblogadmin"],"url":"https:\/\/itsupportwale.com\/blog\/author\/iswblogadmin\/"}]}},"_links":{"self":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts\/4487","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/itsup
portwale.com\/blog\/wp-json\/wp\/v2\/comments?post=4487"}],"version-history":[{"count":3,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts\/4487\/revisions"}],"predecessor-version":[{"id":4551,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts\/4487\/revisions\/4551"}],"wp:attachment":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/media?parent=4487"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/categories?post=4487"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/tags?post=4487"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}