```bash
$ kubectl get pods -A
NAMESPACE       NAME                                        READY   STATUS                       RESTARTS      AGE
kube-system     coredns-7689d884b-l2v98                     0/1     CrashLoopBackOff             42 (3m ago)   72h
kube-system     kube-proxy-z4m2n                            0/1     Error                        15            72h
production      api-gateway-v2-7f5d9c8d4b-9w2k1             0/2     ImagePullBackOff             0             14m
production      order-processor-5566778899-abc12            0/1     CreateContainerConfigError   0             12m
production      payment-service-8899aabbcc-xyz34            0/1     Terminating                  0             72h
production      auth-service-66778899aa-def56               0/1     Pending                      0             5m
monitoring      prometheus-server-0                         0/1     CrashLoopBackOff             112           72h
ingress-nginx   ingress-nginx-controller-646d5d4d54-m9s2z   0/1     ContainerCreating            0             2m
kube-system     etcd-ip-10-0-64-12.ec2.internal             0/1     Error                        9             72h
```
```bash
$ kubectl describe node ip-10-0-64-12.ec2.internal
Name:   ip-10-0-64-12.ec2.internal
Status: NotReady
Conditions:
  Type                 Status  LastHeartbeatTime                Reason                        Message
  ----                 ------  -----------------                ------                        -------
  NetworkUnavailable   False   Thu, 24 Oct 2024 03:14:22 +0000  RouteCreated                  RouteCreated
  MemoryPressure       True    Thu, 24 Oct 2024 04:45:10 +0000  KubeletHasInsufficientMemory  kubelet has insufficient memory available
  DiskPressure         False   Thu, 24 Oct 2024 04:45:10 +0000  KubeletHasNoDiskPressure      kubelet has no disk pressure
  PIDPressure          False   Thu, 24 Oct 2024 04:45:10 +0000  KubeletHasNoPidPressure       kubelet has no pid pressure
  Ready                False   Thu, 24 Oct 2024 04:45:10 +0000  KubeletNotReady               runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized
Events:
  Type     Reason                Age  From     Message
  ----     ------                ---  ----     -------
  Warning  EvictionThresholdMet  5m   kubelet  Threshold observed at memory.available=482Mi, boundary=500Mi
```
The sun is coming up. Or maybe it’s going down. I can’t tell because the blinds are drawn and the only light in this room comes from a 32-inch monitor displaying a wall of red text and a terminal window that looks like a crime scene. My hands are shaking, not from caffeine—though I’ve consumed enough cold espresso to stop a horse’s heart—but from the sheer, unadulterated adrenaline of watching 4,000 nodes commit collective suicide because of a single character error in a ValidatingWebhookConfiguration.
You might be asking what the point of all this abstraction is. Why do we subject ourselves to this? We took the relatively simple problem of running a binary on a server and wrapped it in fourteen layers of YAML, virtual networking, and distributed consensus algorithms until it became a sentient beast that hates us.
Kubernetes version 1.30 was supposed to be "stable." They talked about "Structured Authentication" and "Node Log Query." They didn't talk about how, when the admission controller goes dark, the API server starts choking on its own tongue, and the entire control plane turns into a circular firing squad.
## The Control Plane: The Brain That Forgets
The Control Plane is marketed as the "brain" of the cluster. In reality, it’s a collection of anxious bureaucrats who refuse to talk to each other unless they have a signed certificate and a 200 OK response. At the center sits `kube-apiserver`. It is the only thing that talks to the database. Everything else—the scheduler, the controller manager, your frantic `kubectl` commands—is just a client.
When the outage hit at 3 AM three days ago, the `kube-apiserver` wasn't just failing; it was screaming. A misconfigured admission controller—a piece of code meant to "validate" objects before they are persisted—was pointing to a service that didn't exist anymore. Because the webhook was set to `failurePolicy: Fail`, the API server stopped accepting *any* pod updates.
[INTERNAL MONOLOGUE: Why did we let the junior dev touch the webhooks? Why did I approve the PR? I was thinking about lunch. I was thinking about a sandwich while I signed the death warrant for our production environment.]
In v1.30, the API server is more "efficient," which just means it fails faster. When the webhook timed out, the request latencies spiked. The `kube-controller-manager` noticed the nodes weren't reporting in because their status updates were being rejected by the same broken webhook. It did what it was programmed to do: it assumed the nodes were dead and started rescheduling 10,000 pods. But it couldn't create the new pods because—you guessed it—the webhook was failing.
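The speed of that death spiral is governed by a handful of `kube-controller-manager` flags. A sketch of the relevant knobs, using the upstream defaults for illustration rather than our actual settings:

```yaml
# Static-pod command fragment for kube-controller-manager. These flags decide
# how quickly a silent node is declared NotReady and how fast pods are
# evicted off it. Values shown are the upstream defaults, not a recommendation.
- command:
    - kube-controller-manager
    - --node-monitor-grace-period=40s     # silence tolerated before NotReady
    - --node-eviction-rate=0.1            # nodes "emptied" per second, normally
    - --unhealthy-zone-threshold=0.55     # past this fraction of unhealthy nodes...
    - --secondary-node-eviction-rate=0.01 # ...eviction slows to this, or stops
```

During a control-plane outage these defaults mean the controller starts declaring nodes dead in well under a minute, which is exactly the behavior that turned one broken webhook into ten thousand rescheduling attempts.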
Here is the `ResourceQuota` we had in place, which did absolutely nothing to stop the cascading failure because the failure happened before the quota was even checked:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
  namespace: production
spec:
  hard:
    requests.cpu: "200"
    requests.memory: 500Gi
    limits.cpu: "400"
    limits.memory: 800Gi
    pods: "1000"
    services: "50"
    replicationcontrollers: "20"
    resourcequotas: "1"
```
The Control Plane is a lie. It’s a series of loops. The “Reconciliation Loop” is just a fancy way of saying “I’m going to keep trying to do this thing until I die or the universe ends.” When the state in etcd doesn’t match the state on the ground, the controllers panic.
## etcd: The Consistency Nightmare
If the API server is the brain, etcd is the memory. And our memory is currently corrupted by a thousand “deadline exceeded” errors. etcd uses the Raft consensus algorithm. It requires a majority to agree on anything. If you have three nodes and two of them stop talking because the underlying EBS volume decided to have a “moment,” your cluster is a brick.
During the height of the outage, etcd was reporting disk sync durations in the seconds.
```bash
# journalctl -u etcd
Oct 24 02:15:10 etcd-node-1 etcd[1234]: slow HTTP response from 10.0.64.12:2379 took 2.4512s
Oct 24 02:15:12 etcd-node-1 etcd[1234]: failed to send out heartbeat on time (exceeded 100ms)
Oct 24 02:15:12 etcd-node-1 etcd[1234]: server is likely overloaded
```
When etcd lags, the world stops. The API server can’t write the “I’m alive” heartbeat from the Kubelet. The Control Plane thinks the node is gone. It marks it as Unknown. It tries to move the work. But the work can’t move. You end up with “Ghost Pods”—containers running on a node that the API server thinks is empty, while the scheduler tries to cram more containers onto that same exhausted hardware.
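If the disks are genuinely slow and you cannot replace them immediately, etcd lets you trade failover speed for stability by relaxing the Raft timing. A hedged sketch of the two flags involved (values illustrative; the defaults are 100ms and 1000ms, and the election timeout should stay roughly 10x the heartbeat):

```yaml
# etcd static-pod command fragment: loosen Raft timing so one slow fsync
# doesn't trigger a spurious leader election. Illustrative values only.
- command:
    - etcd
    - --heartbeat-interval=300   # ms; default 100
    - --election-timeout=3000    # ms; default 1000, keep ~10x the heartbeat
```

The cost is real: with a 3-second election timeout, a genuine leader failure takes up to 3 seconds longer to recover from. It is a dial, not a free lunch.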
## The Kubelet: The Overworked Janitor
On every single node, there is a kubelet. This is the most honest piece of software in the whole stack. It doesn’t care about “service meshes” or “serverless.” It just wants to run containers. It watches the API server for pods assigned to its node, and then it talks to the container runtime to make it happen.
But the kubelet is a snitch. It constantly reports back on the health of the node. When the network is saturated because your CNI is flapping, the kubelet can’t send its heartbeats.
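The heartbeat cadence itself is configurable. A KubeletConfiguration sketch, shown with the upstream defaults for illustration:

```yaml
# KubeletConfiguration fragment: how often the kubelet phones home. The Lease
# renewal in the kube-node-lease namespace is the cheap heartbeat; the full
# NodeStatus write is the expensive one. Values are the upstream defaults.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
nodeStatusUpdateFrequency: 10s   # full NodeStatus writes when something changed
nodeStatusReportFrequency: 5m    # full writes even when nothing changed
nodeLeaseDurationSeconds: 40     # the lightweight Lease-based heartbeat
```

When the API server is rejecting writes, none of these numbers matter: the kubelet retries, the retries add load, and the load makes the API server reject more writes.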
[INTERNAL MONOLOGUE: I can hear the fans in the server room from here. They sound like a jet engine taking off. That’s the sound of the Kubelet trying to calculate cgroup metrics while the CPU is throttled to 10%.]
Look at this journalctl output from one of the dying nodes. This is what a nervous breakdown looks like in Go:
```bash
# journalctl -u kubelet -f
Oct 24 04:50:01 node-1 kubelet[998]: E1024 04:50:01.123456  998 pod_workers.go:1294] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"gateway\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=gateway pod=api-gateway-v2-7f5d9c8d4b-9w2k1_production\""
Oct 24 04:50:05 node-1 kubelet[998]: I1024 04:50:05.555555  998 status_manager.go:652] "Failed to update status" pod="production/payment-service-8899aabbcc-xyz34" err="node \"ip-10-0-64-12.ec2.internal\" not found"
Oct 24 04:50:10 node-1 kubelet[998]: E1024 04:50:10.888888  998 kubelet.go:2450] "Error getting node" err="node \"ip-10-0-64-12.ec2.internal\" not found"
```
The kubelet is trying to update the status of a pod, but the API server is telling it that the node it’s running on doesn’t exist. This is the gaslighting of the SRE. You are staring at a server, logged in via SSH, and the system is telling you the server is a figment of your imagination.
In v1.30, the Kubelet has better handling for memory swap, but that doesn’t help when your containerd socket is unresponsive because the kernel is OOM-killing the runtime itself.
## Container Runtime: The Shaky Foundation
Kubernetes doesn’t actually run containers. It asks containerd or CRI-O to do it. This is the Container Runtime Interface (CRI). It’s another layer of indirection. When you see ImagePullBackOff, it’s usually not because the image isn’t there. It’s because the runtime’s credentials expired, or the CNI failed to set up the bridge interface, or the disk is so slow that the extraction of the layer timed out.
We had a Deployment that looked like this. It’s a standard piece of garbage, full of “best practices” that become “worst nightmares” during an outage:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway-v2
  namespace: production
spec:
  replicas: 10
  selector:
    matchLabels:
      app: api-gateway
  template:
    metadata:
      labels:
        app: api-gateway
    spec:
      containers:
        - name: gateway
          image: our-priv-reg.io/api-gateway:v2.1.4
          ports:
            - containerPort: 8080
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "1000m"
              memory: "2Gi"
```
During the outage, the livenessProbe was the killer. Because the network was congested, the probes timed out. The kubelet then killed the container. This triggered a restart. The restart triggered an image pull. The image pull triggered more network traffic. The network traffic caused more probe timeouts. It’s a self-amplifying feedback loop of failure.
[INTERNAL MONOLOGUE: I should have used startupProbes. I knew it. I wrote the documentation on it. And yet, here I am, watching my own creation choke itself to death.]
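For the record, here is the startupProbe I should have written. While a startupProbe is failing, liveness and readiness checks are suspended, so a slow cold start under a congested network cannot trigger the restart storm (thresholds are illustrative, not tuned):

```yaml
# startupProbe sketch: gives the container up to 30 * 10s = 5 minutes to come
# up before the livenessProbe is allowed to kill anything. Endpoint reuses
# the /healthz path from the Deployment above.
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
```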
## CNI: The Plumbing from Hell
The Container Network Interface (CNI) is where the real dark magic happens. This is what allows a pod on Node A to talk to a pod on Node B. It involves BGP, VXLAN, Geneve, or just a massive pile of iptables rules that nobody understands.
When the CNI fails, it fails silently. You’ll see “Ready” nodes, but no traffic flows. In our case, the CNI (we’re using Cilium, because we like complexity) couldn’t allocate IPs because the CiliumNode custom resource couldn’t be updated in the API server.
What is a network in Kubernetes? It’s a hallucination. It’s a series of virtual interfaces (veth pairs) and routing table entries that are constantly being rewritten. In v1.30, there are improvements in how the NodeIPAM handles ranges, but that doesn’t matter when your iptables-restore command is taking 30 seconds to run because you have 50,000 services.
Every time a pod is created, the CNI has to:
1. Create a network namespace.
2. Create a veth pair.
3. Attach one end to the container and the other to the host bridge or OVS.
4. Assign an IP.
5. Set up routes.
6. Configure NAT for egress.
If any of those steps fail—say, because the kernel is locked up trying to process a million packets—the pod stays in ContainerCreating forever.
## The Admission Controller: The Gatekeeper with a Grudge
This was the source of the 72-hour hell. Admission controllers are plugins that govern what the API server allows. There are built-in ones (like ResourceQuota) and “Dynamic” ones (Mutating and Validating Webhooks).
We use a Validating Webhook to ensure no one deploys a container as root. It’s a noble goal. But the webhook is an external service. When that service’s pod was evicted due to “Memory Pressure” (caused by the very outage it was about to exacerbate), the API server couldn’t reach it.
Because the webhook was configured like this:
```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: security-policy-checker
webhooks:
  - name: check-privileges.example.com
    rules:
      - operations: ["CREATE", "UPDATE"]
        apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]
    clientConfig:
      service:
        namespace: security
        name: policy-webhook
    failurePolicy: Fail   # <--- THIS IS THE LOADED GUN
    sideEffects: None
    admissionReviewVersions: ["v1"]
    timeoutSeconds: 30
```
The failurePolicy: Fail meant that if the webhook didn’t respond, the Pod creation/update was rejected. Since the webhook itself was a Pod, and it was down, the scheduler couldn’t restart it because the API server couldn’t validate the new Pod. It was a deadlock. A perfect, beautiful circle of nothingness.
I had to manually edit the ValidatingWebhookConfiguration via kubectl while the API server was timing out, just to set that policy to Ignore. It took two hours just to get that one command to go through.
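What the webhook should have looked like from day one: fail open, time out fast, and exempt the namespaces the control plane needs to heal itself. A sketch of the three webhook-level settings (the excluded namespaces are illustrative):

```yaml
# Webhook-level fragment: the settings that turn a deadlock into a mere
# degradation. Excluding the webhook's own namespace breaks the circular
# dependency that kept our cluster down.
failurePolicy: Ignore   # a dead webhook should degrade security, not the cluster
timeoutSeconds: 5       # fail fast; 30s of added API latency is its own outage
namespaceSelector:
  matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values: ["kube-system", "security"]
```

`failurePolicy: Ignore` is a genuine security trade-off: an attacker who can DoS the webhook can then deploy unvalidated pods. The namespace exclusion is the part with no real downside.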
## The PodDisruptionBudget: The Final Insult
When I finally tried to drain the nodes to reset the runtime, I was stopped by the PodDisruptionBudget (PDB).
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: production
spec:
  minAvailable: "80%"
  selector:
    matchLabels:
      app: api-gateway
```
The PDB said "You cannot take this node down because it would drop the availability of the API Gateway below 80%." But the API Gateway was already at 0% availability because of the CrashLoopBackOff. The cruel part is that the eviction API only counts *healthy* (Ready) pods against the budget: with zero Ready pods, the budget was already violated, and the default policy refuses to evict even the unhealthy ones, because that could make things "worse." So I couldn't drain the nodes to fix the nodes because the nodes were broken.
I had to delete the PDBs. I had to delete the webhooks. I had to delete my own pride.
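In hindsight, there was a lever that would have spared the PDBs: since v1.27, a PodDisruptionBudget can opt out of protecting pods that are not Ready. A spec fragment:

```yaml
# PDB spec fragment: AlwaysAllow lets the eviction API remove pods that are
# Running but not Ready, even when the budget is already violated. The
# default (IfHealthyBudget) is what blocked our drain.
spec:
  minAvailable: "80%"
  unhealthyPodEvictionPolicy: AlwaysAllow
```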
## Linux Primitives: The Real World Under the Hood
At the end of the day, Kubernetes is just a very expensive wrapper around Linux primitives. If you strip away the YAML and the Go binaries, you are left with namespaces and cgroups.
Namespaces are the isolation:
- mnt: separate filesystem mounts.
- net: separate network stacks.
- pid: the container's first process thinks it is PID 1.
- uts: separate hostnames.
- ipc: isolated inter-process communication.
- user: root inside the container mapped to an unprivileged UID on the host.
Cgroups (Control Groups) are the resource limits. In v1.30, we are fully into the cgroup v2 era. This is what enforces the memory limits that OOM-kill your Java apps. When you set a memory limit of 2Gi, that becomes `memory.max` on the container's cgroup, and the kernel's OOM killer is watching it. The moment the cgroup's memory usage crosses that line and reclaim can't free enough pages, the kernel sends a SIGKILL.
Kubernetes tries to be smart about this, but the kernel is brutal. There is no “graceful shutdown” for an OOM kill. The process is just gone. The kubelet sees the process is gone, looks at the exit code (137), and realizes it was an OOM kill. Then it updates the pod status.
[INTERNAL MONOLOGUE: I’m staring at the top output on a node. The load average is 450. On a 64-core machine. The system is spending all its time in iowait. The disk is dying. The containers are dying. I am dying.]
## What is the point?
What is the point of a system that is so complex it requires a 72-hour war room to fix a single config change? We wanted “self-healing” infrastructure. What we got was a system that is very good at healing itself from small, predictable failures, but spectacularly good at accelerating large, unpredictable ones.
Kubernetes v1.30 is a marvel of engineering. It is also a testament to our hubris. We have built a platform that can scale to 5,000 nodes but can be brought to its knees by a single malformed YAML file.
I’m going to finish this coffee. It’s cold and tastes like battery acid. Then I’m going to delete the remaining Evicted pods, check the etcd leader election metrics one last time, and go to sleep for a week. Or at least until the next PagerDuty alert at 3 AM.
Because the “brain” never sleeps. It just waits for you to make a mistake.
```bash
$ kubectl get nodes
NAME                         STATUS   ROLES           AGE   VERSION
ip-10-0-64-12.ec2.internal   Ready    control-plane   72h   v1.30.1
ip-10-0-64-13.ec2.internal   Ready    worker          72h   v1.30.1
ip-10-0-64-14.ec2.internal   Ready    worker          72h   v1.30.1
```