Mastering Kubernetes Docs: A Guide for Cloud Engineers

The pager went off at 3:14 AM, a timestamp I’ve come to associate with the smell of burnt coffee and the inevitable realization that our high-availability setup was a lie.

I was three hours into what I thought was a “stable” sleep cycle after a week of migrating our production clusters from v1.28.x to v1.30.1 on a custom Debian bookworm image. The alert wasn’t a gentle nudge; it was a screaming banshee in the form of a PagerDuty “Critical” notification: TargetDown across the entire ingress-nginx fleet, followed immediately by KubeNodeNotReady for 40% of the cluster.

I stared at my monitor, the blue light searing my retinas, and watched the terminal output of kubectl get nodes scroll by like a digital obituary.

1. The 3:14 AM Alert: When “High Availability” Becomes a Joke

The cluster was a ghost town. Pods were stuck in ContainerCreating. Probes were failing. The API server was responding, but it was sluggish, gasping for air as etcd heartbeats started spiking to 500ms.

$ kubectl get nodes
NAME             STATUS     ROLES           AGE   VERSION
ip-10-0-42-101   NotReady   control-plane   14d   v1.30.1
ip-10-0-42-102   Ready      control-plane   14d   v1.30.1
ip-10-0-42-103   NotReady   control-plane   14d   v1.30.1
ip-10-0-45-12    NotReady   worker          14d   v1.30.1
ip-10-0-45-13    NotReady   worker          14d   v1.30.1
ip-10-0-45-14    NotReady   worker          14d   v1.30.1

I checked the taints. Every single NotReady node had the dreaded node.kubernetes.io/network-unavailable:NoSchedule taint. This is the Kubernetes equivalent of a “Do Not Resuscitate” order. While it is set, the scheduler won’t place anything new on the node unless the pod explicitly tolerates the taint. But why now? We hadn’t touched the CNI config in weeks. Or so I thought.
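For the record, this is roughly how I’d list the tainted nodes today instead of eyeballing kubectl describe. It’s a minimal sketch, not what I typed at 3 AM: nodes_with_taint is a helper name I made up, and it assumes the two-column NAME/TAINTS table that kubectl’s custom-columns output produces.

```shell
# Print the names of nodes whose taint list contains the given key.
# Reads "NAME TAINTS" rows on stdin, where TAINTS is a comma-separated
# list of taint keys (or <none> when the node has no taints).
nodes_with_taint() {
  key="$1"
  awk -v key="$key" '
    NR == 1 && $1 == "NAME" { next }   # skip the header row
    {
      n = split($2, taints, ",")
      for (i = 1; i <= n; i++)
        if (taints[i] == key) { print $1; break }
    }'
}

# Usage against a live cluster (not run here):
# kubectl get nodes \
#   -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key' \
#   | nodes_with_taint node.kubernetes.io/network-unavailable
```

Filtering on the taint key rather than on the NotReady status matters because a node can be NotReady for reasons that have nothing to do with the network.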

I pulled the logs from a dying Kubelet on one of the worker nodes.

$ journalctl -u kubelet -n 100 --no-pager
May 14 03:16:22 ip-10-0-45-12 kubelet[1204]: E0514 03:16:22.124532    1204 cni.go:205] "Error validating CNI config list" err="[failed to find plugin \"aws-cni\" in path [/opt/cni/bin]]" config="{\"cniVersion\":\"1.0.0\",\"name\":\"aws-cni\",\"plugins\":[{\"type\":\"aws-cni\"}]}"
May 14 03:16:24 ip-10-0-45-12 kubelet[1204]: E0514 03:16:24.442101    1204 kubelet.go:2452] "Error updating node status, retrying" err="node \"ip-10-0-45-12\" not found"
May 14 03:16:25 ip-10-0-45-12 kubelet[1204]: I0514 03:16:25.101221    1204 network_linux.go:88] "Setting up network priority"

“Failed to find plugin.” My heart sank. We use a custom CNI chain with Cilium sitting on top of the AWS VPC CNI for secondary IP exhaustion management. Somewhere in the dark, the binary had vanished or the path had shifted.
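A quick way to confirm that kind of error is to compare the plugin types named in the CNI configs against what is actually on disk. The sketch below assumes the conventional /etc/cni/net.d and /opt/cni/bin paths. missing_cni_plugins is a hypothetical helper, and its grep-based parsing treats every "type" value in the JSON as a plugin binary name, which is a simplification.

```shell
# List plugin types referenced by CNI configs in conf_dir that have no
# matching executable in bin_dir. Prints "<config file>: <plugin type>".
# Assumption: every "type" value in the config names a plugin binary.
missing_cni_plugins() {
  conf_dir="${1:-/etc/cni/net.d}"
  bin_dir="${2:-/opt/cni/bin}"
  for f in "$conf_dir"/*.conf "$conf_dir"/*.conflist "$conf_dir"/*.json; do
    [ -f "$f" ] || continue   # skip patterns that matched nothing
    # Pull every "type": "<name>" value out of the JSON without needing jq.
    grep -o '"type"[[:space:]]*:[[:space:]]*"[^"]*"' "$f" \
      | sed 's/.*"\([^"]*\)"$/\1/' \
      | sort -u \
      | while read -r plugin; do
          [ -x "$bin_dir/$plugin" ] || echo "$f: $plugin"
        done
  done
}

# Usage on a node (not run here):
# missing_cni_plugins /etc/cni/net.d /opt/cni/bin
```

An empty result means every referenced plugin has an executable in place, which at least rules out the “binary vanished” theory before you go digging further.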

2. The Rabbit Hole: Searching for Answers in a Sea of Fluff

I did what every desperate SRE does: I went to the Kubernetes docs. I was looking for the specific interaction between the Kubelet’s --cni-bin-dir flag and how v1.30.1 handles plugin discovery when multiple configuration files exist in /etc/cni/net.d/.

The Kubernetes docs are a peculiar beast. They are written for a version of the world that doesn’t exist: a world where every cluster is a “Hello World” Minikube instance running on a developer’s laptop. I searched for “CNI plugin troubleshooting.” I was greeted with a “Tasks” section that told me how to install a CNI. I don’t need to know how to install it; I need to know why the Kubelet is suddenly blind to a binary that has been sitting in /opt/cni/bin for six months.

I navigated to the “Reference” section of the Kubernetes docs. This is where the real pain begins. The Reference API docs are essentially a dump of the Go structs. They tell you that a field exists, but they don’t tell you why it exists or what the side effects are when you change it. I was looking for the NetworkReady condition logic. The docs told me: “NetworkReady: True if the network for the node is correctly configured, False otherwise.”

Thanks, Captain Obvious. My cluster is on fire, and you’re giving me tautologies.

I spent the next four hours digging through the “Concepts” pages. I wanted to understand the transition from node.kubernetes.io/network-unavailable to a Ready state. The Kubernetes docs suggested that the CNI plugin is responsible for clearing this taint. But which one? In a chained setup, is it the first plugin or the last? The docs were silent. They didn’t mention the race condition that occurs when the cloud-controller-manager initializes the node and sets the taint, but the CNI provider is waiting for the node to be “Ready” before it deploys its DaemonSet. It’s a circular dependency from hell.

3. The Betrayal: When Documentation and Reality Diverge

By 7:00 AM, the sun was coming up, and I was on my tenth cup of coffee. I had discovered a discrepancy. The Kubernetes docs for v1.30 claim that the KubeletConfiguration field cniConfDir defaults to /etc/cni/net.d. However, looking at our kubeadm init configuration and the actual running process, the Kubelet was ignoring half the files in that directory.

I decided to check the source code. This is the SRE’s ultimate admission of defeat: when the Kubernetes docs are so high-level that you have to read the actual Go implementation to understand how your production environment works.

I pulled up pkg/kubelet/network/cni/cni.go in the Kubernetes GitHub repo. I compared it to what the Kubernetes docs said about “CNI Plugin Selection.”

The docs say: “The Kubelet picks the first alphabetically ordered configuration file in the directory.”
The code said: Hold my beer.

In v1.29 and v1.30, there’s a subtle change in how the libcni library is invoked. If there’s a .conflist file and a .conf file, the behavior isn’t just “alphabetical.” There’s a specific logic that prioritizes configuration lists over individual configs, but only if the cniVersion matches specific criteria. None of this was in the Kubernetes docs. Not a word.

I looked at our /etc/cni/net.d/:

$ ls /etc/cni/net.d/
05-cilium.conflist
10-aws.conf
99-loopback.conf

The Kubelet was supposed to pick 05-cilium.conflist. Instead, it was choking on a ghost reference to aws-cni that shouldn’t have even been active. I realized that during the v1.30.1 upgrade, the kubeadm join process had somehow dropped a default CNI config that was fighting with our Cilium manifests.
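If you want to sanity-check which file “should” win, the documented rule is easy to reproduce by hand. This is only a sketch of the “first file in lexical order” behavior the docs describe, limited to the .conf, .conflist and .json extensions; first_cni_config is a hypothetical helper, and whether your runtime really follows that rule is exactly the question that bit me.

```shell
# Print the name of the CNI config that sorts first in conf_dir,
# considering only .conf, .conflist and .json files.
# Assumption: selection is purely lexical, as the docs describe.
first_cni_config() {
  conf_dir="${1:-/etc/cni/net.d}"
  ls -1 "$conf_dir" 2>/dev/null \
    | grep -E '\.(conf|conflist|json)$' \
    | LC_ALL=C sort \
    | head -n 1
}

# Usage on a node (not run here):
# first_cni_config /etc/cni/net.d
```

Forcing the C locale keeps the sort byte-wise, so the answer doesn’t change depending on how the node’s locale happens to collate digits and punctuation.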

4. The Raw Truth: What Actually Happened in the Terminal

I needed to see what the Kubelet saw. I turned up the verbosity to --v=4 (because --v=5 is just a firehose of etcd heartbeats that will crash your terminal buffer).

$ systemctl stop kubelet
$ /usr/bin/kubelet --v=4 --config=/var/lib/kubelet/config.yaml --container-runtime-endpoint=unix:///run/containerd/containerd.sock ... [truncated]

The logs started screaming. It wasn’t just a path issue. It was an admission controller conflict. We had a custom MutatingAdmissionWebhook that was supposed to inject sidecars into our CNI pods (don’t ask, it was a “security requirement” from a guy who left the company two years ago). Because the network was down, the webhook, which was running on the cluster itself, couldn’t be reached.

Because the webhook couldn’t be reached, the API server refused to admit any new pods. Because no new pods could be created, the CNI pods (which had been killed during the upgrade) couldn’t restart.

It was a deadlock. A perfect, beautiful, catastrophic loop.

I checked the Kubernetes docs for “Admission Webhook Fail-Open Policy.” The docs said to set failurePolicy: Ignore. I checked our YAML. It was set to Ignore.

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: "sidecar-injector"
webhooks:
  - name: "injector.example.com"
    failurePolicy: Ignore
    rules:
      - operations: ["CREATE"]
        apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]

So why was it failing? I went back to the terminal.

$ kubectl get events -A --sort-by='.lastTimestamp'
NAMESPACE     LAST SEEN   TYPE      REASON             OBJECT                                  MESSAGE
kube-system   2m          Warning   FailedCreate       daemonset/cilium                        Error creating: Internal error occurred: failed calling webhook "injector.example.com": Post "https://injector.kube-system.svc:443/?timeout=10s": context deadline exceeded

“Context deadline exceeded.” Even with failurePolicy: Ignore, the API server was waiting for a 10-second timeout for every single pod creation request. With 500 nodes trying to spin up CNI pods, the API server’s request queue was backed up into the stratosphere. The Kubernetes docs failed to mention that failurePolicy: Ignore doesn’t mean “skip immediately”; it means “wait for the timeout and then ignore.” When your timeout is 10 seconds and you have 2,000 pods in a crash loop, your cluster is effectively dead.
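The arithmetic is worth writing down. The sketch below computes a deliberately pessimistic upper bound, assuming the pod creations were handled strictly one after another (the API server does work concurrently, so treat it as a ceiling, not a measurement). Both function names are mine; the patch helper assumes the webhook in question is the first entry in the configuration.

```shell
# Worst-case seconds spent waiting on a dead webhook if requests are
# handled one after another: requests * timeout.
webhook_wait_seconds() {
  requests="$1"
  timeout_s="$2"
  echo $(( requests * timeout_s ))
}

webhook_wait_seconds 2000 10   # prints 20000 (about 5.5 hours)
webhook_wait_seconds 2000 1    # prints 2000  (about 33 minutes)

# Lower the timeout on the first webhook entry of a configuration.
# Assumption: the entry you care about is at index 0. Not run here.
set_webhook_timeout() {
  kubectl patch mutatingwebhookconfiguration "$1" --type=json \
    -p "[{\"op\":\"add\",\"path\":\"/webhooks/0/timeoutSeconds\",\"value\":$2}]"
}
# set_webhook_timeout sidecar-injector 1
```

Even at one second the ceiling is uncomfortable, which is an argument for also narrowing the webhook’s rules so it never sees system pods in the first place.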

5. The Patch: The Hacky Fix That Saved the Day

I was 36 hours in. My eyes felt like they were filled with sand. I had two choices: try to fix the CNI config properly or perform a lobotomy on the cluster to get it breathing again. I chose the lobotomy.

First, I had to kill the webhook. But I couldn’t use kubectl delete because the API server was too bogged down by the timeout-induced backpressure.

I had to go into the control-plane nodes and manually edit the kube-apiserver.yaml static pod manifest to disable the MutatingAdmissionWebhook admission plugin temporarily.

$ ssh control-plane-01
$ sudo vi /etc/kubernetes/manifests/kube-apiserver.yaml
# Added MutatingAdmissionWebhook to --disable-admission-plugins. The plugin is on by
# default, so removing it from --enable-admission-plugins alone does not turn it off.
$ sudo systemctl restart kubelet
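One caution if you repeat this: take your backup somewhere other than /etc/kubernetes/manifests, because as far as I know the kubelet tries to treat every file in that directory as a static pod manifest. A minimal sketch, with backup_manifest and its default backup directory being my own inventions:

```shell
# Copy a static pod manifest to a backup directory OUTSIDE the manifests
# directory and print the path of the backup.
backup_manifest() {
  manifest="$1"
  backup_dir="${2:-/root/manifest-backups}"   # hypothetical default location
  mkdir -p "$backup_dir" || return 1
  dest="$backup_dir/$(basename "$manifest").$(date +%Y%m%dT%H%M%S)"
  cp -p "$manifest" "$dest" && echo "$dest"
}

# Usage on a control-plane node (not run here):
# sudo sh -c '. ./backup.sh; backup_manifest /etc/kubernetes/manifests/kube-apiserver.yaml'
```

The timestamp suffix means repeated edits during a long incident each keep their own copy, so you can diff your way back to the last known-good manifest.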

Once the API server came back up without the webhook anchor around its neck, pods started scheduling. But they were still failing because of the CNI. I looked at the iptables chains. They were a mess. kube-proxy had left behind a graveyard of stale rules from the previous version.

$ iptables -t nat -L KUBE-SERVICES
KUBE-SVC-726H6A6X6X6X6X6X  tcp  --  anywhere             10.96.0.10           /* kube-system/coredns:dns-tcp */ tcp dpt:domain
# ... hundreds of lines of garbage ...

I ran a scorched-earth script to flush the CNI and reset the interface. This is the part they don’t teach you in the “Best Practices” section of the Kubernetes docs.

# The "I give up" script
ip link delete cilium_host
ip link delete cilium_net
ip link delete cilium_vxlan
rm -rf /etc/cni/net.d/*
rm -rf /var/run/cilium
systemctl restart containerd
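With a bit more sleep I would have written a slightly less destructive version. The sketch below moves the CNI configs aside instead of deleting them and only removes interfaces that actually exist. Both helper names and the backup path in the usage line are hypothetical, not part of any tool.

```shell
# Move CNI configs aside instead of deleting them, so they can be diffed
# against whatever gets regenerated.
quarantine_cni_configs() {
  conf_dir="${1:-/etc/cni/net.d}"
  backup_dir="$2"
  mkdir -p "$backup_dir" || return 1
  for f in "$conf_dir"/*; do
    [ -e "$f" ] || continue   # directory was already empty
    mv "$f" "$backup_dir"/ || return 1
  done
}

# Delete a network interface only if it exists, so the cleanup does not
# fail on nodes where Cilium never got as far as creating it.
delete_link_if_present() {
  if ip link show "$1" >/dev/null 2>&1; then
    ip link delete "$1"
  fi
}

# Usage on a node, as root (not run here):
# quarantine_cni_configs /etc/cni/net.d /root/cni-backup
# for l in cilium_host cilium_net cilium_vxlan; do delete_link_if_present "$l"; done
```

Keeping the old configs around is what lets you answer “where did that aws-cni reference come from?” after the fire is out.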

Then, I manually patched the nodes to remove the network-unavailable taint. I knew it was a lie—the network was unavailable—but I needed the Kubelet to stop sulking and try to run the Cilium agent again.

# Remove the taint by key rather than by list position, so any other taints stay put
$ kubectl taint nodes --all node.kubernetes.io/network-unavailable:NoSchedule-

I watched the terminal. cilium-agent pods started transitioning from Pending to Running. I held my breath.

$ kubectl -n kube-system logs -l k8s-app=cilium -f
level=info msg="Successfully restored all endpoints" subsys=daemon
level=info msg="Cluster information updated" subsys=daemon

The nodes started turning Ready. One by one. Like lights flickering on in a dark city.

6. The Final Verdict: How to Use Documentation Without Losing Your Mind

It’s now 48 hours later. The cluster is stable. The webhook is back online (with a 1-second timeout and a very stern warning in the README). I am sitting in a quiet office, staring at the Kubernetes docs again, specifically the page on “Node Lifecycle.”

I’ve come to a conclusion. The Kubernetes docs are not a manual for running Kubernetes. They are a marketing brochure for the idea of Kubernetes. They describe a system that is self-healing, declarative, and “seamless.” They don’t describe the reality of a v1.30.1 control plane choking on a 10-second timeout while etcd loses quorum because of a disk I/O spike.

If you want to survive as an SRE, you have to treat the Kubernetes docs as a starting point, not the source of truth. The source of truth is the code, the journalctl logs, and the raw output of iptables-save.

Here is my cynical guide to using the Kubernetes docs:

  1. Ignore the “Tasks” section. It’s for people who are installing Kubernetes for the first time. If you’re in a production outage, the “Tasks” section is like reading a cookbook while your house is on fire.
  2. Treat the “Reference” section with suspicion. It tells you what a flag is, but it won’t tell you that the flag was deprecated three versions ago and replaced by a hidden field in a ConfigMap.
  3. Search the GitHub Issues, not the Docs. If you’re seeing a weird CNI error, ten other people have seen it too. Their frantic comments on a closed PR from 2022 are worth more than the entire “Concepts” section of the official site.
  4. Read the Source Code. If you’re running v1.30.1, you should have the kubernetes/kubernetes repo cloned locally. When the docs say “The Kubelet does X,” verify it in pkg/kubelet. You’ll be surprised how often “X” is actually “X, but only if Y is true and Z hasn’t timed out.”
  5. Build your own docs. Our internal Wiki now has a page titled “Why the CNI hates us,” which contains the actual commands we used to fix this. It’s three pages of raw terminal commands and zero fluff.

The Kubernetes docs will tell you that the API server is the “brain” of the cluster. What they don’t tell you is that the brain is prone to migraines, and sometimes the only cure is a manual lobotomy and a complete flush of the nervous system.

I’m going to sleep now. If the pager goes off again, I’m throwing it into the ocean. Or better yet, I’ll just link the PagerDuty alert to the “Troubleshooting” page of the Kubernetes docs and see if the cluster can figure it out itself. After all, it’s “self-healing,” right?

$ kubectl get pods -A --no-headers | grep -v Running
# No output.
# Finally.
# Silence.

The terminal cursor blinks. A steady, rhythmic pulse in the dark. It’s the only thing in this entire stack that actually does what it’s supposed to do without needing a 2,000-word explanation or a “comprehensive” guide. It just waits. And so do I. Until the next 3:00 AM alert.
