Incident ID: #8829-OMEGA. Status: Resolved (Barely). Subject: The day the load balancer decided to become a random number generator.
Incident Summary
* Duration: 02:04 UTC to 06:12 UTC (4 hours, 8 minutes).
* Impact: Total loss of ingress traffic for the api.production.internal and checkout.production.internal zones. Estimated revenue loss: $2.1M.
* Root Cause: A “minor” update to the Terraform-managed Nginx Ingress Controller configuration that introduced a malformed proxy_buffer_size value, coupled with an unpatched CVE-2023-44487 (HTTP/2 Rapid Reset) vulnerability that was triggered by the resulting retry storm.
* Versions Involved: Kubernetes v1.29.2, Terraform v1.7.4, Prometheus v2.45, Nginx Ingress Controller v1.9.4.
* Error Codes: HTTP 502 (Bad Gateway), 503 (Service Unavailable), 504 (Gateway Timeout), and a whole lot of ERR_CONNECTION_RESET.
1. The “Minor” Change That Wasn’t
It started at 02:00 UTC. I was finally hitting the deep sleep phase of my night when the pager started vibrating against the nightstand like a caffeinated hornet. Some “DevOps Architect” (a title that usually means “I write YAML but don’t know how a kernel works”) decided that 2:00 AM on a Tuesday was the perfect time to optimize our ingress buffers.
The change was pushed via Terraform v1.7.4. The plan looked “clean” to the reviewer, probably because they were looking at it through bleary eyes or just didn’t care. Here’s what the terraform plan output looked like before the world ended:
  # module.ingress_controller.kubernetes_config_map.nginx_config will be updated in-place
  ~ resource "kubernetes_config_map" "nginx_config" {
        id   = "ingress-nginx/ingress-nginx-controller"
      ~ data = {
          "proxy-body-size"            = "20m"
        ~ "proxy-buffer-size"          = "8k" -> "128k"
          "proxy-buffers-number"       = "4"
          "upstream-keepalive-timeout" = "60s"
        }
    }
The logic was that we were seeing “Header too large” errors on a few edge cases. The “devops best” practice here, according to the internal wiki, was to increase the buffer size. But they didn’t just increase it; they set it to a value that exceeded proxy_busy_buffers_size without updating the latter. Nginx, being the stubborn piece of C code that it is, didn’t complain during the reload. It just started dropping connections like they were hot coals.
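The constraint that bit us is mechanical enough to lint. Nginx requires proxy_busy_buffers_size to be at least as large as proxy_buffer_size, while still leaving at least one proxy buffer free. Here’s a minimal sketch of that check in Python, roughly the logic we later encoded in the OPA rule; the function names are mine, not part of any real tool:

```python
def parse_size(value: str) -> int:
    """Parse an nginx-style size string like '8k' or '1m' into bytes."""
    units = {"k": 1024, "m": 1024 ** 2}
    value = value.strip().lower()
    if value and value[-1] in units:
        return int(value[:-1]) * units[value[-1]]
    return int(value)

def validate_buffers(buffer_size: str, buffers_number: int,
                     busy_buffers_size: str) -> list:
    """Check the nginx proxy buffer constraints and return any violations.

    Assumes proxy_buffers use the same per-buffer size as proxy_buffer_size,
    which is how the ingress-nginx ConfigMap wires them up.
    """
    buf = parse_size(buffer_size)
    busy = parse_size(busy_buffers_size)
    total = buf * buffers_number
    errors = []
    if busy < buf:
        errors.append("proxy_busy_buffers_size must be >= proxy_buffer_size")
    if busy > total - buf:
        errors.append("proxy_busy_buffers_size must leave one buffer free")
    return errors
```

With the incident’s values (a 128k proxy-buffer-size against a stale 8k busy-buffers setting), the first check fires; the change should never have made it past plan.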
Then the pager went off.
$ kubectl get pods -n ingress-nginx
NAME READY STATUS RESTARTS AGE
ingress-nginx-controller-7f8d9b6c5-2wzq8 0/1 CrashLoopBackOff 5 4m
ingress-nginx-controller-7f8d9b6c5-4kml2 0/1 Error 5 4m
ingress-nginx-controller-7f8d9b6c5-9p0rs 1/1 Running 0 4m
One pod stayed “Running” but was essentially a black hole. The others were stuck in a restart loop because the liveness probe was hitting a 502. We were effectively dark.
2. Cascading Failures and the Myth of Isolation
By 02:15 UTC, the ingress failure had triggered a massive retry storm. Because our frontend services (running on Kubernetes v1.29.2) didn’t have proper exponential backoff implemented in their client-side logic, they began hammering the ingress endpoints.
This is where the theory of “microservices” falls apart and reveals the “distributed monolith” we’ve actually built. Our checkout-service depends on the inventory-service, which depends on the pricing-service, which depends on a legacy oracle-db-connector that someone wrote in 2014 and we’re all too scared to touch.
When the ingress started failing, the checkout-service didn’t just fail gracefully. It held onto its database connections while waiting for the inventory-service to respond. The inventory-service was busy retrying its own calls. Within ten minutes, the connection pools were saturated.
I ran a quick check on the logs for the checkout-service:
$ journalctl -u checkout-service.service --since "02:10" | grep "ConnectionPoolTimeoutException" | wc -l
14502
Fourteen thousand timeouts in five minutes. The “devops best” approach of using service meshes like Istio v1.20 was supposed to prevent this with circuit breakers. But guess what? The circuit breakers were configured with “default” values that were too high to actually trip before the underlying node ran out of ephemeral ports. We weren’t isolated. We were tied together in a suicide pact.
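The core of a circuit breaker is small enough that “we’ll tune the mesh defaults later” is not an excuse. A minimal count-based sketch, with a pluggable clock so the trip/reset behavior is testable (the class and thresholds are illustrative, not Istio’s implementation):

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow a half-open
    probe once `reset_after` seconds have passed since the trip."""

    def __init__(self, threshold=5, reset_after=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        """Return True if a call may proceed."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.reset_after:
            return True  # half-open: let one probe request through
        return False

    def record(self, ok: bool) -> None:
        """Record the outcome of a call."""
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # trip the breaker
```

The point of our incident: if `threshold` is higher than the number of connections a node can have in flight before exhausting ephemeral ports, the breaker is decorative. It has to trip before the resource it protects runs out, not after.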
3. The Distributed Monolith: A Suicide Pact in YAML
By 03:00 UTC, the entire cluster was a graveyard. I was looking at Prometheus v2.45 metrics, and the graphs looked like a heart attack. Memory usage on the nodes was spiking because of the sheer volume of buffered requests that were never going to be fulfilled.
We talk about “decoupling,” but we’ve just moved the coupling from the binary level to the network level. Every single one of our 40+ microservices was trying to talk to each other over a network that was currently being flooded by Nginx trying to figure out why its buffers were misaligned.
I tried to scale the deployment to see if more pods would help. Spoiler: It didn’t.
$ kubectl scale deployment checkout-service --replicas=50
deployment.apps/checkout-service scaled
$ kubectl get pods -w
NAME READY STATUS RESTARTS AGE
checkout-service-8495f6b87c-abc12 0/1 Pending 0 10s
checkout-service-8495f6b87c-def34 0/1 Pending 0 10s
The pods stayed in Pending. Why? Because the “minor config change” had also somehow messed with the resource quotas in the namespace, or more likely, we had hit the maximum number of ENIs (Elastic Network Interfaces) on our AWS nodes. We were scaling into a brick wall. This is the reality of the “distributed monolith.” You don’t get the benefits of a monolith (simplicity, local calls), and you don’t get the benefits of microservices (isolation, independent scaling). You just get the complexity of both.
4. Shift Left, Fall Flat: The CVE We Ignored
While we were fighting the ingress fire, something else started happening. Our security dashboard (which everyone ignores until a P0 happens) started lighting up. Because the ingress was unstable, it was vulnerable to a specific type of resource exhaustion.
Enter CVE-2023-44487—the HTTP/2 Rapid Reset attack. We had “Shifted Left” our security by integrating scanners into the CI/CD pipeline, but the “devops best” practice of “failing the build on high vulnerabilities” had been disabled for the ingress controller because “we need to ship features, not fix infrastructure.”
The ingress controller (Nginx v1.9.4) was vulnerable. A botnet—likely automated and scanning for exactly this kind of instability—detected our flapping ingress and started a Rapid Reset attack. This wasn’t a targeted hit; it was opportunistic predation.
The logs were a nightmare:
2024/05/14 03:22:11 [error] 45#45: *1209342 stream 13579 reset: error code 7 while processing request, client: 192.168.1.1, server: api.production.internal
2024/05/14 03:22:11 [error] 45#45: *1209343 stream 13581 reset: error code 7 while processing request, client: 192.168.1.1, server: api.production.internal
The “Shift Left” philosophy failed because it was treated as a checkbox, not a culture. We had the data. We knew the version was vulnerable. But because the “devops best” practitioners were too focused on “velocity,” they ignored the technical debt. Now, that debt was being collected with 200% interest. The Rapid Reset attack was consuming 100% of the CPU on the remaining healthy ingress pods, making it impossible for us to even exec into them to debug.
5. Observability is Not a Dashboard, It’s a Crime Scene
By 04:30 UTC, the CTO was on the bridge call asking for “ETA on resolution.” I wanted to tell him the ETA was “whenever we stop pretending that YAML is a substitute for engineering,” but I just grunted and kept typing.
Our observability stack was also failing. Prometheus was struggling to scrape targets because the network was saturated. Grafana was showing “No Data” for half the panels. This is the irony of modern SRE work: the tools you use to fix the system are the first things to break when the system actually fails.
I had to go old school. I bypassed the dashboards and went straight to the nodes.
$ ssh node-01.prod.internal
$ sudo tcpdump -i eth0 port 80 -c 100
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
04:45:12.123456 IP 10.0.1.5.44322 > 10.0.1.10.80: Flags [S], seq 12345678, win 64240, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
04:45:12.123510 IP 10.0.1.10.80 > 10.0.1.5.44322: Flags [R.], seq 0, ack 12345679, win 0, length 0
The [R.] flag. Connection reset. The ingress was refusing everything. Not because it was overloaded, but because the proxy-buffer-size mismatch had corrupted the internal state of the worker processes. Every time a request came in that required a buffer larger than the default but smaller than the “new” limit, Nginx would segfault or reset the connection.
Here’s where the theory hit the reality of a saturated disk. The Nginx pods were trying to write error logs to /var/log/nginx/error.log, but because the error rate was so high, the emptyDir volume we used for logs filled up.
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p1 20G 20G 0G 100% /var/lib/kubelet/pods/.../volumes/kubernetes.io~empty-dir/logs
When the disk filled up, the ingress controller couldn’t even start. It would crash immediately on startup because it couldn’t open the log file. We were in a circular dependency of failure.
6. The Remediation Plan: Fixing the Culture, Not Just the Code
We finally got the system back online at 06:12 UTC by manually reverting the ConfigMap and nuking the entire ingress-nginx namespace to force a clean slate. We had to bypass the CI/CD pipeline because the Jenkins runner was—you guessed it—stuck in a Pending state because the cluster was full.
Here is the actual remediation plan. Not the “thought leadership” version, but the one written in the blood of a 48-hour shift.
Phase 1: Immediate Technical Debt Liquidation
- Enforce Buffer Symmetry: Any change to proxy-buffer-size must be validated against proxy_busy_buffers_size and proxy_buffers at the linting stage. We are adding a custom OPA (Open Policy Agent) rule to the “devops best” pipeline to prevent this specific Terraform configuration from ever being applied again.
- Patch the CVEs: We are moving from Nginx Ingress v1.9.4 to v1.10.1 immediately. No exceptions. If a service breaks because of the upgrade, the service is what’s broken, not the upgrade.
- Log Rotation and Limits: No more emptyDir for logs without a sizeLimit. Every sidecar and ingress pod will have a strict 1GB limit on log volumes. If the logs exceed that, they get rotated or dropped. A dropped log is better than a dropped cluster.
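For reference, the sizeLimit fix is one field. A hypothetical pod spec fragment (volume name is illustrative); when the limit is hit, the kubelet evicts the pod instead of letting it eat the node’s disk:

```yaml
# Cap the log volume so a log flood can't fill the node filesystem.
volumes:
  - name: logs
    emptyDir:
      sizeLimit: 1Gi
```

Eviction of one pod is recoverable; a node with a full disk, as we learned, is a circular dependency of failure.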
Phase 2: Dismantling the Distributed Monolith
- Hard Timeouts and Retries: Every service-to-service call must implement a strict timeout (max 2 seconds) and a maximum of 2 retries with exponential backoff. If you don’t have this, your code doesn’t go to production.
- Circuit Breaker Validation: We are running “Chaos Wednesdays.” We will manually trip circuit breakers in the staging environment to ensure that when pricing-service goes down, checkout-service still allows users to see their cart, even if the prices are slightly stale.
- Dependency Mapping: We need a real-time graph of service dependencies. If a “minor” change in the ingress can take down the database connector, we don’t have microservices; we have a very expensive, very slow monolith.
Phase 3: Redefining “DevOps Best” Practices
The term “devops best” has become a shield for mediocrity. We need to stop focusing on “velocity” and start focusing on “resilience.”
1. Mandatory Post-Mortems: Every P0 incident requires a post-mortem where the person who pushed the change explains the failure to the entire engineering org. Not to shame them, but to ensure the “scars” are shared.
2. Infrastructure as Code is Code: We need to treat Terraform with the same rigor as Java or Go. That means unit tests for modules, integration tests in a sandbox environment, and a “stop the line” mentality when a linting rule fails.
3. Observability for Humans: We are deleting 50% of our Grafana dashboards. They are noise. We will focus on the “Four Golden Signals”: Latency, Traffic, Errors, and Saturation. If a dashboard doesn’t help me find a root cause in five minutes at 3:00 AM, it’s garbage.
Then the pager went off again. It was a low-priority alert for a staging environment. I silenced it, finished my fourth cup of black coffee, and started writing this.
We didn’t just lose $2M. We lost the trust of our customers and the sanity of the SRE team. If you want to follow “devops best” practices, start by respecting the complexity of the systems you build. Stop treating your infrastructure like a playground and start treating it like the mission-critical foundation it is.
Now, if you’ll excuse me, I’m going to go sleep for fourteen hours. Don’t call me unless the data center is literally on fire. And even then, check the “devops best” wiki first. It probably says to use a fire extinguisher with at least v2.1.0 of the safety pin.
Final Status: Incident #8829-OMEGA closed. Root cause: Hubris. Remediation: Reality.