# 10 DevOps Best Practices to Streamline Your Workflow

```text
stderr: 2024-05-14T03:14:02.882Z ERROR [service-api-gateway] [go-routine-442] connection reset by peer: context deadline exceeded (client_timeout=5000ms)
stderr: 2024-05-14T03:14:02.883Z FATAL [service-api-gateway] [main] failed to initialize upstream circuit breaker: redis connection refused at 10.96.0.15:6379
stderr: 2024-05-14T03:14:03.101Z INFO  [kernel] [oom-killer] invoked: oom_score_adj=999, total_vm=18446744, anon_rss=16223412, file_rss=0, shmem_rss=0
$ kubectl get pods -n production -l app=order-processor
NAME                               READY   STATUS             RESTARTS      AGE
order-processor-7f8d9b6c5d-2v8xl   0/1     CrashLoopBackOff   14 (2m ago)   42m
order-processor-7f8d9b6c5d-4k2p9   0/1     Terminating        0             11s
order-processor-7f8d9b6c5d-9z1q3   0/1     OOMKilled          3 (5m ago)    18m
order-processor-7f8d9b6c5d-m5n6b   1/1     Running            0             4s
order-processor-7f8d9b6c5d-p0l9k   0/1     Pending            0             1s
$ _
```

The smell of a server room is a lie. Most of you work in "the cloud," which means the only thing you smell during a 48-hour outage is the stale, metallic scent of your own sweat, the ozone coming off an overtaxed MacBook Pro M2, and the cold dregs of a fourth double-espresso that lost its heat three hours ago. It was 03:14 AM on a Tuesday when PagerDuty decided my sleep was less important than a cascading failure in our Kubernetes v1.28.2 cluster. 

The logs above weren't just errors; they were the first gasps of a dying system. We had spent six months implementing "devops best" practices—or at least, the version of them sold by consultants who haven't touched a production terminal since the Obama administration. We had the service mesh, the sidecars, the GitOps pipelines, and the "self-healing" infrastructure. And yet, there I was, staring at a terminal screen that looked like a digital graveyard.

## The Fallacy of "Self-Healing" Infrastructure

The marketing brochures tell you that Kubernetes is "self-healing." They say that if a pod dies, the Controller Manager will just spin up another one. It sounds like magic. In reality, "self-healing" is just a fancy way of saying "infinite feedback loop of death." 

Our `order-processor` service, written in Go 1.21.3, started hitting a memory limit. Why? Because some "thought leader" decided that we should use a dynamic cache that scales based on incoming request volume without a hard cap. When the cache hit the limit, the OOM (Out of Memory) killer stepped in. Kubernetes, doing exactly what the "devops best" guides told it to do, saw the pod die and immediately tried to restart it.

But here’s the kicker: the new pod tried to rehydrate its cache from PostgreSQL 15.4 upon startup. This put a massive spike on the DB. The DB slowed down. The other 40 pods in the cluster saw the latency increase and their own internal queues started backing up, consuming more memory. Within ten minutes, the entire cluster was a churning mess of `CrashLoopBackOff` and `Pending` states. The "self-healing" mechanism was just pouring gasoline on a forest fire.

```bash
$ strace -p 12044 -e network
[pid 12044] connect(12, {sa_family=AF_INET, sin_port=htons(5432), sin_addr=inet_addr("10.0.45.12")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid 12044] select(13, NULL, [12], NULL, {tv_sec=0, tv_usec=500000}) = 0 (Timeout)
[pid 12044] close(12) = 0
[pid 12044] write(2, "DB Connection Timeout\n", 22) = 22
```

The strace output told the real story. The application wasn’t failing because of bad logic; it was failing because the network stack was screaming for mercy. We had so many “sidecars” from Istio 1.18 injecting iptables rules that the kernel was spending more time traversing NAT tables than actually routing packets.

## YAML Indentation: The Silent Killer

We managed our infrastructure using Terraform v1.5.7. We were told that “Infrastructure as Code” would prevent configuration drift. What they didn’t tell us is that it also allows you to codify a disaster and deploy it to three regions simultaneously with a single git push.

The “devops best” approach we followed involved a complex hierarchy of Terraform modules. One junior dev, trying to be helpful, updated the Horizontal Pod Autoscaler (HPA) configuration. They thought they were optimizing cost. Instead, they introduced a logic error in the config that Terraform happily applied because, technically, the syntax was valid.

# The "Fix" that broke the world
resource "kubernetes_horizontal_pod_autoscaler_v2" "order_processor_hpa" {
  metadata {
    name      = "order-processor-hpa"
    namespace = "production"
  }

  spec {
    max_replicas = 500 # "Devops best" says scale wide, right?
    min_replicas = 10

    metrics {
      type = "Resource"
      resource {
        name   = "cpu"
        target {
          type                = "Utilization"
          average_utilization = 10 # This is where the nightmare started
        }
      }
    }
  }
}

Look at that `average_utilization = 10`. The idea was to keep the pods “cool.” But in a K8s v1.28.2 environment, setting a 10% CPU utilization target for an HPA on a service with a high startup cost is a death sentence. As soon as a pod started up and did its initial Go runtime initialization, it spiked past 10%. The HPA saw this and immediately triggered more pods. Those pods started up, spiked, and triggered even more pods.
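The runaway is pure arithmetic. The HPA's scaling rule, per the Kubernetes docs, is `desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization)`. Plug in a startup CPU spike of, say, 80% against the 10% target — the 80% figure is illustrative, not from our metrics — and ignore the stabilization dampening the HPA applies between sync periods:

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas implements the HPA scaling rule from the Kubernetes docs:
// desired = ceil(current * currentUtilization / targetUtilization).
func desiredReplicas(current int, currentUtil, targetUtil float64) int {
	return int(math.Ceil(float64(current) * currentUtil / targetUtil))
}

func main() {
	// Pods spiking to 80% CPU during Go runtime init, against a 10% target:
	replicas := 10
	for i := 0; i < 3; i++ {
		replicas = desiredReplicas(replicas, 80, 10)
		fmt.Println(replicas)
	}
	// 10 -> 80 -> 640 -> 5120: three sync periods takes you from
	// min_replicas to an order of magnitude past max_replicas = 500
	// (the cap clamps it, but by then the damage is done).
}
```

Each new pod spikes during startup, which raises the observed utilization, which demands more pods. The loop only stops when you run out of cluster — or, in our case, out of EC2 service limits.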

We hit our AWS EC2 service limits in four minutes. Terraform v1.5.7 tried to reconcile the state, but because the AWS API was rate-limiting us due to the sheer volume of RunInstances calls, the Terraform state file became locked and corrupted. We couldn’t scale down. We couldn’t scale up. We were stuck in a state of permanent, expensive explosion.

## The Istio Sidecar Tax and the Death of Latency

By hour 14, we were deep in the guts of Istio 1.18. If you want to take a simple network problem and turn it into a PhD-level research project, install a service mesh. We were told Istio would give us “observability.” What it actually gave us was 15ms of overhead on every single internal hop and a debugging nightmare that made me want to go back to writing COBOL on a mainframe.

The Envoy proxies were failing their readiness probes because the control plane (istiod) was overwhelmed by the churn of pods being created and destroyed by the rogue HPA. When an Envoy proxy isn’t “ready,” it stops passing traffic. But the application inside the pod is still “running.”

So, we had a situation where kubectl said the pods were Running, but no traffic was getting through because the sidecar was pouting in the corner. We had followed the “devops best” advice of “mutual TLS everywhere,” which meant we couldn’t even easily sniff the traffic with tcpdump to see what was happening. We were blind, flying a plane that was actively shedding its wings, while the “observability” dashboard showed everything was “Green” because the dashboard itself couldn’t reach the metrics exporter.

## Terraform 1.5.7 and the State-File Suicide Pact

When we finally tried to manually intervene and kill the HPA, Terraform decided to remind us who was really in charge. Because we were using a remote S3 backend for our state file with DynamoDB locking, and because our NAT gateways were saturated, Terraform couldn’t release the lock.

```text
Error: Error acquiring the state lock
Error message: ConditionalCheckFailedException: The conditional request failed
Lock Info:
  ID:        7b2345-e123-4456-8890-abcdef123456
  Path:      prod/infrastructure.tfstate
  Operation: OperationTypeApply
  Who:       jenkins-worker-01
  Version:   1.5.7
  Created:   2024-05-14 03:15:22.123456 +0000 UTC
  Info:
```

I had to manually go into the AWS Console—the ultimate walk of shame for an SRE—to delete a DynamoDB entry just so I could run a command to stop the bleeding. This is the “devops best” reality: you spend three hours fighting your automation tools just so you can spend five minutes fixing the actual problem. The tools that were supposed to save us time had become the very obstacles preventing us from restoring service.

We had built a “modern” stack that was so brittle that a single YAML indentation error or a minor network hiccup could trigger a global collapse. We valued the “purity” of our GitOps workflow over the actual stability of the platform. We were so obsessed with “immutable infrastructure” that we forgot how to actually log into a box and fix a configuration file.

## CoreDNS: Where Packets Go to Die

By hour 26, the database was back up, the HPA was deleted, and the Terraform lock was cleared. But the application still couldn’t talk to the database. Why? Because CoreDNS 1.10.1 had decided to give up on life.

In Kubernetes, every time a service tries to resolve a hostname, it goes through CoreDNS. With the default `ndots:5` setting in the pod's `/etc/resolv.conf` (which is the K8s default), any name with fewer than five dots gets the search domains appended before it is tried as-is — and `postgres.production.svc.cluster.local` only has four:

  1. `postgres.production.svc.cluster.local.production.svc.cluster.local` (Fail)
  2. `postgres.production.svc.cluster.local.svc.cluster.local` (Fail)
  3. `postgres.production.svc.cluster.local.cluster.local` (Fail)
  4. `postgres.production.svc.cluster.local` (Success… eventually)
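The expansion rule itself is simple enough to sketch. This is an approximation of the glibc-style resolver behavior, not CoreDNS internals: if the name has fewer dots than `ndots`, the search domains are tried first, and a trailing dot makes the name absolute and skips the whole circus:

```go
package main

import (
	"fmt"
	"strings"
)

// lookupOrder approximates resolver behavior under ndots: if a name has
// fewer dots than ndots, each search domain is tried before the name
// itself. A trailing dot marks the name absolute and skips the search list.
func lookupOrder(name string, ndots int, search []string) []string {
	if strings.HasSuffix(name, ".") {
		return []string{name}
	}
	var queries []string
	if strings.Count(name, ".") < ndots {
		for _, s := range search {
			queries = append(queries, name+"."+s)
		}
	}
	return append(queries, name)
}

func main() {
	// The default search list for a pod in the "production" namespace.
	search := []string{"production.svc.cluster.local", "svc.cluster.local", "cluster.local"}
	for _, q := range lookupOrder("postgres.production.svc.cluster.local", 5, search) {
		fmt.Println(q)
	}
	// Three junk queries before the real one. Writing the name as
	// "postgres.production.svc.cluster.local." (trailing dot) would
	// go straight to the answer.
}
```

Multiply those junk queries by forty pods in a crash loop and you have a DNS flood you built yourself.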

Under the massive load of our “self-healing” restart loop, CoreDNS was being hammered with millions of junk queries. It reached its internal cache limit and started dropping UDP packets. Because UDP is connectionless, the application just sat there waiting for a response that was never coming.

I spent six hours tuning `net.core.somaxconn` and `net.ipv4.tcp_max_syn_backlog` on the worker nodes, trying to give the kernel enough breathing room to handle the DNS flood.

```bash
# Tuning the kernel while the world burns
sysctl -w net.core.somaxconn=4096
sysctl -w net.ipv4.tcp_max_syn_backlog=8192
sysctl -w net.ipv4.ip_local_port_range="1024 65535"
```

This isn’t “DevOps.” This is digital archaeology. You’re digging through layers of abstraction—Docker, Kubernetes, Containerd, the Linux Kernel—trying to find the one tiny setting that some “best practice” guide forgot to mention. We had followed all the “devops best” advice on how to structure our namespaces, but nobody told us that the default DNS configuration in K8s is essentially a self-inflicted Denial of Service attack at scale.

## The Human Defragmentation: 48 Hours of Toil

The “human cost” is a phrase people like to use in HR meetings, but they don’t know what it means. It means watching your lead developer start crying in a Zoom call because they’ve been awake for 36 hours and they just accidentally deleted the wrong S3 bucket. It means the “error budget” isn’t a number on a Grafana dashboard; it’s the amount of sanity you have left before you quit the industry to go farm goats in Vermont.

We talk about MTTR (Mean Time To Recovery) like it’s a metric to be optimized for a quarterly review. But when you’re in the middle of it, MTTR is just a measure of how long you can ignore your family while you stare at a `kubectl logs -f` output.

The “devops best” culture encourages this by rewarding “firefighters.” We celebrate the person who stayed up all night to fix the cluster, but we never ask why the cluster was so easy to break in the first place. We’ve built systems that are so complex that no single human can understand the entire request path. From the CloudFront edge to the ALB, to the Istio Ingress, to the Envoy Sidecar, to the Go binary, to the PostgreSQL driver, to the actual disk—there are a thousand points of failure, and we’ve automated them all to fail simultaneously.

## The Heartbreak of TCP Keep-Alives

If you really want to know what’s wrong with modern infrastructure, look at TCP keep-alives. We had a specific issue where connections between our service and the database were being silently dropped by the AWS Network Load Balancer (NLB) because they were idle for more than 350 seconds.

The “devops best” advice for our database pool (using jackc/pgx in Go) was to keep a large number of idle connections ready to handle traffic spikes. But the NLB doesn’t care about your “best practices.” It sees an idle connection and it kills it. However, it doesn’t send a FIN or RST packet to the client. The client thinks the connection is still open.

So, when the application finally tries to use that connection, it sends a packet into the void. It waits. And waits. Until the kernel’s TCP timeout kicks in, which, by default on Ubuntu 22.04 LTS, is way too long.

```bash
# The settings that actually saved us, not the YAML
cat /proc/sys/net/ipv4/tcp_keepalive_time      # Default 7200
cat /proc/sys/net/ipv4/tcp_keepalive_intvl     # Default 75
cat /proc/sys/net/ipv4/tcp_keepalive_probes    # Default 9
```

We had to aggressively tune these down. We had to force the kernel to realize the connection was dead in seconds, not hours.

```bash
sysctl -w net.ipv4.tcp_keepalive_time=60
sysctl -w net.ipv4.tcp_keepalive_intvl=10
sysctl -w net.ipv4.tcp_keepalive_probes=6
```
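The arithmetic behind those numbers: the kernel waits `tcp_keepalive_time` seconds of idle before the first probe, then sends up to `tcp_keepalive_probes` probes spaced `tcp_keepalive_intvl` seconds apart before declaring the peer dead. A quick back-of-the-envelope check:

```go
package main

import "fmt"

// Worst-case seconds before the kernel declares a silent peer dead:
// the idle period before the first probe, plus one interval per
// unanswered probe.
func deadPeerDetection(keepaliveTime, intvl, probes int) int {
	return keepaliveTime + intvl*probes
}

func main() {
	fmt.Println(deadPeerDetection(7200, 75, 9)) // Ubuntu defaults: 7875s, about 2h11m
	fmt.Println(deadPeerDetection(60, 10, 6))   // our tuned values: 120s
}
```

Two hours down to two minutes, with three sysctl lines and zero Helm charts.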

Why isn’t this in the “devops best” guides? Because it’s not sexy. It’s not a new tool you can install with a Helm chart. It’s just boring, old-school systems engineering. But in a world of “Cloud Native” abstractions, we’ve forgotten that underneath it all, it’s still just Linux kernels talking to each other over a lossy network.

We spent 48 hours rediscovering the fundamentals of networking because our high-level abstractions had failed us. We had built a cathedral of YAML on a foundation of sand. We used Istio 1.18 to manage “traffic shifting” for canary deployments we weren’t even doing yet, while our basic TCP stack was misconfigured. We used Terraform 1.5.7 to manage “multi-region failover” while our single-region DNS was collapsing under its own weight.

The “blast radius” of our failure was massive not because the initial error was big, but because our “devops best” automation ensured that every small error was amplified and distributed across the entire system. We had automated our own destruction.

By the time the last service was stable and the last PagerDuty alert was acknowledged, I didn’t feel a sense of accomplishment. I felt a profound sense of exhaustion and a deep, burning cynicism toward the next person who tries to sell me a “seamless” way to manage microservices.

We don’t need more tools. We don’t need more “best practices” from people who have never had to debug a kernel panic at 4 AM. We need simplicity. We need systems that fail gracefully instead of “healing” themselves into a coma. We need to stop pretending that adding another layer of abstraction is the solution to the problems caused by the previous layer of abstraction.

The next time you see a blog post about the “top 10 devops best practices for 2024,” do yourself a favor: close the tab, open a terminal, and check your TCP timeouts. It might just save your sleep.

DevOps isn’t a job title, it’s a suicide note written in YAML.
