10 Essential DevOps Best Practices for Faster Delivery

It’s 4:12 AM. I’ve consumed nothing but lukewarm espresso and the bitter taste of a corrupted etcd cluster. My eyes feel like they’ve been scrubbed with industrial-grade sandpaper. The hum of the data center fans—even though I’m working remotely—is a phantom vibration in my skull. 72 hours. That’s how long it took to realize that our “resilient, cloud-native architecture” was actually just three raccoons in a trench coat holding a “Certified Kubernetes Administrator” certificate.

The industry is obsessed with “devops best” practices. We talk about them at conferences while sipping craft beer. We write white papers about them. But when the cascading failure starts, those best practices feel like bringing a toothpick to a supernova.

-- Logs begin at Thu 2024-05-23 01:14:02 UTC. --
May 23 04:01:12 k8s-master-01 etcd[1242]: store.index: error getting key: [wal: snapshot not found]
May 23 04:01:12 k8s-master-01 etcd[1242]: raft.node: 7c8f9b2e1a4d5c6f lost leader
May 23 04:01:13 k8s-master-01 kube-apiserver[1582]: E0523 04:01:13.124 filter.go:187] Request error: etcdserver: request timed out
May 23 04:01:15 k8s-master-01 systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
May 23 04:01:15 k8s-master-01 systemd[1]: etcd.service: Failed with result 'exit-code'.
May 23 04:01:18 k8s-master-01 journalctl[2104]: [CRITICAL] Cluster partition detected. Quorum lost.
May 23 04:01:20 k8s-master-01 kubelet[1902]: E0523 04:01:20.442 pod_workers.go:1294] "Error syncing pod" err="failed to "StartContainer" for "istio-proxy" with CrashLoopBackOff"

The 3 AM PagerDuty Scream

The sound of PagerDuty at 3 AM isn’t just an alert; it’s a physical assault. It’s the sound of your weekend dying and your sanity fraying. We were running Kubernetes v1.29.2. We thought we were safe. We had the latest patches. We had Helm 3.14 managing our releases. We followed the “devops best” guide for high availability.

The alert was simple: KubeAPILatencyHigh. Then KubeClientErrors. Then silence. The silence is worse. It means the monitoring system has stopped being able to talk to the cluster. When the watcher can’t watch, you’re flying a 747 in a storm with no cockpit lights.

I logged in to find the API server gasping for air. The culprit? A “minor” configuration change in our Istio 1.20 service mesh. A junior dev—bless his heart—tried to optimize the sidecar resources. He thought he was reducing toil. Instead, he triggered a race condition in the sidecar injection logic that caused every new pod to hang during the init phase. Because we had a “devops best” practice of aggressive auto-scaling, the cluster saw the hanging pods, thought the nodes were unhealthy, and started killing healthy pods to move them.

It was a digital circular firing squad.

The Fallacy of the “Golden Signal”

We worship at the altar of the Four Golden Signals: Latency, Traffic, Errors, and Saturation. But here’s the truth: signals lie. Your Prometheus dashboard can show 0% error rates while your entire database layer is silently corrupting blocks because the underlying EBS volume is hitting an IOPS limit that isn’t being reported correctly through the abstraction layers.

I spent four hours staring at this PromQL query, trying to understand why our ingress was 504ing:

histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])) by (le, destination_workload))

The latency looked fine. Why? Because the requests weren’t even reaching the destination. They were dying in the istio-proxy sidecar. The “devops best” approach to monitoring often focuses on the application layer while ignoring the plumbing. I had to drop down to the kernel level. I had to look at net.ipv4.tcp_tw_reuse.
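What would have saved me those four hours is comparing what the calling sidecars say they sent against what the destination sidecars say they received. Here is a rough sketch of that comparison, assuming you have already pulled request counts for the same time window from istio_requests_total, once with reporter="source" and once with reporter="destination". The function names and the 5% threshold are mine, made up for illustration, not something from our dashboards.

```python
def undelivered_fraction(source_count: float, destination_count: float) -> float:
    """Fraction of requests the callers report sending that the
    destination side never reported receiving."""
    if source_count <= 0:
        return 0.0
    missing = max(source_count - destination_count, 0.0)
    return missing / source_count


def requests_dying_in_transit(source_count: float,
                              destination_count: float,
                              threshold: float = 0.05) -> bool:
    """True when more than `threshold` of the sent requests went missing.
    The 5% default is an arbitrary illustrative value."""
    return undelivered_fraction(source_count, destination_count) > threshold


if __name__ == "__main__":
    # Callers sent 1000 requests, the destination only saw 600.
    print(undelivered_fraction(1000, 600))       # 0.4
    print(requests_dying_in_transit(1000, 600))  # True
```

A destination-side latency histogram can only describe the requests that arrived. The gap between the two reporters is where the dead ones hide.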

In a high-churn environment like Kubernetes v1.29.2, especially with Istio 1.20’s aggressive connection pooling, you will exhaust your ephemeral ports faster than you can say “microservices.” Closed sockets sit in TIME_WAIT for 60 seconds, and unless net.ipv4.tcp_tw_reuse is set to 1, the kernel won’t hand those ports to new outbound connections. With 5,000 requests per second, you’re dead in minutes. We were following “devops best” practices for container security, which meant our containers were unprivileged and couldn’t modify kernel parameters. A security win, a production catastrophe.
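The arithmetic is worth doing once on paper. This is a back-of-the-envelope sketch under worst-case assumptions that are mine: every request opens a fresh outbound connection, all of them go from one source IP to one destination IP and port, and nothing gets reused. The default range of 32768 to 60999 is the stock Linux ip_local_port_range. Keep-alives and multiple destinations will stretch the numbers considerably.

```python
def seconds_until_port_exhaustion(new_conns_per_sec: float,
                                  port_low: int = 32768,
                                  port_high: int = 60999) -> float:
    """Time until every ephemeral port has been used once,
    assuming no port is freed or reused in the meantime."""
    ports = port_high - port_low + 1
    return ports / new_conns_per_sec


def steady_state_time_wait(new_conns_per_sec: float,
                           time_wait_secs: int = 60) -> float:
    """Sockets that would be parked in TIME_WAIT at steady state
    if every connection lingered for the full TIME_WAIT period."""
    return new_conns_per_sec * time_wait_secs


if __name__ == "__main__":
    print(seconds_until_port_exhaustion(5000))               # about 5.6 seconds
    print(seconds_until_port_exhaustion(5000, 1024, 65535))  # about 12.9 seconds
    print(steady_state_time_wait(5000))                      # 300000 sockets wanted
```

Under those assumptions the stock range holds 28,232 ports against a demand for 300,000, and widening the range barely moves the needle. That gap is why reuse matters more than range.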

The Helm Chart of Doom and the YAML Abyss

Helm 3.14 is a powerful tool. It’s also a loaded shotgun aimed at your foot. We use Helm to package everything. Our “devops best” CI/CD pipeline automatically triggers a helm upgrade --install on every merge to main.

The failed deployment was a classic. A nested template in the values.yaml had a typo. Not a syntax error—Helm would have caught that. It was a logic error. A resource limit was set to 100m instead of 1000m for the core authentication service.

$ kubectl get pods -n kube-system
NAME                                         READY   STATUS             RESTARTS         AGE
coredns-78fcdf6894-2bz77                     1/1     Running            0                42h
etcd-k8s-master-01                           0/1     CrashLoopBackOff   142 (3m ago)     12h
istio-ingressgateway-5f66c66477-8vj9z        0/1     Running            0                14m
kube-apiserver-k8s-master-01                 1/1     Running            12               12h
kube-controller-manager-k8s-master-01        1/1     Running            5                12h
kube-proxy-l9w2v                             1/1     Running            0                42h
kube-scheduler-k8s-master-01                 1/1     Running            8                12h
auth-service-v2-77899658c-x92ll              0/1     OOMKilled          5 (2m ago)       10m
auth-service-v2-77899658c-z4pqr              0/1     OOMKilled          5 (2m ago)       10m

The OOMKills started immediately. But because of our “devops best” rolling update strategy, the deployment didn’t fail immediately. It slowly replaced healthy pods with these starved ones. By the time the health checks started failing, the “blast radius” had covered 80% of our traffic.
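A typo like 100m versus 1000m is valid YAML and a valid Kubernetes quantity, so nothing in the toolchain objects. The only defense is a dumb policy check that runs before the merge. Here is a minimal sketch, assuming you can extract the CPU limit string from the rendered manifest yourself. The function names and the 500m floor in the example are hypothetical, not our real policy. It only understands plain core counts and the millicore "m" suffix.

```python
from decimal import Decimal


def cpu_quantity_to_cores(quantity: str) -> Decimal:
    """Convert a Kubernetes CPU quantity such as "100m" or "1.5" to cores."""
    q = quantity.strip()
    if q.endswith("m"):
        # "m" means millicores: 1000m == 1 core
        return Decimal(q[:-1]) / 1000
    return Decimal(q)


def violates_floor(limit: str, floor: str) -> bool:
    """True when the requested limit is below the per-service minimum."""
    return cpu_quantity_to_cores(limit) < cpu_quantity_to_cores(floor)


if __name__ == "__main__":
    print(cpu_quantity_to_cores("100m"))     # 0.1
    print(violates_floor("100m", "500m"))    # True: block the merge
    print(violates_floor("1000m", "500m"))   # False
```

It will not catch every bad value, but it turns "one missing zero" from an outage into a failed pipeline step.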

We tried to roll back. But wait! The etcd cluster was already under heavy load from the constant pod churning. When Helm tried to query the release history to perform the rollback, etcd lost quorum. The state was locked. We couldn’t deploy, we couldn’t roll back, and we couldn’t even see what was running.

Terraform Drift and the Death of State

While the cluster was burning, I tried to scale the underlying node group using Terraform. We’re supposed to have “Infrastructure as Code.” That’s the “devops best” way, right?

I ran the plan. It was a nightmare.

$ terraform plan
Error: Provider produced inconsistent final plan

When expanding the plan for module.eks.aws_eks_node_group.workers[0], the 
provider allowed the plan to include values that were not in the 
configuration. This is a bug in the provider, or the state has drifted 
beyond recovery.

Objects have changed outside of Terraform:
  # module.eks.aws_eks_node_group.workers[0] has been changed 
  ~ resources {
      - remote_access_security_group_id = "sg-0a1b2c3d4e5f6g7h8" -> null
    }

You didn't use the -out option to save this plan, so Terraform
can't guarantee to take exactly these actions if you run "terraform apply" now.

Drift. The silent killer. Someone—probably during the last incident—had manually tweaked the security groups in the AWS console to “just make it work.” Now, in the middle of a 72-hour war room, Terraform wanted to destroy and recreate the entire node group to fix the drift.
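If you want a tripwire for this, save the plan, dump it as JSON, and refuse to apply anything that deletes a resource without a human looking at it first. This is a rough sketch, assuming the plan was exported with terraform plan -out=tfplan followed by terraform show -json tfplan > plan.json. It reads the resource_changes list from that JSON. The function name and the plan.json filename are my own choices for illustration.

```python
import json
from typing import List


def destructive_changes(plan: dict) -> List[str]:
    """Return the addresses of resources the plan would delete.
    A replacement shows up as both "delete" and "create" in the
    actions list, so it is caught as well."""
    hits = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions:
            hits.append(rc.get("address", "<unknown>"))
    return hits


if __name__ == "__main__":
    # plan.json is a hypothetical filename produced by the commands above
    with open("plan.json") as fh:
        doomed = destructive_changes(json.load(fh))
    if doomed:
        raise SystemExit("Refusing to apply, plan destroys: " + ", ".join(doomed))
    print("No destructive changes in plan")
```

It does not fix the drift. It just makes sure nobody replaces a node group by accident at hour 60 of an incident.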

This is the reality of “devops best” practices. We build these elaborate systems of automation, but we don’t account for human desperation. When the site is down and the CEO is screaming, nobody cares about the Terraform state file. They care about the ping response. And then, six months later, the SRE on call (me) pays the price for that “temporary” fix.

The Istio Mesh Grinder: When mTLS Becomes a Noose

Istio 1.20 promised better performance and simplified management. What they didn’t mention is that if your certificates expire or if the istiod control plane becomes unreachable during a massive scale-up event, your entire internal network turns into a collection of isolated islands.

We had mTLS (Mutual TLS) enabled globally. It’s a “devops best” practice for Zero Trust security. But mTLS requires a handshake. Handshakes require CPU. When our pods were already CPU-starved because of the Helm chart error, the mTLS handshakes started timing out.

The application logs were a mess of Connection reset by peer and SSL_ERROR_SYSCALL. We spent six hours debugging the database before realizing the database was fine—the application just couldn’t prove its identity to the proxy sitting two inches away from it.

I had to manually bypass the mesh for the auth service. I felt like a traitor to the “devops best” cause. I was punching holes in our security model just to get the login page to load. But that’s the job. SRE isn’t about maintaining a pristine architecture; it’s about keeping the blood flowing in a body that’s trying to die.

The Post-Mortem Lie

We’re going to have a post-mortem on Monday. It will be “blameless.” We will talk about “process improvements” and “automated remediation.” We will use words like “resilience” and “scalability.”

But the truth won’t be in the official document. The truth is that we are over-complicating our systems to the point of incomprehensibility. We are layering abstraction upon abstraction (Helm on Kubernetes on Terraform on AWS) and then wondering why we can’t find the root cause when things break.

“Devops best” practices have become a checklist for compliance rather than a philosophy for reliability. We prioritize “velocity” (pushing broken code faster) over “stability” (making sure the code actually works). We automate the “how” but we’ve forgotten the “why.”

Why are we using a service mesh for a three-tier app?
Why are we running our own etcd instead of using a managed service?
Why do we have 400 microservices when 10 monoliths would do?

The answer is always the same: because it’s “devops best” practice. Because it looks good on a resume. Because we’re afraid of being seen as “legacy.”

A Letter to the Junior Dev who Pushed the Commit

Kid, I saw your name in the git blame.

Don’t apologize. Don’t feel bad. You did exactly what the “devops best” documentation told you to do. You tried to optimize. You tried to follow the “GitOps” workflow. You aren’t the problem. The system that allowed a single character change in a YAML file to take down a global infrastructure is the problem.

But since you’re new here, and I’m too tired to be polite, here is some advice that you won’t find in any “devops best” handbook:

Trust nothing. Not the logs, not the dashboards, and especially not the “Status: Green” page of your cloud provider. If the users are complaining, the system is broken, regardless of what Prometheus says.
Understand the plumbing. Learn how the Linux kernel handles packets. Learn what a file descriptor is. Learn how TCP works. When the abstractions fail—and they will fail—the kernel is the only thing that will tell you the truth.
Read the source code. Don’t just copy-paste Helm charts. Open the templates. See what they’re actually doing to your manifests. Most “devops best” charts are bloated messes of conditional logic that no human can fully comprehend.
Simplicity is a feature. If you can solve a problem with a bash script, don’t use a Kubernetes operator. If you can solve it with a cron job, don’t use a distributed task queue. Every tool you add is another thing that will wake me up at 3 AM.
The “Blast Radius” is your only metric. Before you push anything, ask yourself: “If this goes wrong, how many people will I hurt?” If the answer is “everyone,” your deployment strategy is garbage, no matter how many “devops best” boxes it checks.
On-call is a tax on your soul. Don’t let the company tell you it’s “part of the culture.” It’s a high-stress, high-stakes responsibility that leads to burnout. Protect your sleep. If the system is so fragile that it requires constant human intervention, it’s not “automated”—it’s “manually operated by a tired person.”

I’m going to sleep now. I’ve deleted Slack from my phone. If the cluster dies again, let it. The “devops best” practices will surely save it while I’m dreaming of a world without YAML.

The servers are still warm. The etcd logs are finally quiet. But the regret? That stays. We’ve built a monster, and we call it “modern infrastructure.”

God help us all when the next “minor update” drops.

Technical Appendix for the Masochists:

To recover the cluster, we had to manually rebuild the etcd quorum by injecting a new member and forcing a snapshot restore. We also had to patch the istio-sidecar-injector ConfigMap to include a default terminationGracePeriodSeconds because the pods were being killed before they could flush their logs, making debugging impossible.

If you’re running K8s 1.29.2 with Istio 1.20, check vm.max_map_count and net.ipv4.tcp_tw_reuse on your nodes. Don’t wait for the War Room. Do it now.

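# Current kernel tuning for the "survivor" nodes
# (sysctl -w is runtime only; put these in /etc/sysctl.d/ to survive a reboot)
sysctl -w net.core.somaxconn=32768                    # cap on the listen() accept backlog
sysctl -w net.ipv4.ip_local_port_range="1024 65535"   # widen the ephemeral port range
sysctl -w net.ipv4.tcp_tw_reuse=1                     # reuse TIME_WAIT sockets for new outbound connections
sysctl -w fs.file-max=1000000                         # system-wide open file handle limit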
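Setting these once is not the same as knowing they are still set. Here is a minimal sketch of a node check that reads the live values from /proc/sys and reports anything that differs from the list above. It assumes a Linux host, and the function names are mine. The reader argument exists only so the logic can be exercised without touching a real node.

```python
from pathlib import Path

# Values taken from the sysctl commands above
RECOMMENDED = {
    "net.core.somaxconn": "32768",
    "net.ipv4.ip_local_port_range": "1024 65535",
    "net.ipv4.tcp_tw_reuse": "1",
    "fs.file-max": "1000000",
}


def sysctl_path(name: str) -> Path:
    """Map a dotted sysctl name to its file under /proc/sys."""
    return Path("/proc/sys") / name.replace(".", "/")


def read_sysctl(name: str) -> str:
    """Read a live value, collapsing tabs and newlines to single spaces."""
    return " ".join(sysctl_path(name).read_text().split())


def find_mismatches(expected=RECOMMENDED, reader=read_sysctl) -> dict:
    """Return {name: (current, wanted)} for every value that differs.
    Unreadable keys are reported with a current value of None."""
    mismatches = {}
    for name, want in expected.items():
        try:
            have = reader(name)
        except OSError:
            have = None
        if have != want:
            mismatches[name] = (have, want)
    return mismatches


if __name__ == "__main__":
    for name, (have, want) in find_mismatches().items():
        print(f"{name}: have {have!r}, want {want!r}")
```

Run it from whatever already runs on every node. An empty output means the tuning survived the last reboot.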

And for the love of all that is holy, stop using latest tags in your Helm charts. You aren’t being “agile”; you’re being a martyr.
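If asking nicely doesn’t work, lint for it. This is a small sketch of a check you could run over the image strings in your rendered manifests. It treats a missing tag the same as latest, because that is what the container runtime will pull, and it lets digest-pinned images through. The function name and the example image names are made up for illustration.

```python
def uses_floating_tag(image: str) -> bool:
    """True when an image reference is untagged or tagged "latest"."""
    if "@" in image:
        # Pinned by digest, e.g. name@sha256:...
        return False
    # Only look after the last "/", so a registry port such as
    # registry.local:5000 is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return True  # no tag means "latest" by default
    tag = last_segment.rsplit(":", 1)[1]
    return tag == "latest"


if __name__ == "__main__":
    for ref in ("nginx", "nginx:latest", "nginx:1.25.4",
                "registry.local:5000/auth-service",
                "registry.local:5000/auth-service:v2"):
        print(ref, uses_floating_tag(ref))
```

Fail the pipeline when it returns True for anything headed to production.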

The “devops best” way is often just the most expensive way to fail. Redemption only comes when you stop believing the marketing and start looking at the packets.

I’m out. Don’t page me.
