10 Essential DevOps Best Practices for 2024 Success

The smell of ozone didn’t come from a short circuit. It came from the laser printer in the corner of the NOC, churning out five hundred pages of stack traces because some “architect” decided that Kubernetes v1.16 was the perfect time to finally delete the extensions/v1beta1 API group without checking our legacy Helm 2.14 charts.

It was 3:14 AM. The pager on my belt—a physical object I keep because I don’t trust phone notifications—was vibrating so hard it left a bruise on my hip. Every single deployment in the production cluster was failing. The API server was throwing 404s like a pitcher in the bottom of the ninth.

$ kubectl apply -f legacy-deployment.yaml
error: unable to recognize "legacy-deployment.yaml": no matches for kind "Deployment" in version "extensions/v1beta1"
$ helm install --name my-app ./charts/my-app
Error: release my-app failed: admission webhook "validate.nginx.ingress.kubernetes.io" denied the request: 
  Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": 
  Post https://ingress-nginx-controller-admission.kube-system.svc:443/networking/v1beta1/ingresses?timeout=30s: 
  service "ingress-nginx-controller-admission" not found

That was the night I realized “innovation” is just a polite word for “breaking things that worked yesterday.” We spent twelve hours rewriting YAML manifests by hand, caffeinated on stale coffee and spite. This is my manifesto. It is written in the blood of weekend plans and the scar tissue of failed migrations. If you want “devops best” practices, don’t look at a slide deck. Look at the wreckage.

The YAML Cemetery

We were told that “Infrastructure as Code” would save us. They lied. They just traded manual errors for automated catastrophes. In 2017, we adopted Helm v2.11. We thought Tiller—the server-side component that ran with full cluster-admin privileges—was a gift. It was a Trojan horse.

I watched a junior engineer accidentally run helm delete --purge on a production namespace because the context was set incorrectly in their terminal. Tiller, being a mindless drone, obliged. It didn’t just delete the deployments; it wiped the ConfigMaps, the Secrets, and the persistent volume claims. The data was gone. The “code” had executed a scorched-earth policy on our database.

The correction is brutal: Stop treating YAML like a programming language and start treating it like a liability. The “devops best” approach isn’t more abstraction; it’s more validation. We moved to Helm 3.x specifically to kill Tiller. We implemented kube-linter and conftest to ensure no manifest enters the cluster without a resource limit or a non-root user ID.

# The sin: No limits, running as root
apiVersion: apps/v1
kind: Deployment
metadata:
  name: risky-app
spec:
  selector:
    matchLabels:
      app: risky-app
  template:
    metadata:
      labels:
        app: risky-app
    spec:
      containers:
      - name: app
        image: my-app:latest # Never use latest. Ever.
        # No resources, no securityContext: unbounded memory, running as
        # whatever user the image defaults to. Usually root.

The “devops best” way to handle this is to treat your cluster like a high-security prison, not a playground. Use Helm 3.12+ for its improved security posture. Use OPA Gatekeeper to reject any pod that doesn’t have a resources.limits.cpu set. If it isn’t constrained, it isn’t production-ready.
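For contrast, here is a minimal sketch of what a manifest has to look like before the gate lets it through. The name, image tag, and numbers are illustrative, not our production values:

```yaml
# The penance: limits, non-root, pinned image
apiVersion: apps/v1
kind: Deployment
metadata:
  name: risky-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: risky-app
  template:
    metadata:
      labels:
        app: risky-app
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001        # anything but 0
      containers:
      - name: app
        image: my-app:1.4.2     # a pinned tag, never latest
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m           # no limits.cpu, no admission
            memory: 256Mi
```

If it isn't in the manifest, Gatekeeper rejects it at admission time, before it ever touches a node.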

The Ghost in the Jenkins Pipeline

Jenkins v2.204 is a haunted house. I have spent more time debugging Groovy scripts than I have spent with my family. We had a “Golden Pipeline” that everyone used. It was 4,000 lines of shared library code that nobody understood.

One Tuesday, a plugin update for the “Git Plugin” broke the way environment variables were passed to shell steps. Suddenly, the DEPLOY_ENV variable was empty. The pipeline defaulted to the first entry in the script: production. We were running “test” builds against the production database for six hours before anyone noticed the “test” data—thousands of entries for “Mickey Mouse”—in the real customer tables.

The “devops best” fix for this isn’t a better Jenkins plugin. It’s the total elimination of stateful CI/CD workers. We burned the Jenkins VM to the ground and moved to ephemeral runners using Docker 20.10. Every build starts in a clean room. If the build needs a tool, it’s baked into the runner image. No more sudo apt-get install in the middle of a pipeline.

# The failure: Jenkins agent with "bit rot"
$ mvn clean install
[ERROR] Failed to execute goal ... java.lang.OutOfMemoryError: Java heap space
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        40G   40G    0G 100% / # Jenkins workspace filled with 2 years of junk

The correction: Use a tool like GitHub Actions Runner Controller or GitLab Runners on Kubernetes 1.25+. Set a ttlSecondsAfterFinished on your jobs. If a build agent lives longer than thirty minutes, it’s a liability. It’s a pet. Kill your pets.
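A sketch of what that looks like as a Kubernetes Job — the image name and entrypoint are placeholders, and ttlSecondsAfterFinished is stable as of Kubernetes 1.23:

```yaml
# An ephemeral build agent that cleans up after itself
apiVersion: batch/v1
kind: Job
metadata:
  name: build-runner
spec:
  ttlSecondsAfterFinished: 300   # garbage-collected 5 minutes after it finishes
  activeDeadlineSeconds: 1800    # no agent lives longer than thirty minutes
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: runner
        image: ci-runner:2024.1  # tools baked in; no apt-get mid-pipeline
        command: ["./run-build.sh"]
```

The TTL controller deletes the Job and its pod automatically. There is nothing left to accumulate two years of junk.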

The 1500-Byte Chokehold

Networking is the dark art that modern SREs ignore until the latency spikes. I remember a week in 2018 when our microservices started randomly timing out. Only on Tuesdays. Only between two specific racks in the data center.

We checked the logs. We checked Prometheus 2.2. We checked the application code. Everything looked “seamless”—a word I hate because it usually means “the seams are hidden until they burst.”

The culprit was the MTU (Maximum Transmission Unit). We were running an Overlay Network (Flannel) on top of an AWS VPC. The VPC had an MTU of 1500. The Flannel VXLAN encapsulation added 50 bytes of overhead. When a service sent a 1500-byte packet, the network dropped it because it became 1550 bytes and the “Don’t Fragment” bit was set.

# Debugging the silent killer
$ ping -s 1422 -M do 10.244.1.5
PING 10.244.1.5 (10.244.1.5) 1422(1450) bytes of data.
1430 bytes from 10.244.1.5: icmp_seq=1 ttl=64 time=0.842 ms

$ ping -s 1472 -M do 10.244.1.5
PING 10.244.1.5 (10.244.1.5) 1472(1500) bytes of data.
# ...silence. 100% packet loss. The frame becomes 1550 bytes after
# encapsulation, the fabric drops it, and nothing sends an error back.

The “devops best” practice here is boring: Standardize your MTU at 1450 across the entire stack if you’re using any kind of encapsulation. Don’t trust the defaults. The defaults are designed for a perfect world, and we live in a dumpster fire. We spent three days running tcpdump on worker nodes to find those dropped packets. Now, we check MTU in our node-problem-detector configs.
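The arithmetic is simple enough to script. A sketch using the numbers from our incident — adjust the overhead constant for your own encapsulation:

```shell
#!/bin/sh
# Compute the largest ICMP payload that survives VXLAN encapsulation.
# 50 bytes of overhead is VXLAN; Geneve and IPsec differ.

LINK_MTU=1500        # the underlying VPC network
VXLAN_OVERHEAD=50    # outer IP + UDP + VXLAN headers
IP_HEADER=20
ICMP_HEADER=8

POD_MTU=$((LINK_MTU - VXLAN_OVERHEAD))          # what the pod interface must use
MAX_PING=$((POD_MTU - IP_HEADER - ICMP_HEADER)) # largest "ping -s" that fits

echo "pod mtu: $POD_MTU"
echo "max ping payload: $MAX_PING"
```

With these inputs, `ping -s 1422 -M do` across the overlay should succeed and anything larger should not.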

The Cardinality Inferno

Prometheus 2.45 is a masterpiece, but in the hands of a “full-stack” developer, it’s a flamethrower aimed at your RAM. I once saw a Prometheus instance get OOMKilled (killed by the kernel’s Out-of-Memory killer) every ten minutes because someone decided to add user_id as a label to a http_request_duration_seconds histogram.

We had 100,000 users. Each user created a new time series. The TSDB (Time Series Database) exploded. The memory usage went from 8GB to 128GB in an hour.

# The sound of a dying Prometheus
$ journalctl -u prometheus -f
... prometheus[1234]: tsdb: compacting blocks: out of memory
... kernel: [98765.432] Out of memory: Kill process 1234 (prometheus) score 950 or sacrifice child

The correction: “Devops best” means strict label hygiene. You never, ever put high-cardinality data in labels. No IDs, no emails, no timestamps. If you need that level of detail, you use logs (and even then, be careful). We implemented prometheus-cardinality-exporter to alert us when any metric exceeded 10,000 unique series. We also started using recording rules to pre-aggregate data so the dashboards didn’t have to do the heavy lifting at 3:00 AM.
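The math behind the explosion is just multiplication. A sketch with illustrative label counts; only the user count comes from our incident:

```shell
#!/bin/sh
# Label cardinalities multiply. Every new label value is a new time series.

METHODS=5        # GET, POST, PUT, PATCH, DELETE
STATUSES=10      # the status codes you actually serve
PATHS=15         # templated routes, not raw URLs
BUCKETS=12       # histogram le= buckets
USERS=100000     # the label that killed us

SANE=$((METHODS * STATUSES * PATHS * BUCKETS))
INSANE=$((SANE * USERS))

echo "series without user_id: $SANE"
echo "series with user_id:    $INSANE"
```

Without user_id, the metric sits under a 10,000-series alert threshold. With it, you are shopping for RAM in units of 128GB.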

The State of Sin

The biggest lie of the cloud era is that “everything is stateless.” Tell that to the database.

In 2020, we tried running a high-traffic Postgres 12 cluster on Kubernetes 1.18 using EBS volumes. We thought the PersistentVolumeClaim system was magic. It wasn’t. During a routine node upgrade, the AWS API lagged. The volume didn’t detach from the old node fast enough. The new node tried to mount it and failed. The database was stuck in a “Multi-Attach Error” loop for four hours.

Events:
  Type     Reason              Age                From                Message
  ----     ------              ----               ----                -------
  Warning  FailedAttachVolume  5m (x15 over 30m)  attachdetach-controller  Multi-Attach error for volume "pvc-1234" Volume is already used by pod "postgres-0" on node "ip-10-0-1-5.ec2.internal"

The “devops best” way to handle state is to admit it’s a burden. If you must run databases in Kubernetes, use a dedicated Operator like Zalando’s Postgres Operator or CloudNativePG. But more importantly, use local NVMe storage for performance and use cross-region replication that doesn’t rely on the K8s control plane. We learned to stop trusting the “seamless” volume migration and started building application-level redundancy. If the pod dies, the standby takes over via a virtual IP or a service mesh (Istio 1.15+), not by waiting for a physical disk to move across the data center.
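If you do put a database on local NVMe, the StorageClass is deliberately dumb. A sketch — the name is illustrative, the fields are standard Kubernetes:

```yaml
# Local NVMe: no dynamic provisioner, no volume that can "move"
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-nvme
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer  # bind only when the pod schedules,
                                         # so pod and disk land on the same node
reclaimPolicy: Retain                    # a deleted claim must not wipe a database disk
```

There is no Multi-Attach error to wait out, because the disk never pretends it can detach and follow the pod.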

The Terraform Trap

Terraform 0.11 was a wild west. We had one massive state file for the entire production environment. If you wanted to change a security group rule, you had to run a terraform plan that checked 5,000 resources. It took twenty minutes. If two people ran it at the same time, the state lock would break, and you’d end up with a corrupted .tfstate file.

I remember the day the state file got corrupted and Terraform decided that the “fix” was to delete the production VPC and recreate it. I had to physically pull the Ethernet cable out of my laptop to stop the execution.

$ terraform apply
...
-/+ aws_vpc.main (forces new resource)

Plan: 1 to add, 0 to change, 450 to destroy.
Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: ^C^C^C^C # PANIC

The correction: Small state files. One for networking, one for the EKS cluster, one for each application stack. Use Terraform 1.5+ with import blocks and moved blocks to refactor without destruction. “Devops best” is isolation. If the “code” goes rogue, it should only be able to burn down a single room, not the whole building. We also implemented Atlantis to force all changes through Pull Requests, with mandatory peer review. No more “cowboy” applies from local machines.
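A sketch of the 1.5+ refactoring primitives — the resource addresses and the VPC ID here are illustrative:

```hcl
# Rename in code without a destroy/create cycle:
moved {
  from = aws_security_group.app
  to   = aws_security_group.app_backend
}

# Adopt an existing resource into state instead of recreating it:
import {
  to = aws_vpc.main
  id = "vpc-0a1b2c3d4e5f67890"
}
```

Both are plan-time operations: terraform plan shows the state surgery before anything touches AWS, which is exactly the review surface Atlantis needs.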

The Manifesto of Reality

You don’t want innovation. You want a system that is so boring it puts you to sleep.

Real “devops best” practices aren’t about the latest tool. They are about the “Ops” side of the house that everyone ignores. It’s about having a sysctl.conf that is tuned for high-concurrency. It’s about knowing that conntrack tables can fill up and drop packets. It’s about understanding that every time you add a layer of abstraction (like a Service Mesh), you are adding a layer of failure that you will eventually have to debug with gdb or strace.
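The boring tuning lives in files like this. A sketch of a sysctl fragment; treat the values as starting points sized to your traffic, not gospel:

```
# /etc/sysctl.d/99-tuning.conf (illustrative values)
net.netfilter.nf_conntrack_max = 1048576   # a full conntrack table drops packets silently
net.core.somaxconn = 4096                  # accept-queue depth for high-concurrency listeners
net.ipv4.ip_local_port_range = 1024 65535  # more ephemeral ports for outbound churn
net.ipv4.tcp_tw_reuse = 1                  # recycle TIME_WAIT sockets for outgoing connections
```

None of this is glamorous. All of it is what actually pages you at 3 AM.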

I’ve seen the “Cloud Migration Wars.” I’ve seen companies spend millions to move from VMs to Kubernetes, only to realize their application’s bottleneck was a single-threaded Python process from 2012.

Here is the truth: Your infrastructure is a leaky pipe. The water is your data. The pressure is your traffic. My job—and your job, if you want to survive—is to stop the leaks before the basement floods.

We don’t use “vibrant” dashboards to feel good. We use them to see the exact moment the pipe starts to crack. We don’t “empower” developers to ship faster; we give them a paved road with guardrails so thick they can’t drive off the cliff.

The Final Log: A Lesson in Humility

Last month, a deployment failed. A simple Nginx 1.25 update. The exit code was 1. The logs were cryptic.

$ kubectl logs nginx-7f8d9b6c5-x4z2p
2023/10/14 14:20:11 [emerg] 1#1: open() "/etc/nginx/nginx.conf" failed (13: Permission denied)
nginx: [emerg] open() "/etc/nginx/nginx.conf" failed (13: Permission denied)
$ kubectl get pod nginx-7f8d9b6c5-x4z2p -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}'
1

The “marketing” version of DevOps would say we need a more “comprehensive” observability “landscape.”

The “battle-scarred” version of DevOps knew exactly what it was: Someone had updated the base image, and the new image ran as a non-root user, but the ConfigMap mount was still owned by root.

The fix:

securityContext:
  runAsUser: 101
  fsGroup: 101

Three lines of YAML. Ten years of experience to know where to look. That is the only “devops best” practice that matters: knowing how the machinery actually works under the grease and the rust.

Now, if you’ll excuse me, the pager is quiet. I’m going to try to get three hours of sleep before the next “innovation” breaks the world. Don’t call me unless the building is literally on fire. And even then, check the MTU first.
