DevOps Best Practices: Why Your Pipeline is a Liability

It was 3:14 AM on a Tuesday in 2018. I was staring at a terminal window, watching a Jenkins pipeline spin in a loop. We were migrating our core payment processing service to a new Kubernetes cluster in us-east-1. I had written a “clever” shell script that used sed to inject environment variables into a YAML manifest before running kubectl apply. I thought I was being efficient. I wasn’t. A malformed string caused the script to wipe the spec.selector field from the Deployment. Kubernetes, doing exactly what I told it to do, orphaned the existing pods and started spinning up new ones that couldn’t find their target. Traffic to api.stripe.com started failing. Our error rates hit 100%. The “clever” script had effectively deleted our production environment’s ability to route traffic.

I spent the next four hours manually rebuilding the state while my manager breathed down my neck on a Zoom call. That night, I learned that “DevOps” isn’t about tools, scripts, or being clever. It’s about building systems that are boring, predictable, and resistant to human stupidity. If your deployment process requires a “hero” to stay awake and watch the logs, you don’t have a DevOps culture; you have a hostage situation. Most blog posts will tell you that DevOps is about “breaking down silos” or “accelerating delivery.” I’m here to tell you that most DevOps best practices are actually about preventing you from setting your infrastructure on fire.

The Fallacy of the “Latest” Tag

Stop using :latest. Just stop. It is the single most dangerous habit in container orchestration. When you pull node:latest or python:3.9, you are playing Russian Roulette with your build’s reproducibility. One morning, the maintainers push a patch that changes a shared library, and suddenly your production build fails because of a glibc mismatch that didn’t exist ten minutes ago.

Immutability is the only way to maintain sanity. Every image you build should be tagged with a git commit SHA or a semantic version. Better yet, reference the image by its SHA-256 digest. This ensures that the bits you tested in staging are the exact same bits running in production. If you can’t guarantee that, your testing is a lie.

  • Reference images by digest: my-app@sha256:85755305246504ca827...
  • Never use imagePullPolicy: Always in production unless you enjoy random outages during node restarts.
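Put together, those two rules look like this in a Deployment spec. A minimal, hypothetical fragment (the image name is illustrative, and the digest is the truncated one from above):

```yaml
# Hypothetical Deployment fragment -- names and digest are illustrative.
spec:
  template:
    spec:
      containers:
        - name: my-app
          # Pinned by digest: this exact image, no matter where the tag moves later.
          image: my-reg/my-app@sha256:85755305246504ca827...
          # IfNotPresent: a node restart reuses the cached image instead of
          # depending on the registry being reachable at 3 AM.
          imagePullPolicy: IfNotPresent
```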
# Bad Dockerfile
FROM node:latest
COPY . .
RUN npm install
CMD ["node", "index.js"]

# Better Dockerfile (Deterministic)
FROM node:20.11.0-bookworm-slim@sha256:69396f866416629...
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --omit=dev
COPY src/ ./src/
USER node
CMD ["node", "src/index.js"]

Pro-tip: Use npm ci instead of npm install in your CI pipelines. It deletes the node_modules folder and installs the exact versions from your lockfile. It’s faster and prevents “it works on my machine” syndrome.

Infrastructure as Code (IaC) is Not Just “Scripts in Git”

Most teams think they are doing IaC because they have some Terraform files in a repository. Then you look at their state file, and it’s stored locally on a lead engineer’s laptop. Or worse, they have “drift”—someone logged into the AWS console and manually changed a security group rule to “fix a quick issue,” and now the code doesn’t match reality. When you run terraform plan, it wants to delete the manual change, and you’re too scared to run apply.

If you aren’t applying your infrastructure through a CI pipeline, you aren’t doing IaC. You’re doing “Manual Infrastructure with Extra Steps.” You need remote state with locking. If two people run Terraform at the same time and you don’t have locking, you will corrupt your state file. I have seen a corrupted state file turn a 5-minute update into a 3-day recovery effort involving terraform import and a lot of crying.

# terraform/backend.tf
terraform {
  backend "s3" {
    bucket         = "my-company-terraform-state"
    key            = "production/network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-lock-table"
    encrypt        = true
  }
}
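The backend above assumes the lock table already exists, and its shape is fixed: the S3 backend requires a DynamoDB table whose hash key is a string attribute named LockID. A hypothetical bootstrap definition (it has to live outside this backend, since the table can’t lock its own creation):

```hcl
# Bootstrap-only: create this before pointing the backend at it.
resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-lock-table"
  billing_mode = "PAY_PER_REQUEST" # locking traffic is tiny; don't provision capacity
  hash_key     = "LockID"          # the exact key name the S3 backend expects

  attribute {
    name = "LockID"
    type = "S"
  }
}
```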

The dynamodb_table is not optional. It’s the only thing standing between you and a race condition that nukes your VPC. Also, stop using chmod 777 on everything because you’re frustrated with permissions. I once found a production S3 bucket with public read/write access because a junior dev “couldn’t get the IAM policy to work.” Use the principle of least privilege. It’s annoying, it’s slow, but it’s the only way to avoid being the lead story on Krebs on Security.

The Alpine vs. Debian-Slim Debate

The hype cycle will tell you to use Alpine Linux for everything because it’s “small.” A 5MB base image sounds great until you realize it uses musl instead of glibc. If you are running Python, Ruby, or Node.js apps that rely on C extensions (like pandas, bcrypt, or grpc), you are going to have a bad time. You will spend hours debugging weird segmentation faults or watching your build times triple because you have to compile every dependency from source since there are no pre-built wheels for musl.

I take a hard stance here: Use debian-slim (specifically the bookworm or bullseye variants). It’s 30MB larger, but it’s compatible with almost everything. Disk space is cheap; engineering time spent debugging ldd errors is expensive. Your DevOps best practices should prioritize stability over shaving 20MB off an image that’s going to be cached on the node anyway.

  • Alpine: Good for Go or Rust (statically linked binaries).
  • Debian-Slim: Good for everything else.
  • Distroless: Great for security, but a nightmare to debug when you need to exec into a pod to see why a config file is missing.

CI/CD: The “Continuous” Part is a Lie

We call it Continuous Deployment, but for most, it’s “Continuous Anxiety.” A common mistake is building a pipeline that is too long. If your pipeline takes 45 minutes to run, developers will start batching commits. Batching commits makes it impossible to identify which change broke the build. Your CI should provide feedback in under 10 minutes. If it doesn’t, you need to parallelize your tests or fix your slow-ass Docker builds.
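One concrete way to get under that 10-minute bar is to shard the test suite across parallel jobs. A hypothetical GitHub Actions fragment — the --shard flag assumes a test runner that supports it (Jest 28+, for example):

```yaml
# Hypothetical test job: four shards run concurrently, so wall-clock time
# is roughly a quarter of the serial run.
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npx jest --shard=${{ matrix.shard }}/4
```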

One of the most important DevOps best practices is the “Build Once, Deploy Many” rule. You should build your artifact (Docker image, JAR, binary) at the start of the pipeline. That same artifact should move through staging, UAT, and production. If you are rebuilding the code for each environment, you are not testing what you are deploying. You are testing a *copy* of what you are deploying. Subtle differences in build environments can and will break your app.

# .github/workflows/deploy.yml
jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      image_tag: ${{ steps.vars.outputs.tag }}
    steps:
      - uses: actions/checkout@v4
      - name: Build and Push
        id: vars   # the job's outputs block reads steps.vars.outputs.tag
        run: |
          TAG=$(git rev-parse --short HEAD)
          docker build -t my-reg/app:$TAG .
          docker push my-reg/app:$TAG
          echo "tag=$TAG" >> $GITHUB_OUTPUT

  deploy-staging:
    needs: build
    environment: staging
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Update K8s
        run: |
          sed -i "s|image:.*|image: my-reg/app:${{ needs.build.outputs.image_tag }}|" k8s/deploy.yml
          kubectl apply -f k8s/deploy.yml

Note the use of git rev-parse --short HEAD. This links the deployment directly to a specific commit. If production goes down, I know exactly which lines of code are responsible. No guessing. No “I think it was the merge from yesterday.”

Observability: Stop Looking at Dashboards

Dashboards are for managers. Alerts are for engineers. If you have a dashboard with 50 widgets, you aren’t monitoring; you’re painting. You cannot look at 50 widgets during an incident. You need to define your SLIs (Service Level Indicators) and SLOs (Service Level Objectives). Focus on the “Four Golden Signals”: Latency, Traffic, Errors, and Saturation.

High cardinality is the silent killer of Prometheus. If you start adding user_id or order_id as a label in your Prometheus metrics, you will blow up your TSDB (Time Series Database). Your Prometheus instance will start consuming 64GB of RAM and then OOM-kill itself right when you need it most. Keep your labels low-cardinality. Use logs or traces for high-cardinality data.

Note to self: Check the prometheus_tsdb_head_series metric. If it’s climbing linearly, someone added a uuid label to a counter. Find them. Educate them. Or take away their keyboard.
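If you find the offending label before you find the offending engineer, you can stop the bleeding at scrape time. A hypothetical Prometheus fragment that strips a user_id label before ingestion — note that series which differed only by that label will collide, so this is a stopgap, not a fix:

```yaml
# Hypothetical stopgap: drop a high-cardinality label at scrape time.
# Fix the instrumentation at the source as well.
scrape_configs:
  - job_name: api-server
    metric_relabel_configs:
      - action: labeldrop
        regex: user_id
```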

Real-world example of a useful Prometheus query for an SLO (99th percentile latency over 5 minutes):

histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{job="api-server"}[5m]))) > 0.5

If this query returns a result, someone is getting paged. It’s actionable. It’s clear. It’s not a “pretty graph” that no one looks at.
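Wired into a Prometheus alerting rule, that query stops being a graph and starts being a pager. A sketch, with the alert name and severity label as placeholders:

```yaml
groups:
  - name: api-slo
    rules:
      - alert: ApiP99LatencyHigh
        # Same query as above; "for: 5m" means one bad scrape doesn't page anyone.
        expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{job="api-server"}[5m]))) > 0.5
        for: 5m
        labels:
          severity: page
```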

The Database Migration Nightmare

Everyone forgets the database. You can roll back a container in 10 seconds. You cannot roll back a DROP COLUMN on a 5TB table in 10 seconds. DevOps best practices dictate that database migrations must be decoupled from code deployments. Your code should always be compatible with the N-1 version of the database schema.

If you need to rename a column, it’s a three-step process spread across multiple deployments:

  1. Add the new column, and write to both the old and new columns.
  2. Backfill the data from the old column to the new one.
  3. Update the code to read from the new column, then delete the old column in a separate migration weeks later.

If you try to do this in one go, and the code deployment fails, you are stuck. You can’t roll back the code because the old code doesn’t know about the new schema. This is how data corruption happens. This is how you lose your job.
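The three steps above map to plain DDL. A sketch with hypothetical table and column names (on a real 5TB table, the backfill must be batched by primary-key range, not run as one statement):

```sql
-- Deployment 1: expand. Old code ignores the new column; new code writes to both.
ALTER TABLE users ADD COLUMN display_name TEXT;

-- Out of band: backfill (batched in practice to avoid long locks).
UPDATE users SET display_name = username WHERE display_name IS NULL;

-- Deployment 2, weeks later: contract. Only after nothing reads the old column.
ALTER TABLE users DROP COLUMN username;
```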

Secret Management: Base64 is Not Encryption

I am still shocked by how many people think that Kubernetes Secrets are “secure.” They are just Base64-encoded strings. Anyone with RBAC permission to get Secrets can read them. If you check your .env files into Git, you might as well post your AWS secret keys on Twitter. Use a real secret manager like HashiCorp Vault, AWS Secrets Manager, or at the very least, sops to encrypt your secrets at rest within your Git repo.

# Example of using sops to encrypt a secret file
sops --encrypt --gcp-kms projects/my-project/locations/global/keyRings/my-ring/cryptoKeys/my-key secret.yaml > secret.enc.yaml

This allows you to keep your configuration in Git (GitOps) without exposing the sensitive bits. When the CI/CD pipeline runs, it uses a service account with permission to decrypt the file. It’s a bit more friction, but it prevents the “oops, I leaked the Stripe API key” post-mortem.
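In practice you don’t type that KMS path on every invocation; sops reads it from a .sops.yaml at the repo root. A hypothetical example that also limits encryption to the secret values, so the rest of the manifest stays diffable:

```yaml
# Hypothetical .sops.yaml -- sops picks the key automatically for matching files.
creation_rules:
  - path_regex: .*secret.*\.yaml$
    # Only encrypt the values; metadata stays readable in git diffs.
    encrypted_regex: ^(data|stringData)$
    gcp_kms: projects/my-project/locations/global/keyRings/my-ring/cryptoKeys/my-key
```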

The “Gotcha”: The Hidden Cost of Managed Services

Managed services (RDS, EKS, Managed Kafka) are great until they aren’t. People think “Managed” means “I don’t have to worry about it.” Wrong. Managed means “I don’t have to manage the hardware, but I still have to manage the configuration.” I once saw a team spend $20,000 in a single month because they enabled “Detailed Monitoring” on 500 CloudWatch metrics they never looked at. Or the time an “Auto-scaling” group scaled to 200 instances during a DDoS attack, costing a fortune because there was no upper limit set.

You must set guardrails. Max instance counts, budget alerts, and TTLs (Time To Live) on experimental resources. In a cloud-native world, your DevOps best practices must include “Cloud Financial Management” (FinOps). If you don’t, your CFO will become your most frequent on-call page.

  • Always set max_size on Auto Scaling Groups.
  • Use taints and tolerations in K8s to keep your simple web-cron jobs off the expensive GPU nodes.
  • Delete your unused EBS volumes. They are the “vampire power” of AWS.
  • Set a 7-day retention policy on your non-production logs. You don’t need 2-year-old logs for a dev environment that doesn’t exist anymore.
  • Audit your S3 storage classes. Moving old logs to Glacier can save 80% on storage costs.
  • Use Spot instances for non-critical background jobs, but ensure your app can handle a SIGTERM gracefully.
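The first bullet, in Terraform terms — a hypothetical sketch with the launch template, subnets, and health checks omitted:

```hcl
resource "aws_autoscaling_group" "api" {
  name     = "api-asg"
  min_size = 2
  max_size = 20 # the hard ceiling that would have stopped the 200-instance bill
  # launch_template, vpc_zone_identifier, health checks, etc. omitted

  # Tag experimental resources with an expiry a cleanup job can act on.
  tag {
    key                 = "ttl"
    value               = "7d"
    propagate_at_launch = true
  }
}
```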

The Human Element: On-Call is a Feedback Loop

If your developers aren’t on-call for the code they write, they will never write stable code. This is the core of DevOps. When a developer gets woken up at 2 AM because their new feature is throwing 500 errors, they become very motivated to write better tests and include better error handling. If the SRE team is the only one getting paged, the developers have no incentive to improve. They will keep throwing “features” over the wall, and the SREs will keep burning out.

But on-call shouldn’t be a punishment. If a team is getting paged more than twice a week, the sprint should be stopped, and the next two weeks should be dedicated entirely to “Reliability Work.” No new features. Just fixing the technical debt that is causing the pages. This is how you build a sustainable culture. You cannot “DevOps” your way out of a toxic work environment that prioritizes velocity over stability.

YAML-Hell and the Complexity Trap

We’ve traded “DLL Hell” for “YAML Hell.” Between Kubernetes manifests, Helm charts, and CI/CD definitions, we are drowning in indentation-sensitive configuration. My advice? Keep it as flat as possible. Avoid deeply nested Helm charts with 500 lines of values.yaml. If you need a PhD to understand how a service is deployed, your abstraction is too leaky.

I prefer Kustomize over Helm for internal apps. It’s just plain YAML with overlays. No complex templating logic. No {{ if .Values.global.enabled }} blocks that make your eyes bleed. It’s easier to debug and easier to audit. Remember: The goal of DevOps is to reduce cognitive load, not increase it.

# kustomization.yaml
resources:
  - ../base
patches:
  - path: replica_count.yaml
  - path: env_vars.yaml

It’s simple. It’s readable. It works.
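The overlay files themselves stay tiny. A hypothetical replica_count.yaml — just the fields being overridden, matched to the base by name:

```yaml
# Hypothetical replica_count.yaml: a strategic-merge patch, not a full manifest.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app   # must match the Deployment name in ../base
spec:
  replicas: 5    # the only field this overlay changes
```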

Testing the Un-testable

Unit tests are fine, but they won’t tell you if your IAM role has the right permissions to write to DynamoDB. For that, you need integration tests in a real environment. Tools like LocalStack are okay, but nothing beats a “Sandbox” AWS account where you can run terraform apply and run actual functional tests against real AWS APIs. Yes, it costs a few dollars. No, it’s not as expensive as a production outage.

And for the love of all that is holy, test your backups. A backup that hasn’t been restored is just a theoretical exercise. I’ve seen companies lose weeks of data because they were “backing up” to a corrupted S3 bucket for months and never checked if the tar files were actually valid. Schedule a “Restoration Day” once a quarter. If you can’t bring your system up from scratch in a new region in under 4 hours, you don’t have a disaster recovery plan; you have a hope.
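A restore drill doesn’t have to be elaborate to be better than nothing. A minimal sketch of the “actually read the archive back” step, assuming gzipped tar backups (paths and contents are illustrative):

```shell
#!/bin/sh
# Verify a tar backup by listing AND restoring it, then comparing to the source.
set -eu

workdir=$(mktemp -d)
echo "important data" > "$workdir/db_dump.sql"

# Stand-in for what the nightly backup job produces.
tar -czf "$workdir/backup.tar.gz" -C "$workdir" db_dump.sql

# Step 1: the archive must be readable end-to-end; a truncated or corrupt
# file fails here instead of on restore day.
tar -tzf "$workdir/backup.tar.gz" > /dev/null && echo "backup readable"

# Step 2: restore to a fresh directory and compare against the source.
mkdir "$workdir/restore"
tar -xzf "$workdir/backup.tar.gz" -C "$workdir/restore"
cmp "$workdir/db_dump.sql" "$workdir/restore/db_dump.sql" && echo "restore verified"

rm -rf "$workdir"
```

In real life, replace the echo with your actual pg_dump or mysqldump output, and run the drill on a schedule, not once.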

DevOps is the practice of being relentlessly disciplined about the boring stuff. It’s about pinning versions, locking state, limiting permissions, and actually reading the documentation before you copy-paste from StackOverflow. It’s not flashy. It won’t get you a keynote at a conference. But it will let you sleep through the night. And in this industry, that is the only metric that matters.

Stop chasing the hype and start fixing your :latest tags.
