10 DevOps Best Practices for Faster Software Delivery

Stop Building Pipelines and Start Building Systems: A Decade of DevOps Regrets

It was 3:15 AM on a Tuesday in 2017. I was staring at a Grafana dashboard that looked like a heart monitor for a patient in cardiac arrest. We had just “automated” our deployment pipeline using a series of nested Jenkins Groovy scripts that someone—probably me—thought was clever. I pushed a change to a shared library, thinking it would only affect the staging environment. It didn’t. Because our “best practices” at the time involved a shared Jenkins master with global credentials, the script executed a terraform destroy on our production VPC. I watched, paralyzed, as 400 EC2 instances transitioned to ‘shutting-down’ in unison.

The recovery took fourteen hours. We didn’t have a backup of the Terraform state file because “S3 versioning is expensive.” We didn’t have a manual gate because “gates slow down velocity.” That night, I learned that most of what people call “DevOps” is just a high-speed way to shoot yourself in the foot. If you’re looking for a guide on how to use a specific tool to “transform your enterprise,” close this tab. This is about the technical trade-offs that actually keep the lights on when the hype dies down.

The Myth of the “Best Practice”

The industry loves the phrase “best practice.” It suggests there is a single, correct way to configure a YAML file. There isn’t. There are only trade-offs. Most documentation you read is written by Developer Advocates trying to show you the “Happy Path.” They show you how to deploy a “Hello World” app in five minutes. They never show you what happens when your node_modules folder hits 2GB and your CI runner runs out of disk space, or when your Kubernetes LivenessProbe starts killing healthy pods because of a temporary network blip to your database.

Real DevOps isn’t about tools. It’s about reducing the cognitive load on the person who gets paged at 3 AM. If your “automated” system is so complex that no one can debug it under pressure, it’s not a best practice; it’s a liability. We need to stop optimizing for “speed of setup” and start optimizing for “debuggability at scale.”

Pro-tip: If your CI/CD pipeline takes more than 10 minutes to run, your developers are already on Reddit. If it takes more than 20, they’ve forgotten what they were even trying to deploy.

CI/CD: The Pipeline is Not the Product

Most CI pipelines are a mess of shell scripts disguised as YAML. People treat GitHub Actions or GitLab CI as a place to dump every bash command they can think of. This is how you end up in YAML-hell. You can’t easily test a GitHub Actions workflow locally. You end up pushing “test: fix typo” commits fifty times just to see if your if: condition works.

The first rule of a sane CI/CD strategy: Keep the logic in the code, not the config. Your CI should just call a script or a Makefile that can be run locally. If I can’t run the exact same build command on my laptop as the CI runner, your pipeline is broken.
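A minimal sketch of that pattern, with illustrative file and target names: the workflow is a thin shim, and everything interesting lives in a Makefile target you can also run on your laptop.

```yaml
# Hypothetical .github/workflows/ci.yml -- the workflow is just a shim
name: ci
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # All real logic lives in the Makefile; debugging happens locally
      # with `make ci`, not by pushing fifty "test: fix typo" commits.
      - run: make ci
```

When the pipeline breaks, you reproduce it with one command instead of archaeology in the Actions UI.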

The Docker Image Bloat Problem

I see people using ubuntu:latest as a base image for a Go binary. Why? You’re shipping 200MB of vulnerabilities and unused libraries. Then they switch to alpine because it’s small, only to realize that musl vs glibc causes weird DNS resolution bugs or performance hits in Python. Use debian-slim. It’s the middle ground that won’t break your heart.

# BAD: The "I don't care about layers" approach
FROM python:3.9
COPY . /app
RUN pip install -r /app/requirements.txt
WORKDIR /app
CMD ["python", "main.py"]

# GOOD: Optimized for caching and security
FROM python:3.9-slim-bullseye AS builder
RUN apt-get update && apt-get install -y --no-install-recommends gcc python3-dev
COPY requirements.txt .
RUN pip install --user -r requirements.txt

FROM python:3.9-slim-bullseye
WORKDIR /app
# Copy the installed packages somewhere the non-root user can actually read
COPY --from=builder --chown=1001:1001 /root/.local /home/app/.local
COPY . .
ENV HOME=/home/app
ENV PATH=/home/app/.local/bin:$PATH
USER 1001
CMD ["python", "main.py"]

In the “Good” example, we use a multi-stage build. We install the heavy build tools (gcc) in the first stage and throw them away. The final image is smaller and has a reduced attack surface. Also, notice USER 1001. Running as root in a container is the “best” way to ensure a container escape turns into a full cluster compromise. Don’t do it.

  • Layer Caching: Always copy your dependency files (package.json, requirements.txt) before your source code. Source code changes every minute; dependencies change every week. Don’t invalidate your cache for a comment change.
  • Immutability: Never, ever use :latest. Use the git SHA or a semantic version. When a deployment fails, you need to know exactly what code is running. latest is a moving target that makes rollbacks impossible.

Infrastructure as Code: State is the Enemy

Terraform is the industry standard, and I hate it as much as I love it. The biggest mistake people make is creating one giant “monolith” state file. They put their VPC, their RDS instances, their EKS cluster, and their S3 buckets in one folder. One day, you try to update a tag on an S3 bucket, Terraform gets a 403 error from the AWS API mid-apply, and your RDS instance ends up marked as “tainted.” Tainted means “destroy and recreate on the next apply.” Congratulations, the next careless approval deletes your database.

Blast Radius Reduction

You must split your state. Use terraform_remote_state data sources or Terragrunt to separate layers. Your VPC should be in its own state file. It changes once every six months. Your application-specific resources (like an SQS queue) should be in another.
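A sketch of the layered approach, assuming the VPC layer exposes a vpc_id output (bucket and key names are illustrative):

```hcl
# App layer reads the VPC layer's outputs instead of sharing its state file.
data "terraform_remote_state" "vpc" {
  backend = "s3"
  config = {
    bucket = "my-company-tfstate"
    key    = "network/vpc/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_security_group" "app" {
  # A mistake here can only hurt this security group, not the VPC itself.
  vpc_id = data.terraform_remote_state.vpc.outputs.vpc_id
}
```

An apply gone wrong in the app layer now has a blast radius of one security group, not your entire network.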

# The guardrail every stateful resource needs
resource "aws_db_instance" "prod_db" {
  allocated_storage = 100
  engine            = "postgres"
  # ...

  lifecycle {
    prevent_destroy = true # This is your only line of defense. Use it.
  }
}

If you don’t have prevent_destroy = true on your stateful resources, you are playing Russian Roulette with your career. I’ve seen a junior dev run terraform apply with a typo in a variable that triggered a replacement of a production database. The prevent_destroy flag would have caught that in the plan phase.

The “Apply” Trap

Never run terraform apply from your local machine against production. Use a runner (GitHub Actions, Terraform Cloud, Atlantis). Why? Because your local machine has a different version of the AWS CLI, a different version of Terraform, and your cat might jump on the keyboard mid-apply. You need a consistent environment and a clear audit log of who ran what and when.
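A hedged sketch of what that runner can look like in GitHub Actions: plan on every pull request, apply only on main, with a pinned Terraform version so every run uses the same binary (names and versions are illustrative):

```yaml
# Hypothetical .github/workflows/terraform.yml
name: terraform
on:
  pull_request:
  push:
    branches: [main]
jobs:
  terraform:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.7.5"  # pinned: no version drift between runs
      - run: terraform init -input=false
      - run: terraform plan -input=false -out=tfplan
      # Apply only from main, never from a PR branch. The Actions log is
      # your audit trail of who ran what and when.
      - if: github.ref == 'refs/heads/main'
        run: terraform apply -input=false tfplan
```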

  • Locking: Use S3 with DynamoDB for state locking. If two people run apply at the same time without locking, you can corrupt your state file. Recovering a corrupted .tfstate file is a form of torture prohibited by the Geneva Convention.
  • Variables: Stop hardcoding IDs. Use data blocks. If you hardcode subnet-0a1b2c3d, your code is useless in any other region or account.
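Both bullets in code, with illustrative names: an S3 backend with DynamoDB locking, and a data lookup replacing a hardcoded subnet ID.

```hcl
# Backend with locking: two concurrent applies now block instead of
# clobbering each other's state.
terraform {
  backend "s3" {
    bucket         = "my-company-tfstate"
    key            = "app/orders/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

# Look up the subnet by tag instead of hardcoding subnet-0a1b2c3d.
# This code now works in any region or account that tags its subnets.
data "aws_subnet" "private" {
  tags = {
    Tier = "private"
  }
}
```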

Observability: You’re Paging People for Nothing

Most “best practice” guides tell you to monitor everything. CPU, Memory, Disk, Network. This is wrong. Monitoring CPU is mostly useless for modern, auto-scaling applications. If my CPU is at 90% but my latency is 50ms and my error rate is 0%, I don’t care. I’m sleeping.

Stop paging your SREs for “High CPU.” Page them for “High Latency” or “Increased Error Rate.” These are Service Level Indicators (SLIs). Everything else is just debugging data.
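Here is what SLI-based paging can look like as a Prometheus rule file; the metric and label names are assumptions, adjust them to your own instrumentation:

```yaml
groups:
  - name: slo-alerts
    rules:
      - alert: HighErrorRate
        # Assumes a standard counter http_requests_total with a status label.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "5xx error rate above 1% for 5 minutes"
      # Notice what is absent: no alert on CPU. That belongs on a
      # dashboard you look at during an incident, not in PagerDuty.
```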

The Prometheus Cardinality Explosion

I once saw a Prometheus instance OOM-killed because a developer decided to add user_id as a label to a metric. We had 1 million users. That’s 1 million unique time series. Prometheus died. The whole monitoring system went dark because of one labels={"user_id": user.id} line.

# BAD: High cardinality
http_request_duration_seconds_bucket{method="GET", endpoint="/api/v1/user/12345"}

# GOOD: Low cardinality
http_request_duration_seconds_bucket{method="GET", endpoint="/api/v1/user/:id"}

Keep your labels bounded. If a label can have more than 100 possible values, it probably shouldn’t be a label. Use a logging system (like ELK or Loki) for high-cardinality data, not your metrics system.
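One way to enforce that bound at the instrumentation layer is to normalize paths before they ever become label values. A small sketch; the helper name is mine, not a library API:

```python
import re

# Collapse IDs in a request path so the "endpoint" label stays bounded
# no matter how many users exist.
def normalize_endpoint(path: str) -> str:
    # UUIDs first: a UUID that starts with digits would otherwise be
    # half-eaten by the numeric rule below.
    path = re.sub(
        r"/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
        r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}",
        "/:id",
        path,
    )
    # Then purely numeric path segments.
    path = re.sub(r"/\d+", "/:id", path)
    return path

print(normalize_endpoint("/api/v1/user/12345"))  # /api/v1/user/:id
```

Every route template now maps to exactly one label value, so your series count scales with the number of endpoints, not the number of users.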

The Golden Signals

If you’re starting from scratch, focus on the Four Golden Signals (from the Google SRE book, which is one of the few pieces of “hype” worth reading):

  1. Latency: The time it takes to service a request.
  2. Traffic: A measure of how much demand is being placed on your system.
  3. Errors: The rate of requests that fail.
  4. Saturation: How “full” your service is (e.g., thread pool limits).

Note to self: Dashboards are for looking at during an incident. Alerts are for waking you up. If an alert doesn’t require immediate action, it should be a weekly report, not a Slack notification.

Kubernetes: The Great Complexity Tax

Kubernetes is the default choice now, which is a tragedy. Most companies would be better off with a few well-configured systemd units on a plain VM, but here we are. If you must use K8s, you need to respect the resources block.

I’ve seen clusters where requests were not set. Without them, the scheduler has no idea how much room is left on a node. It packs 50 pods onto a node that can only handle 10. Then, during a traffic spike, the node hits 100% memory, the kernel starts OOM-killing processes, and usually, it kills something vital like kube-proxy or fluentd.

# A sane deployment spec
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      containers:
      - name: app
        image: my-repo/api:v1.2.3
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "1000m"
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 20

The “Gotcha”: CPU limits can actually slow down your app. Kubernetes uses CFS throttling to enforce CPU limits. If your app is multi-threaded, it might hit the limit in a few milliseconds and then get “throttled” for the rest of the period, leading to massive latency spikes. Many SREs (myself included) often set requests.cpu but leave limits.cpu unset, or set it very high, while always strictly limiting memory.
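In spec form, that throttling-averse variant looks like this (values illustrative):

```yaml
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"      # the scheduler still bin-packs correctly
  limits:
    memory: "512Mi"  # an OOM kill is deterministic; CFS throttling is not
    # No cpu limit: let the app burst instead of stalling for the
    # remainder of each 100ms CFS period.
```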

Secret Management: Stop Putting API Keys in Git

It’s 2024 and I still find STRIPE_API_KEY=sk_live_... in public GitHub repos. “But it’s a private repo!” It doesn’t matter. Your CI/CD system clones that repo. Every developer clones that repo. Your secrets are now on twenty different laptops.

Use a real secret manager. AWS Secrets Manager, HashiCorp Vault, or even encrypted secrets in your CI provider. If you are using Kubernetes, use something like External Secrets Operator to sync secrets from AWS/GCP into K8s Secrets.
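For reference, a hypothetical External Secrets Operator manifest; the store name and secret paths are placeholders for whatever your setup uses:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: stripe-credentials
spec:
  refreshInterval: 1h            # re-sync so rotations propagate
  secretStoreRef:
    name: aws-secrets-manager    # a ClusterSecretStore you've configured
    kind: ClusterSecretStore
  target:
    name: stripe-credentials     # the K8s Secret that gets created
  data:
    - secretKey: STRIPE_API_KEY
      remoteRef:
        key: prod/stripe/api-key # path in AWS Secrets Manager
```

The secret lives in one auditable place; Git only ever sees a pointer to it.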

Pro-tip: If you ever see a secret in a log file, stop what you are doing. Fix the logging configuration immediately. A leaked secret in a log aggregator (like Datadog or Splunk) is a nightmare to clean up because those logs are often immutable and replicated across regions.

The “Microservices” Tax

Everyone wants to be Netflix. But you don’t have Netflix’s problems. You have “I can’t find where the bug is” problems. When you split a monolith into 50 microservices, you haven’t removed complexity; you’ve just moved it to the network.

Now, instead of a function call, you have an HTTP request. That request can fail. It can timeout. It can be throttled. It can return a 502 because the load balancer is reconfiguring. If you don’t have Distributed Tracing (like Jaeger or Honeycomb), you are flying blind. You’ll see a 500 error in Service A, but the actual cause is a timeout in Service F, three hops away.

If you can’t explain why you need a microservice, you don’t need one. Build a “Modular Monolith” instead. It’s easier to deploy, easier to test, and significantly cheaper to run.

Security: The “Shift Left” Lie

The industry loves to say “Shift Left,” which is just a fancy way of saying “make developers do the security team’s job.” Developers are not security experts. If you just give them a list of 500 vulnerabilities from a Snyk scan, they will ignore all of them.

Instead of “shifting left,” provide Secure Defaults. Give them a base Docker image that is already hardened. Give them a Terraform module for an S3 bucket that has encryption and public access blocks enabled by default. Make the “right way” the “easy way.” If a developer has to go out of their way to make something insecure, they probably won’t do it.
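A sketch of such a module, with illustrative names: anyone who calls it gets encryption and public-access blocking whether they asked for them or not.

```hcl
variable "bucket_name" {
  type = string
}

resource "aws_s3_bucket" "this" {
  bucket = var.bucket_name
}

# Encryption at rest, no opt-in required.
resource "aws_s3_bucket_server_side_encryption_configuration" "this" {
  bucket = aws_s3_bucket.this.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

# Public access blocked by default; there is no variable to turn this off.
resource "aws_s3_bucket_public_access_block" "this" {
  bucket                  = aws_s3_bucket.this.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```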

The Real World: A “Gotcha” Only Experience Teaches

Here is something they don’t tell you in the AWS Certified Solutions Architect exam: DNS caching will ruin your life.

You have a service at api.stripe.com. Your application resolves that to an IP. Some runtimes cache that DNS resolution *forever* (looking at you, Java when a security manager is installed). If Stripe changes their edge IP addresses for maintenance, your app will keep trying to talk to the old, dead IP. You’ll see “Connection Timeout” errors, your health checks will fail, and your service will restart.

Always check your TTL (Time To Live) settings. In Java, set the networkaddress.cache.ttl security property in the java.security file; it is not an ordinary system property. In Kubernetes, CoreDNS is the default resolver; consider NodeLocal DNSCache (or a local nscd on plain VMs) to keep DNS traffic from overwhelming the cluster’s DNS service.
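For Java specifically, the fix is two security properties; a value of 30 to 60 seconds is a common starting point:

```properties
# In $JAVA_HOME/conf/security/java.security, or set at startup via
# java.security.Security.setProperty() before any lookup happens.
networkaddress.cache.ttl=60
# Also bound negative caching, or a transient NXDOMAIN haunts you:
networkaddress.cache.negative.ttl=10
```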

The Wrap-up

DevOps isn’t about the latest tool or the most complex pipeline. It’s about building systems that are boring. Boring systems don’t break at 3 AM. Boring systems have clear logs, predictable scaling patterns, and simple deployment paths. Stop chasing the “best practice” of the week and start focusing on the fundamentals: idempotency, observability, and blast-radius reduction. If your automation makes a mistake, it should only break a small part of your world, and it should tell you exactly why it did it. Everything else is just hype.

Go delete a Jenkins job today. You’ll feel better.
