Docker Best Practices: Build Faster and More Secure Images

The smell of burnt coffee is the only thing keeping my eyes open. It’s 04:15 AM. I’ve been staring at a Grafana dashboard that looks like a heart monitor for a patient in active cardiac arrest for the last three days. My PagerDuty alert didn’t just beep; it screamed. It screamed because some “Full Stack Rockstar” decided that pinning versions was for cowards and that resource limits were “suggestions.”

I’m tired. I’m cynical. I’ve spent seventy-two hours cleaning up a mess that could have been avoided if anyone in the engineering department had read a single page of documentation written after 2014. We’re running Docker Engine v24.0.7 on Debian Bullseye nodes, and yet, I’m seeing patterns that belong in a hobbyist’s “Hello World” project.

Here is the post-mortem. Read it. Internalize it. Or get out of my cluster.

Timeline of the Failure

  • 2024-05-12 03:00:01 UTC: node-04 reports 98% memory utilization. systemd-oomd begins monitoring.
  • 2024-05-12 03:02:15 UTC: Kernel OOM Killer invokes out_of_memory. Victim: java_app_container.
  • 2024-05-12 03:02:20 UTC: Scheduler attempts to restart container on node-05.
  • 2024-05-12 03:05:40 UTC: node-05 network interface saturates (10Gbps) attempting to pull image:latest.
  • 2024-05-12 03:10:12 UTC: Cascading failure. Five nodes are stuck in ImagePullBackOff because the 4.2GB image layers are thrashing the overlay2 storage driver.
  • 2024-05-12 03:15:00 UTC: I am woken up by the sound of my career dying.

1. The Incident: 03:00 AM and the OOM Killer is Hungry

The logs don’t lie. When I finally got a shell into node-04, dmesg was a graveyard.

[10842.123456] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/kubepods/besteffort/pod123,task_memcg=/kubepods/besteffort/pod123,task=java,pid=12345,uid=0
[10842.123500] Memory cgroup out of memory: Killed process 12345 (java) total-vm:8421024kB, anon-rss:4194304kB, file-rss:0kB, shmem-rss:0kB

You see that uid=0? We’ll get to that. But look at the anon-rss. 4GB. The container had no limits set in the docker-compose.yml or the manifest. It just kept eating. In the world of Linux namespaces and cgroups, a container without a limit is a suicide pact. The kernel doesn’t care about your “critical business logic.” It sees a process hogging pages, and it executes it.

The “fix” isn’t just adding a mem_limit. It’s understanding how the JVM interacts with cgroups. On an older JVM without -XX:+UseContainerSupport (it’s on by default since JDK 10 and 8u191), the JVM sizes its heap from the host’s total memory, not the container’s limit. You’re lying to your app, and the kernel is the debt collector.
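To make that concrete, here is a minimal sketch of a container-aware Java image. The base image, jar path, and the 75% heap fraction are illustrative choices, not gospel; the flag that actually needs thought on modern JDKs is MaxRAMPercentage.

```dockerfile
# Hypothetical Java service image. With container support, the JVM
# sizes its heap from the cgroup memory limit, not the host's RAM.
FROM eclipse-temurin:17-jre-jammy
COPY app.jar /app/app.jar
# MaxRAMPercentage caps the heap at 75% of the container limit,
# leaving headroom for metaspace, thread stacks, and native buffers.
ENTRYPOINT ["java", \
            "-XX:+UseContainerSupport", \
            "-XX:MaxRAMPercentage=75.0", \
            "-jar", "/app/app.jar"]
```

Pair that with a hard memory limit on the container itself, and the OOM Killer stops being part of your deployment pipeline.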

2. The Root Cause: 4GB Images and the “It Works on My Machine” Fallacy

I pulled the image that caused the saturation. docker inspect revealed a horror show.

"RootFS": {
    "Type": "layers",
    "Layers": [
        "sha256:7297da16...", 
        "sha256:8527027...",
        "sha256:..." // 42 more layers of pure incompetence
    ]
}

The developer used ubuntu:latest as a base. Then they ran apt-get update and apt-get install in six different RUN commands. They left the apt cache in the image. They included the entire build-time toolchain—GCC, Python, Go, and probably a copy of the Oxford English Dictionary—in the production runtime image.

This is the “Layer Cake of Lies.” Every RUN command creates a new layer on the overlay2 filesystem. If you delete a file in a subsequent layer, it’s not gone; it’s just hidden by a “whiteout” file. The bits are still there, taking up space, slowing down docker pull, and increasing the attack surface.

When the cluster tried to recover, it had to move 4.2GB across the wire for every single pod. Our internal registry’s disk I/O spiked so hard the metadata database started throwing 500s. We weren’t just down; we were dead-locked.

3. Layer Optimization: Stop Rebuilding the World on Every Git Commit

I looked at the Dockerfile. It was a masterpiece of inefficiency.

The “Broken” Way (The Junior Dev Special)

FROM ubuntu:latest
# No version pinning. Good luck in six months.
RUN apt-get update
RUN apt-get install -y nodejs npm
WORKDIR /app
COPY . /app
# This COPY is the kiss of death. 
# Any change to a README.md invalidates the cache for everything below.
RUN npm install
CMD ["node", "server.js"]

Every time a dev changed a single line of CSS, the RUN npm install would trigger. That’s 500MB of node_modules being downloaded and compressed into a new layer. Every. Single. Time.

The “SRE Way” (Hardened Multi-stage)

To ensure long-term stability, we use multi-stage builds and specific base images: Alpine 3.19 or Debian Bookworm-slim. We pin versions. We respect the cache.

# Stage 1: Build
FROM node:20.11.0-bookworm-slim AS builder
WORKDIR /app
# Only copy files needed for dependency resolution
COPY package.json package-lock.json ./
RUN npm ci --omit=dev

# Stage 2: Runtime
FROM node:20.11.0-bookworm-slim
RUN apt-get update && apt-get install -y --no-install-recommends \
    dumb-init=1.2.5-2 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
# Copy only the artifacts from the builder
COPY --from=builder /app/node_modules ./node_modules
COPY . .

# Security: Don't run as root
USER node
ENTRYPOINT ["/usr/bin/dumb-init", "--"]
CMD ["node", "server.js"]

By separating the build environment from the runtime environment, we dropped the image size from 4.2GB to 180MB. We used npm ci for deterministic builds. We cleaned up the apt lists in the same RUN command to prevent layer bloat. This isn’t just “optimization”; it’s survival.

4. The Security Gap: Why You’re Running as Root and Why I’m Revoking Your SSH Access

During the incident, I noticed something even more disturbing. One of the compromised containers had a shell history that wasn’t mine. Because the container was running as root (the default in Docker if you’re lazy), a simple remote code execution (RCE) vulnerability in the web app gave the attacker a root shell inside the container.

From there, they checked for the existence of /var/run/docker.sock. And guess what? Some “DevOps Engineer” had mounted it so the container could “manage other containers.”

Mounting the Docker socket is equivalent to giving the container sudo access to the host without a password. It’s a container escape waiting to happen. An attacker can just run docker run -v /:/host -it ubuntu:latest chroot /host and they own the entire node. They own the kernel. They own the data.

We use seccomp profiles and AppArmor. We restrict syscalls. Does your app really need mount(), ptrace(), or kexec_load()? No? Then why are they available to your container? Docker Engine v24 ships with a default seccomp profile, but it’s broad. Best practice is to profile the app and drop every capability except NET_BIND_SERVICE.
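As a sketch of that capability diet in Compose — the service name, image, and profile filename are placeholders:

```yaml
services:
  web:
    image: our-registry.io/web:v1.0.0     # placeholder
    cap_drop:
      - ALL                               # start from zero
    cap_add:
      - NET_BIND_SERVICE                  # the one thing the app actually needs
    security_opt:
      - seccomp=./web-seccomp.json        # app-specific profile, not the default
```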

And for the love of all that is holy, stop using latest. latest is not a version. It’s a moving target. It’s a roll of the dice. When you pull latest, you have no idea what code is actually running. You can’t roll back, because latest today is different from latest yesterday. Use SHA256 digests if you’re serious, or at least specific SemVer tags.
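Pinning by digest looks like this — the digest below is a zeroed-out placeholder; fetch the real one with docker images --digests or from your registry:

```dockerfile
# Tag for humans, digest for the machine. The tag can move; the digest cannot.
FROM node:20.11.0-bookworm-slim@sha256:0000000000000000000000000000000000000000000000000000000000000000
```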

5. Signal Handling: Why Your App Won’t Shut Down Gracefully (PID 1 Blues)

During the meltdown, I tried to restart the services. docker stop took 10 seconds for every container. Why? Because your app is running as PID 1 and it’s deaf to signals.

When you use the “shell form” of CMD or ENTRYPOINT:
CMD node server.js

Docker executes it as /bin/sh -c "node server.js". The shell starts as PID 1. When Docker sends a SIGTERM to the container, it goes to the shell. The shell, being a stubborn piece of 1970s technology, does not forward that signal to the child process (node). The app keeps running, unaware that the reaper is coming. After 10 seconds, Docker loses patience and sends SIGKILL.

SIGKILL is the nuclear option. It doesn’t allow the app to close database connections, flush buffers, or finish processing a request. It just kills the process on the spot. This leads to database corruption and “zombie” records.

The fix is the “exec form”:
CMD ["node", "server.js"]

Or better yet, use dumb-init or tini. These are tiny init systems designed to run as PID 1, reap zombie processes, and correctly forward signals. If your app doesn’t shut down in under a second, you’ve failed.
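If you can’t rebuild every image today, Compose can inject tini as PID 1 for you — the equivalent of docker run --init. Service and image names here are placeholders:

```yaml
services:
  worker:
    image: our-registry.io/worker:v1.0.0  # placeholder
    init: true               # run tini as PID 1; it forwards signals and reaps zombies
    stop_grace_period: 10s   # window between SIGTERM and SIGKILL; make your app beat it
```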

6. The Docker Best Practices Manifesto: A Checklist for People Who Don’t Want to Get Fired

I’m going to go get another coffee. While I’m gone, you’re going to rewrite your docker-compose.yml and your Dockerfiles. If I see another 4GB image or a container running as root, I’m revoking your production access and moving you to the documentation team.

Here is the docker-compose.yml that I expect to see. It follows the Compose Specification (the legacy version: '3.8' key is optional now) and defines strict resource constraints.

services:
  api-gateway:
    image: our-registry.io/api-gateway:v2.4.1@sha256:a1b2c3d4...
    deploy:
      resources:
        limits:
          cpus: '0.50'
          memory: 512M
        reservations:
          cpus: '0.25'
          memory: 256M
    restart: on-failure:3
    security_opt:
      - no-new-privileges:true
    read_only: true
    tmpfs:
      - /tmp
      - /run
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

The Checklist

  1. No Root Users: Always define a USER in your Dockerfile. If you need to bind to a port below 1024, use CAP_NET_BIND_SERVICE, not root.
  2. Explicit Resource Limits: If you don’t set a memory limit, you are volunteering to be the first person I call at 3 AM.
  3. Multi-Stage Builds: If your production image contains a compiler, you’ve done it wrong.
  4. Small Base Images: Use Alpine or distroless. If you must run Debian, use the -slim variant and clean up your mess.
  5. Immutable Tags: latest is a crime. Pin your versions. Pin your SHA hashes.
  6. Read-Only Filesystems: Use read_only: true in your compose file. If your app needs to write to a temp directory, use a tmpfs mount. This kills 90% of automated exploit payloads.
  7. Signal Handling: Use exec form for CMD. Use an init process like tini.
  8. The .dockerignore File: Stop sending your .git folder and node_modules to the Docker daemon. It slows down the build and leaks secrets.
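For item 8, a minimal .dockerignore sketch — tune the list to your repo, but .git and .env are non-negotiable:

```
# .dockerignore — keep the build context small and secret-free
.git
.env
node_modules
npm-debug.log
Dockerfile
docker-compose.yml
*.log
```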

I’ve spent a decade in the Linux terminal. I’ve seen ext4 filesystems dissolve like cotton candy in the rain. I’ve seen iptables rules that would make a cryptographer weep. Docker is a tool, not a magic wand. If you treat it like a “black box” where you can just throw your garbage code and expect it to run forever, you are the problem.

The cluster didn’t scream because of a bug in Docker. It screamed because of you. It screamed because it was bloated, insecure, and unmanaged.

Go fix your images. I’m going to sleep. If my pager goes off again because of an OOM event, I’m not fixing the cluster—I’m fixing the hiring process.

Deep Dive: The Overlay2 Storage Driver and Why It Hates You

Since you’re still here, let’s talk about why your 4GB image actually breaks the disk. Docker uses the overlay2 driver. It works by stacking “lower” directories and a single “upper” directory. When you write a file, it uses “copy-on-write” (CoW).

If you have a 1GB log file in a lower layer and you run chmod 777 on it in your Dockerfile, overlay2 has to copy that entire 1GB file to the new layer just to change the permission bits. You now have 2GB of disk usage for a 1GB file. This is why we chain commands:

# WRONG
RUN wget http://example.com/huge-file.tar.gz
RUN tar -xvf huge-file.tar.gz
RUN rm huge-file.tar.gz

# RIGHT
RUN wget http://example.com/huge-file.tar.gz && \
    tar -xvf huge-file.tar.gz && \
    rm huge-file.tar.gz

In the “RIGHT” example, the file is downloaded, extracted, and deleted in a single layer. The overlay2 driver never has to commit the .tar.gz to disk permanently. This is Docker best practice 101, and yet I see people failing it every single day.

And don’t get me started on storage-opts. We’re running overlay2 on a 6.1 LTS kernel, so there’s no need for legacy flags like overlay2.override_kernel_check — that one only ever existed to bypass the driver’s version check on ancient kernels. What we have done is tune the xfs backing store for the /var/lib/docker partition with pquota to prevent a single container from filling the entire host’s disk with logs. Did you know you could do that? No, you were too busy adding another RUN command to your Dockerfile.
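For reference, here is roughly what that looks like in /etc/docker/daemon.json. The 20G cap is an illustrative number, and overlay2.size only works when /var/lib/docker sits on xfs mounted with pquota:

```json
{
  "storage-driver": "overlay2",
  "storage-opts": [
    "overlay2.size=20G"
  ],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
```

The log-opts block applies host-wide the same rotation the Compose file sets per service.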

The Final Word on Seccomp and Syscalls

If you really want to impress me, stop looking at Docker as a way to “package apps” and start looking at it as a way to “sandbox processes.”

A container is just a process with a fancy hat. It still talks to the same kernel. Every time your app makes a syscall—open(), read(), write(), socket()—it’s an opportunity for something to go wrong. By using a custom seccomp profile, you can restrict the process so it can only do what it’s supposed to do.

If your Node.js app starts trying to call execve() to run a shell script it just downloaded into /tmp, a good seccomp profile will kill the process instantly. That’s the difference between a “security incident” and a “blocked syscall” log entry.
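For shape only, here is a heavily truncated seccomp profile sketch. A real Node.js app needs far more syscalls than this allow-list (generate it by tracing the app under load, e.g. with strace), so treat every name here as illustrative. SCMP_ACT_KILL_PROCESS kills the offender outright; SCMP_ACT_ERRNO would merely fail the call with an error:

```json
{
  "defaultAction": "SCMP_ACT_KILL_PROCESS",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": [
        "read", "write", "openat", "close", "fstat",
        "mmap", "munmap", "brk", "futex", "epoll_wait",
        "accept4", "socket", "bind", "listen", "exit_group"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```

Load it with security_opt: seccomp=./profile.json in Compose, or --security-opt seccomp=profile.json on docker run.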

I’m done. The sun is coming up. The cluster is stable, for now. Don’t make me come back here. Fix your habits. Respect the cgroups. And never, ever use latest again.
