10 Docker Best Practices to Optimize Your Containers

It is 05:42 AM. I have consumed four double-espressos, two cold slices of pepperoni pizza, and enough adrenaline to stop a rhino’s heart. The production cluster is finally stable, no thanks to the “optimization” PR merged by a junior developer who thought they knew better than the last ten years of containerization history.

I’m writing this because if I see one more Dockerfile that starts with FROM ubuntu:latest, I am going to decommission my own biological life support system. This isn’t a “thought leadership” piece. This is a survival guide written in the blood of my sleep cycle. We are going to talk about docker best practices, not because they make the YAML look pretty, but because they prevent the 3:00 AM phone call that makes me want to throw my pager into a woodchipper.

1. THE INCIDENT REPORT: THE NIGHT THE REGISTRY DIED

[2024-05-14 03:02:11] PagerDuty triggers. CRITICAL: KubeNodeDiskPressure on the us-east-1 production-a cluster.
[2024-05-14 03:05:45] CRITICAL: KubePodNotReady. The api-gateway service is failing readiness probes.
[2024-05-14 03:10:22] I’m online. kubectl get nodes shows three nodes in NotReady state. df -h on the nodes shows /var/lib/docker is at 100% capacity.
[2024-05-14 03:15:00] Attempting to purge unused images with docker image prune -a. It’s taking forever because the I/O wait is through the roof.
[2024-05-14 03:22:18] I find the culprit. A new deployment of the user-profile-service. The image size is 4.2GB. For a Python microservice.
[2024-05-14 03:30:45] The Horizontal Pod Autoscaler (HPA) is trying to spin up more pods to handle the failing requests. Each pod pull is 4.2GB. The internal container registry is now screaming under a 40Gbps egress load.
[2024-05-14 03:45:12] Registry enters a crash loop. All deployments across the entire company are now paralyzed.
[2024-05-14 04:00:00] I manually kill the deployment and roll back to the previous version (which was 220MB).
[2024-05-14 04:30:00] Nodes are still stuck. The overlay2 storage driver is struggling to clean up the massive layers. I have to manually rm -rf /var/lib/docker/overlay2/* on three nodes and rejoin them to the cluster.
[2024-05-14 05:30:00] Traffic normalizes. The “optimization” was a junior dev adding apt-get install -y cuda-toolkit “just in case we need AI features later” and forgetting to clean the cache.


2. THE AUTOPSY: ANATOMY OF A DISASTER

I ran a docker history --no-trunc on that 4.2GB monstrosity. Look at this. Look at it and weep.

IMAGE          CREATED          CREATED BY                                                                                                                                                                                                                                                                                                                                                                                          SIZE      COMMENT
<missing>      3 hours ago      COPY . . # buildkit                                                                                                                                                                                                                                                                                                                                                                                 1.8GB     
<missing>      3 hours ago      RUN /bin/sh -c apt-get update && apt-get install -y python3-pip git vim curl wget build-essential cmake cuda-toolkit-12-1 # buildkit                                                                                                                                                                                                                                                                2.1GB     
<missing>      3 hours ago      RUN /bin/sh -c pip install -r requirements.txt # buildkit                                                                                                                                                                                                                                                                                                                                           300MB     
<missing>      4 weeks ago      /bin/sh -c #(nop)  CMD ["/bin/bash"]                                                                                                                                                                                                                                                                                                                                                                0B        
<missing>      4 weeks ago      /bin/sh -c #(nop) ADD file:702967679848520898520984520984520984520984520984520984520984520 in /                                                                                                                                                                                                                                                                                                    77.8MB    

The COPY . . command pulled in the entire .git directory (800MB), the local venv (600MB), and a bunch of raw .csv test data (400MB) because the developer didn’t know what a .dockerignore file was. The apt-get layer is a crime against humanity. They didn’t use --no-install-recommends, and they didn’t clean up /var/lib/apt/lists/.

This is why we can’t have nice things.


3. THE COMMANDMENTS OF SANITY

STOP USING ‘LATEST’ BEFORE YOU KILL US ALL

If you use FROM python:latest or FROM node:latest, you are playing Russian Roulette with the production environment. “Latest” is a moving target. When the Python maintainers decide to switch the base image from Debian Bullseye to Bookworm, or change the default OpenSSL version, your build is going to break. Or worse, it’s going to build successfully but fail at runtime with a GLIBC_2.34 not found error.

You must use specific, immutable tags. Not just python:3.11. Use python:3.11.9-slim-bookworm. This tells me exactly what the runtime environment is, what the underlying OS is, and it ensures that when I rebuild this image in six months, I get the same result. This is the first rule of docker best practices: reproducibility is not optional.
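If you want to be properly paranoid, pin the content digest as well as the tag. A sketch — the digest shown is a placeholder, not a real one; take yours from the `docker pull` output or your registry:

```dockerfile
# Good: a specific, immutable-ish tag.
FROM python:3.11.9-slim-bookworm

# Paranoid (better): pin the content digest too, so a re-pushed tag
# can't silently change what you build against.
# The digest below is a PLACEHOLDER -- substitute the real one.
# FROM python:3.11.9-slim-bookworm@sha256:<digest-from-your-registry>
```

A tag can be re-pushed; a digest cannot. Digest pinning trades convenience for a hard guarantee.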

YOUR LAYERS ARE FAT AND YOU SHOULD FEEL BAD

Docker images are a stack of read-only layers. Every RUN, COPY, and ADD instruction creates a new layer. If you do this:

RUN apt-get update
RUN apt-get install -y heavy-package
RUN rm -rf /var/lib/apt/lists/*

You have failed. The heavy-package is still there in the second layer. Deleting it in the third layer doesn’t remove it from the image; it just hides it in the top layer’s view. The bits are still being pushed to the registry.

You must chain your commands:

RUN apt-get update && apt-get install -y \
    --no-install-recommends \
    heavy-package \
    && rm -rf /var/lib/apt/lists/*

And for the love of everything holy, use multi-stage builds. Your runtime image doesn’t need gcc, make, git, or the headers for every library you compiled. Build your wheels in a builder stage, then copy only the artifacts to a slim or distroless final stage.

ROOT IS FOR GOD AND IDIOTS, AND YOU AREN’T GOD

Running your application as root inside a container is a firing offense in my book. If there is a container escape vulnerability (and there will be), you’ve just handed the attacker a golden ticket to the host kernel.

Modern namespaces and cgroups v2 provide isolation, but they aren’t magic. If your process is UID 0, and it manages to break out via a Shocker-style exploit or a misconfigured volume mount of /var/run/docker.sock, the game is over.

Always create a system user and switch to it:

RUN groupadd -g 10001 appuser && \
    useradd -u 10000 -g appuser appuser
USER 10000

Notice I used explicit UIDs. This prevents collisions with the host system and makes it easier to manage file permissions on persistent volumes.

SIGNAL PROPAGATION OR: WHY YOUR APP TAKES 30 SECONDS TO DIE

When Kubernetes sends a SIGTERM to your pod, it’s politely asking your app to finish its current request, close database connections, and exit. If your app is running as a sub-process of a shell script, it will never see that signal.

If you use CMD my-app.sh, Docker runs it as /bin/sh -c my-app.sh. The shell (PID 1) does not forward signals to its children. When the 30-second terminationGracePeriodSeconds expires, the kernel sends a SIGKILL, which is like pulling the power plug. This leads to corrupted state and broken transactions.

Use the “exec form”: ENTRYPOINT ["/usr/bin/my-app"]. This runs your app as PID 1. If your app can’t handle being PID 1 (e.g., it doesn’t reap zombie processes), use tini as your entrypoint. It’s a tiny init binary designed exactly for this.
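Here is roughly what a PID-1-friendly app looks like — a sketch in Python, assuming your service loop can poll a shutdown flag (the names and the simulated kill are illustrative, not from the incident):

```python
import os
import signal
import time

shutting_down = False

def handle_sigterm(signum, frame):
    # Kubernetes (or `docker stop`) sent SIGTERM: note it and drain,
    # instead of dying mid-transaction.
    global shutting_down
    shutting_down = True

# Register the handler. This only works if your process actually
# RECEIVES the signal -- i.e., exec form, not buried under /bin/sh.
signal.signal(signal.SIGTERM, handle_sigterm)

# Simulate the orchestrator sending SIGTERM to us (PID 1 in a container).
os.kill(os.getpid(), signal.SIGTERM)
time.sleep(0.1)  # give the handler a moment to run

if shutting_down:
    # Real code: stop accepting requests, close DB connections, exit 0.
    print("draining connections, then exiting cleanly")
```

If your app genuinely can’t do this, that’s what tini is for.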

THE .DOCKERIGNORE FILE IS NOT A SUGGESTION

When you run docker build ., the first thing the CLI does is send the “build context” to the Docker daemon. If you have a 2GB node_modules folder or a .git directory with five years of history, you are sending gigabytes of useless data over the socket before the build even starts.

A proper .dockerignore is a docker best practice that saves hours of CI/CD time. It should include:
.git
**/node_modules
**/__pycache__
*.log
.env (don’t you dare bake secrets into your image)
tmp/
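As an actual copy-pasteable .dockerignore, minus the editorializing:

```
.git
**/node_modules
**/__pycache__
*.log
.env
tmp/
```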

BUILDKIT IS NOT A SUGGESTION, IT IS THE LAW

If you aren’t using DOCKER_BUILDKIT=1, you are living in the stone age. Modern BuildKit allows for COPY --link, which is a game-changer.

Normally, if you change a file in an early layer, every subsequent layer must be rebuilt because the hash of the underlying filesystem changed. COPY --link puts files into a separate snapshot and merges them into the final image without depending on the previous layers’ state. It means if you change your app code, but your dependencies haven’t changed, Docker can just swap the app layer without re-calculating anything.

Also, use RUN --mount=type=cache. This allows you to persist your pip cache or npm cache between builds without bloating the final image. It’s the difference between a 10-minute build and a 30-second build.


4. THE MECHANICS OF FAILURE: CGROUPS AND KERNEL PANIC

Let’s talk about why that 4.2GB image actually killed the nodes. It wasn’t just the disk space. It was the memory overhead and the way the Linux kernel handles overlay2.

When you pull a massive image, the dockerd process has to decompress those layers. This consumes CPU and memory. On a node already under pressure, this triggers the OOM (Out Of Memory) killer. But the OOM killer is a blunt instrument. It might kill the kubelet instead of the offending docker pull process.

Furthermore, we are running on cgroups v2. When you don’t set memory limits on your containers, or when your runtime (like Java 8 or older Node versions) isn’t cgroup-aware, the process sees the total host memory. It tries to allocate a heap based on 128GB of RAM while the container is actually restricted to 4GB by the orchestrator. The result? A constant cycle of CrashLoopBackOff.

The junior dev’s image also shipped with no meaningful health signal. (Kubernetes ignores the Dockerfile HEALTHCHECK instruction entirely; what matters is the pod’s probe configuration.) The deployment was relying on a bare TCP socket readiness check, which passed because the container was “running,” even though the application inside was stuck in a python-magic library dependency hell caused by the bloated apt-get install.
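In Kubernetes, the fix is an HTTP readiness probe that exercises the actual application, not just the socket. A sketch — the port and the /healthz path are assumptions about your service, not anything from this incident:

```yaml
# Pod spec fragment. Adjust path/port to whatever your app really exposes.
readinessProbe:
  httpGet:
    path: /healthz
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
```

A TCP check proves a process is listening. An HTTP check proves the application can still do its job. Those are very different claims at 3 AM.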


5. THE “FIX IT” DIFF

Here is the “Before” (The Crime) and the “After” (The Sanity).

THE CRIME (Before)

# This is garbage. Do not do this.
FROM python:latest

# No .dockerignore, so this pulls in 2GB of junk
COPY . /app
WORKDIR /app

# Massive layer, no cleanup, installs unnecessary compilers
RUN apt-get update && apt-get install -y git build-essential vim cmake cuda-toolkit-12-1
RUN pip install -r requirements.txt

# Runs as root. Dangerous.
# Uses shell form, so SIGTERM is ignored.
CMD python main.py

THE SANITY (After)

# Use a specific, small base image. 
# python:3.11-slim-bookworm is ~120MB vs 1GB+ for the full image.
FROM python:3.11.9-slim-bookworm AS builder

# Set environment variables for Python.
# (No PIP_NO_CACHE_DIR here: that would defeat the cache mount below.)
ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1

WORKDIR /build

# Install build dependencies in the builder stage only
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    python3-dev \
    && rm -rf /var/lib/apt/lists/*

# Use a cache mount for pip to speed up builds
COPY requirements.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip wheel --no-deps --wheel-dir /build/wheels -r requirements.txt

# --- Final Stage ---
FROM python:3.11.9-slim-bookworm

# Create a non-privileged user
RUN groupadd -g 10001 appgroup && \
    useradd -u 10000 -g appgroup -m -s /bin/bash appuser

WORKDIR /app

# Copy only the wheels from the builder
COPY --from=builder /build/wheels /wheels
RUN pip install --no-cache-dir /wheels/*

# Use --link to optimize cache invalidation
COPY --link --chown=10000:10001 . .

# Use tini to handle signals correctly.
# This is a docker best practice for signal propagation.
# (Install it while we are still root -- apt-get fails after USER.)
RUN apt-get update && apt-get install -y --no-install-recommends tini \
    && rm -rf /var/lib/apt/lists/*

# Switch to non-root user
USER 10000

ENTRYPOINT ["/usr/bin/tini", "--"]

# Exec form of CMD
CMD ["python", "main.py"]

6. WHY WE USE MULTI-STAGE BUILDS (THE DEEP DIVE)

I need you to understand the “why” here. In the “After” Dockerfile, the builder stage is where all the mess happens. We install gcc, we compile C-extensions for Python libraries, and we create a bunch of temporary files.

When we transition to the second FROM instruction, we start with a fresh, clean slate. The gcc compiler, the headers, and the apt cache are all left behind in the first stage. We only COPY the pre-compiled wheels. This is how you get an image from 4GB down to 150MB.

Smaller images mean:
1. Faster Pulls: During a scale-up event, 150MB pulls in seconds. 4GB pulls in minutes (or fails).
2. Reduced Attack Surface: If a hacker gets into your container, they won’t find git, gcc, or curl to help them move laterally through your network.
3. Lower Costs: We pay for S3 storage for the registry and for data transfer. Bloated images are literally burning company money.


7. THE FINAL WORD ON CACHING

Docker’s layer caching is based on the order of operations. If you COPY . . at the top of your Dockerfile, any change to any file in your repository invalidates the cache for every single line below it.

That is why we copy requirements.txt (or package.json) first, install the dependencies, and then copy the rest of the source code. Dependencies change much less frequently than code. By separating them, 90% of your builds will use the cached dependency layer, making your CI/CD pipeline actually usable instead of a bottleneck.

I am going to sleep now. If I wake up and see a FROM ubuntu:latest in the PR queue, I am revoking everyone’s sudo access and we are going back to deploying via FTP on shared hosting.

Do better. For the sake of my sanity and the uptime of this company.

Signed,
The SRE who has seen too many layers.
