# Master Docker Compose: Simplify Multi-Container Workflows

```text
[2024-05-24T03:14:22.891Z] ERROR: worker-node-04 kernel: [192834.12] Out of memory: Killed process 28491 (python3) total-vm:4.2GB, anon-rss:3.8GB, file-rss:0B, shmem-rss:0B, uid:1000 pgtables:8420kB oom_score_adj:0
[2024-05-24T03:14:23.002Z] CRITICAL: container_id=f3a2b1c0d9e8 exited with code 137.
[2024-05-24T03:14:23.450Z] DEBUG: Attempting manual restart of service 'api-gateway'...
[2024-05-24T03:14:23.501Z] ERROR: docker: Error response from daemon: driver failed programming external connectivity on endpoint api-gateway (hash): Bind for 0.0.0.0:8080 failed: port is already allocated.
[2024-05-24T03:14:24.110Z] FATAL: Network collision detected on bridge br-f92a11c82. Context propagation failed. Race condition in manual container deployment.
```

It is 04:30 AM on a Sunday. I have been staring at a flickering terminal for forty-eight hours because one of you "cowboy" developers decided that writing a declarative configuration was too much work for a "quick fix" on Friday afternoon. You thought running a raw `docker run` command directly on the staging host was a shortcut. Instead, you created a cascading failure that triggered an OOMKill, wiped out the ephemeral storage of three sidecars, and left me untangling a web of orphaned network namespaces that looked like a bowl of digital spaghetti.

I am done. My patience has evaporated along with my weekend. This isn't a suggestion. This isn't a "best practice" shared over a latte. This is a mandatory shift in how we operate. As of Docker Engine 26.1.4, we are officially banning the use of raw `docker` commands for anything other than `ps` or `logs`. If I catch another person manually mapping ports or injecting environment variables via the CLI, I will revoke your SSH access faster than the kernel can reclaim leaked memory.

We are moving to `docker compose` v2.27.0. Exclusively. Here is why your manual workflow is a liability and why this manifesto is the only thing standing between you and a formal performance review.

## The Myth of the "One-Liner" and the Death of Reproducibility

You think you’re being fast. You type `docker run -d -p 8080:80 --name my-app my-image:latest` and walk away. But you’ve just committed a crime against reproducibility. Where is the record of that command? It’s buried in your shell history, which will be purged. It’s not in version control. It’s not peer-reviewed.

When that container crashed at 3 AM, I had no idea what environment variables you passed to it. I didn't know if you set memory limits (you didn't, which is why the OOMKiller nuked the entire node). I didn't know which network you attached it to. Using `docker compose` forces you to define the state of the world in a YAML file that lives in Git. It turns your "tribal knowledge" into a documented, executable reality.

Without a `docker-compose.yml`, we are guessing. And in SRE, guessing is just a slow way of failing.
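For contrast, here is that same one-liner as a reviewable artifact. A minimal sketch (the memory limit value is illustrative; the image and port mapping come from the command above):

```yaml
# docker-compose.yml -- the `docker run` one-liner, now in version control
services:
  my-app:
    image: my-image:latest
    ports:
      - "8080:80"
    deploy:
      resources:
        limits:
          memory: 512M   # the limit the original command never set
```

Now `docker compose up -d` is the entire deployment, and `git log` is the audit trail.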

## Why Your Manual Port Mapping is a Security Nightmare

Last night, the "Friday Night Massacre" happened because two of you tried to run different versions of the same microservice on the same host using manual `docker run` commands. One of you mapped `8080:80`. The other tried `8080:80` and failed, so you "cleverly" changed it to `8081:80`. 

You didn't realize that the application logic was hardcoded to look for a specific port on the bridge gateway. By bypassing `docker compose`, you bypassed the internal DNS resolution that `docker compose` provides by default. In a `docker compose` environment, services talk to each other by service name over an isolated virtual bridge. You don't need to expose ports to the host interface at all unless it's the actual ingress point.

When you use raw commands, you end up exposing internal databases to `0.0.0.0` just so your worker container can find them. You’re opening the front door to the entire internet because you’re too lazy to define a network namespace.
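Concretely, the shape we want looks like this (service and network names are illustrative): only the ingress publishes a host port, and everything else resolves by service name on a private network.

```yaml
services:
  gateway:
    image: nginx:1.25-alpine
    ports:
      - "8080:80"        # the ONLY host-exposed port in the stack
    networks:
      - app_net
  db:
    image: postgres:16-alpine
    networks:
      - app_net          # no `ports:` at all -- reachable only as "db"

networks:
  app_net:
    driver: bridge
```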

## The YAML Indentation Hell We Deserve (And Why It’s Better)

Look at this mess. This is what I found on the staging server—a shell script trying to mimic what `docker compose` does natively. It’s fragile, it’s ugly, and it failed the moment a container didn't exit cleanly.

### Code Block 1: The "Cowboy" Failure (What NOT to do)
```bash
# This is garbage. Do not do this.
docker run -d --name redis-prod redis:7.0
docker run -d --name python-worker \
  -e REDIS_HOST=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' redis-prod) \
  --memory="2g" \
  my-python-app:v1.2
# Result: If redis-prod restarts and gets a new IP, python-worker breaks.
# There is no healthcheck. There is no retry logic.
```

The above is why I was paged. The Redis container restarted, the IP changed, and the Python worker spent four hours screaming into the void because it had a hardcoded IP address injected at runtime.

Now, look at how we are doing it moving forward. We are using `docker compose` to handle the service discovery. The internal DNS server built into the Docker daemon handles the mapping of the service name `redis` to whatever ephemeral IP the container currently holds.

## Environment Variable Hell: Why Your Shell History is a Security Breach

I found secrets in the `~/.bash_history` of the service account. Why? Because someone ran `docker run -e DB_PASSWORD=SuperSecretPassword123`. This is amateur hour.

With `docker compose`, we integrate with `.env` files and secret management. We don’t leak credentials into the process tree where any `ps aux` command can scrape them. We use the `env_file` attribute to keep our configuration separate from our execution. This allows us to swap environments (dev, staging, prod) without changing a single line of the core logic.

Furthermore, `docker compose` allows for variable interpolation. We can ensure that the image tags are consistent across the entire stack. No more “oh, I forgot to update the sidecar image version” excuses.
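The pattern is straightforward. A sketch (the `APP_TAG` variable and `worker.env` file name are illustrative):

```yaml
# docker-compose.yml
# Values come from a gitignored .env file in the same directory, e.g.:
#   APP_TAG=v1.2.3
services:
  worker:
    image: our-registry.io/python-worker:${APP_TAG}
    env_file:
      - ./worker.env   # runtime config stays out of the CLI and shell history
```

Swap environments by swapping the env files; the YAML never changes.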

## The Race Condition Symphony: `depends_on` or Die

The most infuriating part of the 48-hour shift was the circular dependency between the Redis cache and the Python worker. The worker would start, try to connect to Redis, find that Redis was still “booting,” and then crash-loop. Because the worker had no internal retry logic (another thing we need to discuss), it eventually hit the kernel’s rate limit for process spawning and triggered a system-wide hang.

Raw `docker` has no concept of “readiness.” It only knows “running.” A container can be “running” while the application inside is still JIT-compiling or waiting for a socket.
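Even with orchestrated startup order, the worker should still tolerate a Redis restart at runtime. A minimal worker-side sketch of the missing retry logic (the `connect_with_retry` helper is hypothetical, not our actual worker code):

```python
import time

def connect_with_retry(connect, attempts=5, base_delay=0.5):
    """Call `connect` until it succeeds, with exponential backoff.

    `connect` is any zero-argument callable that raises on failure,
    e.g. a lambda wrapping redis.Redis(host="redis").ping().
    """
    for attempt in range(attempts):
        try:
            return connect()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: crash and let the orchestrator restart us
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...

# Demo: a fake connection that fails twice before succeeding.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("redis still booting")
    return "PONG"

print(connect_with_retry(flaky, base_delay=0.01))  # prints PONG after two retries
```

Five attempts with backoff covers a normal Redis restart; past that, crashing loudly is the correct behavior.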

### Code Block 2: The “Intermediate” Transition

```yaml
# docker-compose.yml - Docker Compose v2.27.0
services:
  redis:
    image: redis:7.2-alpine
    networks:
      - backend_net
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5

  worker:
    image: our-registry.io/python-worker:2.4.1
    environment:
      - REDIS_HOST=redis
    depends_on:
      redis:
        condition: service_healthy
    networks:
      - backend_net

networks:
  backend_net:
    driver: bridge
```

In this configuration, `docker compose` understands the state of the application, not just the container. The `worker` will not even attempt to start until the `redis` healthcheck returns a successful exit code. This eliminates the race condition that cost me my Saturday night.
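You can hold the same line at the CLI: the `--wait` flag blocks until every service with a healthcheck reports healthy, and exits non-zero if one never does.

```bash
# Exits non-zero if any service fails to reach a healthy state
docker compose up -d --wait
```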

## Project Isolation and the Networking Abyss

When you run `docker compose up`, the tool automatically creates a network prefixed with your project name. This is critical. It means that “Project A” cannot accidentally talk to “Project B” just because they happen to be on the same host.

When you use raw `docker run`, you are likely dumping everything into the default bridge network. This is a flat network where every container can see every other container. It’s a lateral movement dream for an attacker and a debugging nightmare for an SRE.

I spent three hours yesterday trying to figure out why a staging container was receiving traffic meant for a legacy dev instance. It turns out they were both listening on the same internal bridge and the load balancer was round-robining between them because they both responded to the same alias. `docker compose` prevents this by scoping everything to the project.
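To see the scoping in action (project names are illustrative), deploy the same file twice under different project names; each deployment gets its own networks and its own containers:

```bash
# Two fully isolated deployments of the same compose file
docker compose -p staging up -d      # networks and containers prefixed "staging"
docker compose -p legacy-dev up -d   # networks and containers prefixed "legacy-dev"
```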

## Volume Persistence and the Ephemeral Storage Lie

One of you “cowboys” lost 40GB of processed logs because you used a bind mount incorrectly in a manual command. You pointed the container to a directory that didn’t exist on the host, and Docker, in its infinite wisdom (Engine 26.1.4), created it as a root-owned directory. When the application tried to write to it, it failed, fell back to the container’s ephemeral layer, and when the container was deleted, so was the data.

`docker compose` makes volume management explicit. We use named volumes. We define them at the top level. We ensure they are managed by the Docker volume driver, not by some random path on your local disk that won’t exist in the staging environment.
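A minimal sketch of the named-volume pattern (the volume name and mount path are illustrative):

```yaml
services:
  worker:
    image: our-registry.io/python-worker:2.4.1
    volumes:
      - processed_logs:/var/log/app   # named volume, survives container deletion

volumes:
  processed_logs:
    driver: local   # managed by Docker, not a hand-typed host path
```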

## Mandatory Hardening: The Final Standard

This is the final form. Every service we deploy from now on must look like this. If it doesn’t have resource limits, if it doesn’t have healthchecks, and if it isn’t using `docker compose`, it will be deleted by a cron job I am writing tonight.

### Code Block 3: The Production-Hardened Standard

```yaml
# docker-compose.yml for docker compose v2.27.0.
# Note: the top-level `version` key is obsolete in Compose v2 and is omitted here.

services:
  api:
    image: our-registry.io/api-service:v4.12.0
    deploy:
      resources:
        limits:
          cpus: '0.50'
          memory: 512M
        reservations:
          cpus: '0.25'
          memory: 256M
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
    networks:
      - frontend
      - backend
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

  db:
    image: postgres:16-alpine
    volumes:
      - db_data:/var/lib/postgresql/data
    environment:
      - POSTGRES_DB_FILE=/run/secrets/db_name
      - POSTGRES_USER_FILE=/run/secrets/db_user
      - POSTGRES_PASSWORD_FILE=/run/secrets/db_password
    secrets:
      - db_name
      - db_user
      - db_password
    networks:
      - backend
    deploy:
      resources:
        limits:
          memory: 1G

networks:
  frontend:
    internal: false
  backend:
    internal: true

volumes:
  db_data:
    driver: local

secrets:
  db_name:
    file: ./secrets/db_name.txt
  db_user:
    file: ./secrets/db_user.txt
  db_password:
    file: ./secrets/db_password.txt
```

Notice the `internal: true` flag on the `backend` network. This ensures that the database has zero path to the outside world. It can only talk to the API. This is the level of isolation I expect. Notice the `resources` block. This prevents the exit-code-137 OOM kills that have been haunting my pager. If your app leaks memory, it dies alone, without taking the whole node with it.

## The “I Don’t Care About Your Local Machine” Clause

I am tired of hearing “it worked on my machine.” Your machine is a MacBook with 32GB of RAM and a different architecture. Staging is a hardened Linux environment with strict cgroup constraints.

By using `docker compose`, we can use multiple override files. You can have your `docker-compose.override.yml` for your local development—keep your debug ports, keep your hot-reloading bind mounts. But the base `docker-compose.yml` must be the source of truth for how the application behaves in the wild.

If you cannot run `docker compose up` and have the entire stack initialize correctly, your code is broken. Period. No more manual setup steps. No more “oh, you have to run this script first to seed the DB.” Use an init container or an entrypoint script defined in the YAML.
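For reference, a sketch of the override pattern (the debug port and mount path are illustrative); `docker compose up` merges `docker-compose.override.yml` over the base file automatically:

```yaml
# docker-compose.override.yml -- local development only, never deployed
services:
  api:
    ports:
      - "5678:5678"        # debugger port, local only
    volumes:
      - ./src:/app/src     # hot-reload bind mount
```

Run `docker compose config` to inspect the merged result before you claim staging behaves differently.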

## Terminal Verification: The Only Way to Work

From this point forward, before you even think about pushing a change, you will verify your local state using the following commands. I will be checking the logs.

```console
# Verify all services are healthy, not just "running"
$ docker compose ps

NAME                IMAGE               COMMAND                  SERVICE             STATUS              PORTS
stack-api-1         api-service:v4.12   "docker-php-entrypoi…"   api                 running (healthy)   0.0.0.0:80->3000/tcp
stack-db-1          postgres:16         "docker-entrypoint.s…"   db                  running (healthy)   5432/tcp

# Check for resource-related warnings
$ docker compose logs --tail=100 | grep -iE "error|critical|oom|fail"
```

If `docker compose ps` shows anything other than `(healthy)`, you do not merge. If you see a 137 exit code in the logs, you do not ask me for help until you have added memory limits and profiled your heap usage.

## The Cost of Your Inefficiency

Do you know what happens when a node goes down because of an OOMKill? It’s not just one container. The kernel gets desperate. It starts killing processes based on an `oom_score`. Often, it kills the SSH daemon or the monitoring agent before it hits your bloated Python script. This leaves the node in a “zombie” state—it’s alive, but unreachable. I have to manually power-cycle the instance via the cloud console, wait for the EBS volumes to detach (which always takes forever), and then rebuild the local cache.

This process takes 45 minutes. 45 minutes of downtime because you couldn’t be bothered to write twenty lines of YAML.

We are SREs. Our job is to manage entropy. Your job, as developers, is to stop creating it. `docker compose` is the tool that bridges that gap. It provides the context propagation we need to understand how services relate to one another. It ensures that ephemeral storage is handled correctly. It prevents the race conditions that turn a simple deployment into a weekend-long nightmare.

## Mandatory Action Items

1. Audit: Every repository must have a `docker-compose.yml` by EOD Tuesday.
2. Cleanup: Run `docker system prune --volumes -f` on your dev machines to clear out the hundreds of orphaned volumes and networks your manual commands have left behind. (Note the `--volumes` flag: plain `docker system prune` leaves volumes untouched.)
3. Standardize: Use the hardened template provided in Code Block 3. No exceptions.
4. Education: If you don’t understand how `depends_on` works with healthchecks, read the documentation for `docker compose` v2.27.0. Do not ask me. I am going to sleep.

If I see another `docker run` command in our CI/CD pipelines or in the shell history of any staging server, I will personally ensure that your next “quick fix” is reviewed with a microscope and a blowtorch.

We are professionals. Start acting like it. Use `docker compose`. Stop the bleeding.

I’m going home. Don’t page me unless the data center is literally on fire. And even then, check the healthchecks first.

SRE Lead Out.
