POST-INCIDENT AUDIT: REPORT #88-B (CRITICAL SYSTEM COMPROMISE)
DATE: 2024-10-14
AUDITOR: Senior Infrastructure Architect (Security/Hardening)
SUBJECT: The systematic failure of docker compose deployments in the “Alpha-Omega” environment (host prod-srv-01).
1. The Incident Log
The following is a raw dump from the host prod-srv-01 during the initial breach detection. The developer responsible claimed the setup was “standard.” I claim it was an invitation to a funeral.
# journalctl -u docker.service --since "2024-10-14 02:00:00"
Oct 14 02:10:15 prod-srv-01 dockerd[1102]: container_id=a4f2e... exec "curl http://169.254.169.254/latest/meta-data/iam/security-credentials/"
Oct 14 02:12:44 prod-srv-01 kernel: [10422.12] audit: type=1400 audit(1728871964.123:45): apparmor="DENIED" operation="mount" info="failed flags check" error=-13 profile="docker-default" name="/proc/" pid=14202 comm="python3"
Oct 14 02:15:01 prod-srv-01 dockerd[1102]: container_id=a4f2e... OOM kill detected. Memory limit exceeded.
# docker stats --no-stream
CONTAINER ID   NAME      CPU %     MEM USAGE / LIMIT   NET I/O          BLOCK I/O   PIDS
a4f2e8b1c0d9   web_app   185.20%   1.99GiB / 2GiB      12.4GB / 8.2GB   14MB / 0B   842
# iptables -L DOCKER -n -t nat
Chain DOCKER (2 references)
target prot opt source destination
RETURN all -- 0.0.0.0/0 0.0.0.0/0
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:6379 to:172.18.0.3:6379
The post-mortem revealed a classic disaster. A developer used a standard docker compose file. They exposed Redis to the world. They didn’t set resource limits. They didn’t drop capabilities. An attacker hit the Redis port, used a known Lua script injection to gain shell access, and immediately started probing the cloud provider’s metadata service. The only reason we caught it was a poorly written cryptominer that tripped the OOM killer. We didn’t “win.” We got lucky.
2. The Teardown: Why Your YAML Is a Liability
The default behavior of docker compose is built for speed, not for survival. When you run docker compose up, you are handing the keys of your kernel to a set of binaries you likely haven’t audited.
First, the ports directive. Most developers think 6379:6379 means “open this port on the firewall.” No. It means “bypass the host’s ufw or firewalld and inject a DNAT rule directly into the iptables DOCKER chain.” Your host-level firewall is now irrelevant. If that container is running as root—which it is, by default—you have effectively bridged your internal memory store to the public internet with zero filtering.
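Here is the failure mode in miniature. A sketch with a hypothetical service name; the published port produces exactly the DNAT rule from the incident log:

services:
  cache:
    image: redis:7-alpine
    ports:
      - "6379:6379"   # shorthand for 0.0.0.0:6379 — a DNAT rule lands in the DOCKER chain, ahead of ufw

Bring it up, then compare ufw status against iptables -t nat -L DOCKER -n. The firewall says the port is closed; the NAT table says otherwise. The NAT table wins.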
Second, the networking. The default bridge network allows every container to talk to every other container. Why does the frontend need to reach the database’s management port? It doesn’t. But in a default docker compose stack, the blast radius is the entire subnet.
Third, capabilities. Linux kernels use capabilities to break down the “root” privilege into smaller pieces. By default, Docker grants containers things like NET_RAW (perfect for ARP spoofing) and MKNOD (creation of special files). Most applications need none of these. Leaving them active is negligence.
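You can watch the default grants yourself. A quick check; the hex mask below is the typical effective set for an unhardened container and may vary by Docker version:

docker run --rm alpine grep CapEff /proc/self/status
CapEff: 00000000a80425fb
capsh --decode=00000000a80425fb   # run on the host: cap_net_raw, cap_mknod, and a dozen more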
3. The Reconstruction: Hardening the Stack
We are going to rebuild this using docker compose v2.27.0. We are going to treat the host like a hostile environment and the containers like untrusted actors. We will not use “magic.” We will use explicit, restrictive configurations.
The Fallacy of Default Bridge Networking
The first step is to kill the default network. We will define multiple networks with internal: true to ensure that back-end services cannot reach the outside world, even if the container is compromised.
networks:
  frontend_net:
    driver: bridge
    driver_opts:
      com.docker.network.bridge.name: br-frontend
  backend_net:
    internal: true
    driver: bridge
    driver_opts:
      com.docker.network.bridge.name: br-backend
By setting internal: true, Docker configures iptables to drop any packet leaving that bridge that isn’t destined for another container on the same bridge. This prevents exfiltration. If an attacker gains a shell on your database container, they can’t curl their command-and-control server. They are trapped in a dark room.
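Don't take my word for it. A verification sketch, assuming the networks above and a db service attached to backend_net (the /health endpoint is hypothetical):

docker compose exec db wget -qO- -T 3 http://example.com        # fails: no route off an internal bridge
docker compose exec db wget -qO- -T 3 http://api:8080/health    # peers on the same bridge remain reachable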
Capability Leaks and the Root User Trap
Every container must run as a non-privileged user. No exceptions. But even then, we must strip the kernel privileges. In docker compose, we use cap_drop and security_opt.
services:
  app:
    image: our-hardened-python:3.12-slim
    user: "1000:1000"
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE
    security_opt:
      - no-new-privileges:true
      # add a seccomp entry only to load a stricter custom profile,
      # e.g. seccomp:./profiles/app.json — never "unconfined", which
      # disables syscall filtering entirely. Omit it to keep the default profile.
cap_drop: [ALL] is the baseline. If your app needs to bind to port 80, you add back NET_BIND_SERVICE. Nothing else. (One subtlety: for a non-root user, an added capability only becomes effective if the binary also carries the matching file capability, set with setcap at image build time.) The no-new-privileges:true flag is the most important line in the file. It prevents processes from gaining new privileges via setuid or setgid binaries. It stops a compromised low-privilege user from escalating to root within the container namespace.
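Verify the flag actually landed; the kernel reports it per process. A sketch against the app service above:

docker compose exec app grep NoNewPrivs /proc/self/status
NoNewPrivs:     1     # 1 means setuid/setgid binaries can no longer raise privileges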
The iptables Treachery and Port Binding
Stop binding to 0.0.0.0. If you must expose a port, bind it to a specific internal IP or 127.0.0.1 if you are running a local proxy like Nginx or HAProxy.
ports:
  - "127.0.0.1:8080:8080"
This ensures the port is only accessible to the local host. If you need public access, you handle it at the edge, not at the container level. The interaction between docker compose and the host’s routing table is too opaque to trust with public-facing services.
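Confirm where the listener actually sits. Output abridged; the pid will differ on your host:

ss -tlnp | grep 8080
LISTEN 0 4096 127.0.0.1:8080 0.0.0.0:* users:(("docker-proxy",pid=2214,fd=4))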
Resource Exhaustion as a Denial of Service
A container without limits is a time bomb. An attacker doesn’t need to steal data to win; they can just consume every CPU cycle or every byte of RAM, crashing the host. We use deploy configurations even in non-swarm mode (Compose V2 respects these).
deploy:
  resources:
    limits:
      cpus: '0.50'
      memory: 512M
    reservations:
      cpus: '0.25'
      memory: 256M
ulimits:
  nproc: 65535
  nofile:
    soft: 20000
    hard: 40000
Setting ulimits is vital. A fork bomb in a container can exhaust the host’s process table. By limiting nproc, we contain the explosion. We also cap memory (deploy.resources.limits.memory above) to prevent the OOM killer from reaping critical host processes like sshd because a leaky Node.js app decided to eat 16GB of RAM.
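Check that the limits took, from inside the container. A sketch assuming a cgroup v2 host and the app service above:

docker compose exec app cat /sys/fs/cgroup/memory.max    # 536870912 bytes == 512M
docker compose exec app sh -c 'ulimit -u'                # 65535: the ceiling a fork bomb hits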
Orchestration Logic and Healthcheck Rigidity
Most developers use depends_on as a simple list. This is useless. It only checks that the container has started, not that the service inside it is functional. We need the long-form depends_on with service_healthy conditions. That prevents the crash loop where the app starts, fails to connect to the database, dies, restarts, and fills the logs while burning CPU.
healthcheck:
  test: ["CMD-SHELL", "pg_isready -U user -d db"]
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 30s
Then, in the application service:
depends_on:
  db:
    condition: service_healthy
This forces docker compose to respect the actual state of the infrastructure. It ensures that the application doesn’t even attempt to start until the database is ready to accept connections.
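You can watch the gate hold during startup; docker compose up prints the Waiting/Healthy transitions for gated dependencies, and you can poll the state directly (container name from the spec in section 4):

docker inspect --format '{{.State.Health.Status}}' hardened_db
healthy

One caveat: service_healthy gates startup ordering only. Once running, the application still needs its own reconnect logic, because the database can disappear at any time.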
Filesystem Integrity and the Read-Only Mandate
A container’s root filesystem should be immutable. If an attacker gains access, they shouldn’t be able to install a rootkit, modify /etc/shadow, or drop a persistence script.
read_only: true
tmpfs:
  - /tmp
  - /run
  - /var/cache/nginx
By setting read_only: true, we make the container’s root filesystem immutable at runtime. Any attempt to write to it returns an error. For the few paths that genuinely need write access (like /tmp or pid files), we use tmpfs. This keeps the writes in memory, and they vanish the moment the container restarts. No persistence. No footprint.
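The test is satisfyingly blunt. Any path outside the tmpfs mounts will do:

docker compose exec app touch /usr/bin/backdoor
touch: /usr/bin/backdoor: Read-only file system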
4. The Hardened Spec
This is the final, audited docker-compose.yaml. It is not “easy” to use. It will break your “hot-reloading” developer workflows. It will require you to actually understand your application’s requirements. That is the point.
# Compose v2 ignores the top-level "version" key; it is obsolete and omitted here.
services:
  db:
    image: postgres:16-alpine
    container_name: hardened_db
    # "postgres" is uid/gid 70 in the alpine image. With cap_drop: ALL the
    # entrypoint cannot su-exec down from root, so we start as postgres
    # directly; the bind-mounted data directory must be owned by 70:70.
    user: "70:70"
    environment:
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    networks:
      - backend_net
    volumes:
      - db_data:/var/lib/postgresql/data:rw
    secrets:
      - db_password
    deploy:
      resources:
        limits:
          memory: 1G
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "postgres"]
      interval: 5s
      timeout: 5s
      retries: 5
    cap_drop:
      - ALL
    security_opt:
      - no-new-privileges:true
    read_only: true
    tmpfs:
      - /run/postgresql
      - /tmp

  api:
    image: our-registry/api-service:v1.4.2
    container_name: hardened_api
    user: "1001:1001"
    depends_on:
      db:
        condition: service_healthy
    networks:
      - backend_net
      - frontend_net
    environment:
      DB_HOST: db
      DB_PASSWORD_FILE: /run/secrets/db_password
    secrets:
      - db_password
    cap_drop:
      - ALL
    security_opt:
      - no-new-privileges:true
    read_only: true
    tmpfs:
      - /tmp
      - /run
    deploy:
      resources:
        limits:
          cpus: '1.0'
          memory: 2G
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"

  proxy:
    image: nginx:alpine
    container_name: hardened_proxy
    ports:
      - "127.0.0.1:80:80"
      - "127.0.0.1:443:443"
    networks:
      - frontend_net
    depends_on:
      api:
        condition: service_started
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE
      - CHOWN
      - SETGID
      - SETUID
    security_opt:
      - no-new-privileges:true
    read_only: true
    tmpfs:
      - /var/cache/nginx
      - /var/run
      - /tmp
    logging:
      driver: "syslog"
      options:
        syslog-address: "udp://127.0.0.1:514"
        tag: "nginx"

networks:
  frontend_net:
    driver: bridge
    driver_opts:
      com.docker.network.bridge.name: br-frontend
  backend_net:
    internal: true
    driver: bridge
    driver_opts:
      com.docker.network.bridge.name: br-backend

volumes:
  db_data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /mnt/secure_storage/postgres_data

secrets:
  db_password:
    file: ./secrets/db_password.txt
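Before deploying, make the tooling check your work. docker compose config validates the file and prints the fully resolved model; --quiet suppresses the output and leaves only the exit code:

docker compose config --quiet && echo "spec valid"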
Analysis of the Hardened Spec
- Isolation: The segmentation is the networks themselves. The proxy reaches the api over frontend_net; the api reaches the db over backend_net; no other path exists, and backend_net is internal, so the database tier has no route to the outside world. This is the “Zero Trust” model applied to the bridge. Note that I deliberately did not set com.docker.network.bridge.enable_icc: "false": on a user-defined bridge that option drops all inter-container traffic, which would also sever the legitimate proxy-to-api and api-to-db paths.
- Secrets Management: We are not using environment variables for passwords. POSTGRES_PASSWORD is a security hole; it shows up in docker inspect and /proc/1/environ. We use secrets, which mounts the password as a file under /run/secrets/.
- Logging: The proxy logs to syslog on the host, which relays to the remote logging server. If an attacker wipes the container logs, the evidence has already left the box. The API uses the json-file driver with strict rotation to prevent disk exhaustion.
- User Namespacing: Although not shown in the YAML (it is a daemon.json setting; see the sketch below), this configuration assumes the host has userns-remap enabled. “Root” in the container is then an unprivileged high-range UID on the host.
- Volume Hardening: The database volume is a bind mount to a specific, encrypted partition (/mnt/secure_storage). We don’t trust Docker’s default volume management to handle data persistence.
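For completeness, the daemon-side setting referenced in the User Namespacing bullet. A minimal sketch of /etc/docker/daemon.json, assuming the default dockremap user; the daemon must be restarted after the change:

{
  "userns-remap": "default",
  "no-new-privileges": true
}

With "default", Docker allocates a subordinate UID range to a dockremap user; root inside every container maps to an unprivileged UID in that range on the host.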
5. The Warning
I have spent the last decade watching developers treat docker compose like a toy. They copy-paste YAML snippets from Stack Overflow and wonder why their infrastructure is a sieve. They prioritize “developer experience” and “velocity” while I am the one who has to explain to the board why our customer data is being sold on a Telegram channel.
The future of container orchestration is not looking better. We are moving toward more abstraction, more “serverless” layers that hide the underlying insecurity. People think that moving to the cloud solves these problems. It doesn’t. It just moves the iptables rules to a different API.
If you use this hardened spec, your developers will complain. They will say it’s “too hard” to debug. They will say they can’t “just exec in and fix things.” Good. They shouldn’t be “fixing things” in production. They should be building artifacts that are secure by design.
Every open port is an insult. Every default configuration is a back door. If you aren’t paranoid, you aren’t doing your job. You have been warned.
AUDIT COMPLETE.
STATUS: FAIL (Remediation Required)
SIGNATURE: [REDACTED]