```text
[2024-05-22T03:14:02.881Z] ERROR: Container "api-gateway" exited with code 137 (OOMKilled)
[2024-05-22T03:14:05.112Z] CRITICAL: Service "auth-provider" failed to bind to 0.0.0.0:8080. Address already in use.
[2024-05-22T03:14:05.115Z] FATAL: Dependency check failed. "postgres-db" not reachable at 172.17.0.2:5432.
[2024-05-22T03:14:05.118Z] STACK_TRACE: deploy.sh: line 44: docker run -d --name api-gateway …
[2024-05-22T03:14:05.120Z] SYSTEM_STATE: Load Average 45.12, 38.01, 22.10. Disk I/O 98% saturated.
[2024-05-22T03:14:05.122Z] TERMINATING: Manual cleanup required. God is dead and we killed Him with a shell script.
```
# The YAML That Bit Back: A Post-Mortem on Orchestration Laziness
I’ve been awake for 48 hours. My blood is 60% espresso and 40% spite. While the rest of the engineering team was dreaming of "clean code" and "agile velocity," I was watching our production environment melt into a puddle of unrouted packets and orphaned processes. The culprit? A "clever" bash script named `deploy.sh`, written by a junior developer who thought `docker compose` was "too much overhead for a simple microservice stack."
We are running Docker Engine v25.0.3. We have the tools. We have the specs. And yet, I spent my Sunday morning manually killing zombie containers because someone decided that a sequence of `docker run` commands was a viable orchestration strategy. It isn’t. It’s a suicide note written in Bourne Shell.
## 1. The Bash Script Death Spiral
The incident began when the `deploy.sh` script attempted to update the `api-gateway`. In a sane world—a world governed by `docker compose`—this would be an atomic operation. Instead, the script executed a `docker stop api-gateway` followed by a `docker rm api-gateway`. Between those two commands, the health check for our load balancer failed, the auto-scaler panicked, and the script continued blindly to the next line.
Manual `docker run` commands are a professional liability because they lack state awareness. When you execute `docker run`, you are throwing a binary at the kernel and hoping it sticks. There is no source of truth. There is no desired state. There is only the immediate, fleeting command.
The junior’s script didn't account for the fact that the `auth-provider` container had crashed five minutes earlier. Because there was no orchestration layer to verify the health of dependencies, the `api-gateway` started, tried to connect to a non-existent auth service, and entered a crash loop. But the script didn't care. It just kept running `docker run` for the next ten services, each one failing more spectacularly than the last.
By the time I was paged at 3:15 AM, the host was a graveyard of containers in `Exited (1)` states, all of them holding onto port bindings that prevented the script from being re-run. This is the "Death Spiral." Without the declarative nature of `docker compose`, you aren't managing a system; you're playing a high-stakes game of Whac-A-Mole where the hammer is broken and the moles are on fire.
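For contrast, here is the shape of the declarative alternative; a minimal sketch (the `stable` tag and the restart policy are my assumptions, not the incident config):

```yaml
services:
  api-gateway:
    image: api-gateway:stable
    restart: unless-stopped  # the engine supervises restarts; no shell loop required
```

With this file, `docker compose up -d` reconciles reality against the declared state: it recreates only what changed, and a failed service shows up in `docker compose ps` instead of being silently skipped.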
## 2. The Race Condition of the Unchecked Container
The second stage of the failure was the database. In the junior's "orchestration" script, the PostgreSQL container was started first, followed immediately by the application.
```bash
# The "Clever" Way (i.e., The Wrong Way)
docker run -d --name postgres-db postgres:16
docker run -d --name app-service my-app:latest
```
The problem? PostgreSQL takes approximately 10 to 15 seconds to initialize its internal storage and start listening on port 5432. The application container, written in a language that prizes “startup performance,” tried to connect in 0.5 seconds. It failed. It died. It didn’t have a retry loop because “the infrastructure should handle it.”
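A retry loop is not exotic infrastructure; it’s a dozen lines of shell. A minimal sketch (the `wait_for` helper and its `RETRIES`/`DELAY` knobs are my invention, not anything from the incident script):

```bash
# wait_for: retry a command until it succeeds, or give up after RETRIES attempts.
# RETRIES and DELAY are hypothetical tunables, not Docker features.
wait_for() {
  retries="${RETRIES:-5}"
  delay="${DELAY:-1}"
  i=0
  while ! "$@"; do
    i=$((i + 1))
    if [ "$i" -ge "$retries" ]; then
      return 1    # out of attempts: fail loudly instead of crash-looping
    fi
    sleep "$delay"
  done
}
```

An entrypoint could call `wait_for pg_isready -h postgres-db -U postgres` to absorb those 10 to 15 seconds. Better still is to push the waiting into the orchestrator, which is exactly what `depends_on` with a `healthcheck` does.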
This is where we exploit the `depends_on` feature of the Compose specification (we run plugin v2.24.6). We don’t just want the container to be running; we want it to be ready.
```yaml
services:
  postgres-db:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 5
    networks:
      - backend_mesh
  app-service:
    image: my-app:latest
    depends_on:
      postgres-db:
        condition: service_healthy
    networks:
      - backend_mesh
    deploy:
      resources:
        limits:
          cpus: '0.50'
          memory: 512M
```
In this snippet, `docker compose` acts as the adult in the room. It understands the dependency graph. It waits for the `pg_isready` command to return a zero exit code before it even starts the `app-service` container. This prevents the 3 AM “Database not found” errors that haunt my nightmares. The junior dev’s bash script has no concept of a `healthcheck`. It only knows whether the `docker run` command itself succeeded, which it always does, because the daemon successfully started the container, even if the process inside is currently vomiting stack traces.
## 3. Networking is a Lie (and the Default Bridge is a Trap)
If you don’t define a network in `docker compose`, or if you manually run containers, they often end up on the default bridge network. This is a legacy wasteland. On the default bridge, containers cannot resolve each other by name. You have to use IP addresses.
The junior’s script tried to solve this by parsing `docker inspect` output to find the IP of the database and passing it as an environment variable to the app.
War Story: Three years ago, at a fintech startup that shall remain nameless, we had a dev who did exactly this. One night, the Docker daemon restarted. The containers came back up in a different order. The database, which used to be `172.17.0.2`, was now `172.17.0.3`. The application spent four hours sending encrypted transaction data to a defunct Nginx cache container that happened to grab the old IP. We lost $40k in unrecoverable API calls because someone thought `docker compose` networks were “too complex.”
Using `docker compose` forces the creation of a user-defined bridge network. This provides automatic DNS resolution. The `app-service` can simply look for `postgres-db`.
```yaml
networks:
  backend_mesh:
    driver: bridge
    ipam:
      config:
        - subnet: 10.5.0.0/16
          gateway: 10.5.0.1
```
By explicitly defining the network, we isolate the traffic. The `api-gateway` doesn’t need to be on the same network as the `postgres-db`. We can create a `frontend_net` and a `backend_net`, and only the `app-service` sits on both. This is basic security posture, yet it’s impossible to manage via a bash script without writing a 500-line wrapper around `docker network create` and `docker network connect`.
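That split is a few lines of YAML, not a 500-line wrapper; a sketch (the service names and tags are my placeholders):

```yaml
networks:
  frontend_net:
  backend_net:

services:
  api-gateway:
    image: api-gateway:stable
    networks: [frontend_net]
  app-service:
    image: my-app:latest
    networks: [frontend_net, backend_net]  # the only service bridging the two
  postgres-db:
    image: postgres:16
    networks: [backend_net]                # never reachable from the edge
```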
## 4. The UID/GID Purgatory of Persistent Volumes
We hit a Sev-0 because the logs couldn’t be written. Why? Because the bash script ran `docker run -v /var/log/app:/app/logs`. On the host, `/var/log/app` was owned by `root`. Inside the container, the app was running as `node` (UID 1000).
Result: `EACCES: permission denied, open '/app/logs/error.log'`.
The container crashed. The bash script, in its infinite wisdom, saw the crash and tried to restart it. But it didn’t clean up the volume. It just kept trying, creating a loop that filled the kernel’s process table.
In `docker compose`, we can manage volume definitions and even use `user:` mapping or bind mounts with specific propagation settings. But more importantly, Compose allows us to standardize the environment across every developer’s machine.
```yaml
services:
  app-service:
    image: my-app:latest
    volumes:
      - type: bind
        source: ./logs
        target: /app/logs
        read_only: false
      - type: volume
        source: app_data
        target: /data
    user: "${UID}:${GID}"

volumes:
  app_data:
    driver: local
```
By using `${UID}:${GID}`, we can pass the host user’s identity into the container at runtime. This prevents the “it works on my machine” syndrome where a dev runs everything as root on their Ubuntu laptop, but the production RHEL server (rightfully) screams in agony. The junior’s script had hardcoded paths. When it ran on the CI/CD runner, it tried to mount `/Users/juniordev/project/logs`. There is no `/Users` on a Linux production node. The deployment failed, the script didn’t catch the error, and we pushed a broken config to the entire cluster.
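One wrinkle: bash defines `UID` but doesn’t export it, so Compose interpolation sees nothing unless you hand it the values. A convention I like (my approach, not a Compose feature) is to stamp the host identity into `.env` before bringing the stack up:

```bash
# Write the host UID/GID into .env so Compose can interpolate ${UID}:${GID}
printf 'UID=%s\nGID=%s\n' "$(id -u)" "$(id -g)" > .env
```

Compose reads `.env` from the project directory automatically, so a plain `docker compose up -d` then runs the container as the invoking user.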
## 5. Environment Variable Poisoning and Secret Leakage
The bash script was a sieve for secrets. To get the environment variables into the containers, the script used a series of `-e` flags.

```bash
docker run -e DB_PASSWORD=$DB_PASSWORD -e API_KEY=$API_KEY ...
```

Do you know where those variables end up? In the process list. Anyone with access to the host can run `ps aux` or `docker inspect` and see the production database password in plain text. It’s an auditor’s nightmare and a security engineer’s reason for early retirement.

`docker compose` supports `env_file` and, more importantly, it respects the hierarchy of configuration. We can use a `.env` file for local development and override it with actual secrets management in production.
```yaml
services:
  backend:
    image: backend:v1.2.3
    env_file:
      - .env.base
      - .env.production
    environment:
      - NODE_ENV=production
      - DEBUG=false
    secrets:
      - db_password
      - api_key

secrets:
  db_password:
    external: true
  api_key:
    file: ./secrets/api_key.txt
```
Using the `secrets` directive in Compose (even in non-Swarm mode, though it’s limited to file mounts) is a step toward sanity. It separates the configuration of the application from the credentials of the application. The junior’s script blurred these lines until they were non-existent. I found the production Stripe key in the shell history of the jump box. I’m still vibrating with rage.
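Inside the container, the credential arrives as a file under `/run/secrets/`, and the official `postgres` image already honors the `*_FILE` convention; a sketch reusing the `db_password` secret from above:

```yaml
services:
  postgres-db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password  # read from the mounted file, not the env
    secrets:
      - db_password

secrets:
  db_password:
    external: true
```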
## 6. Profiles: Pruning the Resource-Hungry Forest
Our local development environment is a beast. We have 42 microservices. No single developer needs all 42 running to fix a CSS bug in the billing UI. The junior’s bash script, however, didn’t have “modes.” It just started everything.
This led to “The Great Meltdown of Tuesday,” where three new hires tried to run the script simultaneously on their 16GB MacBooks. The resulting OOM kills and swap-file thrashing brought the local network to its knees as Docker tried to pull 80GB of images at once.
Compose V2 (specifically v2.24.6) handles this with `profiles`.
```yaml
services:
  frontend:
    image: frontend:latest
    profiles: ["ui", "full-stack"]
    ports:
      - "3000:3000"
  billing-service:
    image: billing:latest
    profiles: ["billing", "full-stack"]
  legacy-monolith:
    image: monolith:latest
    profiles: ["debug"]
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
```
With `docker compose --profile ui up`, a developer only gets what they need. The bash script didn’t have this granularity. It was all or nothing. And “all” usually meant the developer’s machine became a very expensive space heater.
The `profiles` feature also allows us to include debugging tools (like a containerized Wireshark or a database GUI) that should never be started in production but are invaluable in staging. The bash script just ran whatever was in the `docker run` list, meaning we accidentally deployed `adminer` (a database management tool) to a public-facing endpoint. We were lucky the firewall caught it. The script certainly didn’t.
## 7. The Orchestration Tax and the V2 Specification
Let’s talk about lifecycle nuances: `tty: true`, `stdin_open: true`, and above all signal handling. In a manual `docker run`, your process often ends up wrapped in a shell as PID 1, never receives `SIGTERM`, and takes 10 seconds to die (until the daemon gives up and sends `SIGKILL`).
When you use `docker compose`, the lifecycle management is handled according to the Compose Specification. When I run `docker compose stop`, the plugin sends `SIGTERM` to the processes in the correct order, respecting the `stop_grace_period`.
The junior’s script used `docker kill`. It didn’t wait for the application to flush its buffers or close database connections. We ended up with corrupted WAL files in Postgres and half-written JSON payloads in our S3 buckets.
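The graceful-shutdown knobs live right in the Compose file; a sketch (the 30s window is my guess at a sane flush budget, and `init: true` is the standard fix when the app doesn’t handle signals as PID 1):

```yaml
services:
  app-service:
    image: my-app:latest
    init: true               # run a tiny init as PID 1 so SIGTERM actually reaches the app
    stop_grace_period: 30s   # time allowed to flush buffers before the daemon sends SIGKILL
```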
The V2 plugin (v2.24.6) also introduced better support for the `build` context.
```yaml
services:
  custom-app:
    build:
      context: .
      dockerfile: Dockerfile.prod
      args:
        - BUILD_VERSION=1.2.3
      cache_from:
        - type=registry,ref=myrepo/cache:latest
    image: myrepo/custom-app:v1.2.3
```
This allows us to exploit BuildKit features like cache imports and multi-stage build targets directly from the orchestration file. The bash script was doing `docker build -t my-app .` every single time, ignoring the cache and wasting 20 minutes of CI time per PR.
I spent four hours of my shift just rewriting the networking bridge configuration because the bash script had created a conflict with the corporate VPN’s routing table. `docker compose` allows you to define the `com.docker.network.bridge.name` and other driver options to avoid this.
```yaml
networks:
  corporate_safe:
    driver: bridge
    driver_opts:
      com.docker.network.bridge.name: br-prod-safe
      com.docker.network.bridge.enable_icc: "true"
```
The junior didn’t know what ICC (Inter-Container Communication) was. They just knew that “sometimes the containers can’t talk,” so they disabled the host firewall. Let that sink in. They disabled the firewall on a production node because they couldn’t figure out Docker networking.
## Hard Truths
- Your bash script is not “simpler” than YAML; it is a debt-laden hallucination that will fail the moment a network packet is dropped.
- If you are using `docker run` in a production environment, you are not an SRE; you are a digital arsonist.
- “It works on my machine” is a valid reason for immediate revocation of SSH access.
- The default Docker bridge is a security hole and a DNS nightmare; use named networks or don’t use Docker at all.
- `depends_on` without a `healthcheck` is just a race condition with a fancy name.
- If you don’t set memory and CPU limits in your Compose file, you are giving your containers a license to kill the host.
- The time you “save” by avoiding `docker compose` will be repaid tenfold in the form of 3 AM incidents and my caffeine-induced wrath.
- Version 2 of the Compose plugin is not a suggestion; it is the standard. Use it or find a job in a field where failure doesn’t involve a pager.
- A container that requires manual `chown` commands to start is a failed container.
- Orchestration is not about making things easy; it’s about making them predictable. Your script is the opposite of predictable.
I’m going to sleep now. If I get paged because someone touched that `deploy.sh` script again, I’m not fixing the server. I’m deleting the repository. Use `docker compose`. It’s not for the machine’s sake; it’s for mine.