INCIDENT SUMMARY
| Attribute | Details |
|---|---|
| Incident ID | BKR-2024-09-12-CRITICAL |
| Severity | Level 0 (Existential Threat) |
| Status | Resolved (Post-Mortem Stage) |
| Duration | 74 Hours, 12 Minutes |
| Impact | $412,000 in unplanned AWS spend; API latency spike from 45ms to 12,000ms; total CI/CD paralysis. |
| Primary Root Cause | Failure to implement AWS best practices regarding VPC Endpoints, IAM scoping, and Terraform state management. |
TIMELINE OF THE COLLAPSE
- 2024-09-10 09:00 EST: Migration of the “Legacy-Core” monolith to the new prod-v2 VPC begins. The engineering lead decides to “keep it simple” by using a single Transit Gateway for all inter-region traffic.
- 2024-09-10 14:30 EST: Terraform apply (v1.7.0) finishes. The state file is stored in an S3 bucket named company-tf-state-prod. Versioning is disabled because “we wanted to save on storage costs.”
- 2024-09-11 02:15 EST: PagerDuty triggers. The OrderProcessor service is timing out. Latency has spiked from 45ms to 12,000ms.
- 2024-09-11 04:00 EST: I am woken up. I find the NAT Gateway in us-east-1a is processing 4.2 TB of data per hour.
- 2024-09-11 08:00 EST: CFO sends an urgent Slack message. The AWS Cost Explorer “Daily Spend” view shows a vertical line. We are burning $15,000 an hour.
- 2024-09-11 11:00 EST: A junior engineer attempts to “fix” the routing table and accidentally deletes the Terraform state file from S3 using an over-privileged IAM role.
- 2024-09-12 15:00 EST: Manual recovery of 450 resources begins. The “Cloud-Native” dream is officially a nightmare.
The IAM Policy That Ate Our Budget
The first mistake wasn’t technical; it was philosophical. The team treated IAM like a nuisance rather than a perimeter. They wanted “velocity,” which in this industry is usually code for “I don’t want to read the documentation.”
We found a role named FullAccessAppRole. It was attached to every EC2 instance in the auto-scaling group. Here is the JSON policy that allowed a compromised container to start spinning up p4d.24xlarge instances in regions we don’t even operate in:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "*",
"Resource": "*"
},
{
"Effect": "Allow",
"Action": "iam:PassRole",
"Resource": "arn:aws:iam::123456789012:role/*"
}
]
}
The iam:PassRole on * is the smoking gun. It allowed any service with this role to pass any other role to a new service. When the OrderProcessor was hit with a basic SSRF (Server-Side Request Forgery) attack, the attacker didn’t just steal data; they used our own infrastructure to mine Monero.
I ran this jq-driven loop against our customer-managed policies to see how deep the rot went:
aws iam list-policies --scope Local --only-attached --output json | \
  jq -r '.Policies[] | "\(.Arn) \(.DefaultVersionId)"' | \
  while read -r arn version; do
    # Pull each policy's *default* version (not a hardcoded v1) and flag full-admin statements
    aws iam get-policy-version --policy-arn "$arn" --version-id "$version" --output json | \
      jq '.PolicyVersion.Document.Statement[] | select(.Action == "*" and .Resource == "*")'
  done
The output was a wall of text. We had 14 different policies granting administrative access to service accounts that only needed to read from a single S3 bucket. By ignoring AWS best practices for “Least Privilege,” we handed the keys to the kingdom to anyone who could find a bug in our Node.js runtime.
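For contrast, here is roughly what a sane policy for that workload looks like. This is a sketch, not our actual remediation PR; the bucket name is illustrative, and the point is simply that the role enumerates the handful of actions and ARNs it needs:
resource "aws_iam_policy" "order_processor_read" {
  # Least privilege: one bucket, read-only, nothing else. No "*", no iam:PassRole.
  name = "OrderProcessorReadOnly"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = ["s3:GetObject", "s3:ListBucket"]
        Resource = [
          "arn:aws:s3:::order-processor-assets", # illustrative bucket name
          "arn:aws:s3:::order-processor-assets/*"
        ]
      }
    ]
  })
}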
The Networking Nightmare: Transit Gateway and CIDR Collisions
The “architects” decided that VPC Peering was “too hard to manage.” They opted for a Transit Gateway (TGW) to connect our legacy VPC, our new VPC, and our on-prem data center.
The problem? They didn’t plan the CIDR blocks. They used 10.0.0.0/16 for everything. When you have overlapping CIDRs in a TGW environment, the routing table becomes a game of Russian Roulette.
We saw packets destined for the database (10.0.5.22) being routed back to the legacy VPN gateway because the TGW route table had a more specific prefix (10.0.5.0/24) pointing to the wrong attachment.
Here is the Terraform snippet that caused the loop:
resource "aws_ec2_transit_gateway_route" "loop_of_death" {
destination_cidr_block = "10.0.0.0/8" # Why? Just why?
transit_gateway_attachment_id = aws_ec2_transit_gateway_vpc_attachment.legacy.id
transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.main.id
}
By pointing a /8 summary route at the legacy attachment, they sent any internal traffic without a more specific match down the wrong path, which is exactly the loop the resource name confesses to. But the real financial pain came from the NAT Gateway.
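Before we get to the NAT bill, for the record: the eventual routing fix was re-addressing the VPCs onto non-overlapping ranges and replacing the /8 summary with explicit per-attachment routes. A rough sketch, with illustrative CIDRs and an assumed prod_v2 attachment resource:
resource "aws_ec2_transit_gateway_route" "legacy_vpc" {
  destination_cidr_block         = "10.10.0.0/16" # legacy VPC after re-addressing (illustrative)
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.legacy.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.main.id
}

resource "aws_ec2_transit_gateway_route" "prod_v2_vpc" {
  destination_cidr_block         = "10.20.0.0/16" # prod-v2 VPC after re-addressing (illustrative)
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.prod_v2.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.main.id
}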
The prod-v2 VPC was configured with all instances in private subnets. Standard stuff. However, they forgot to provision an S3 Gateway Endpoint. Every time an instance pulled a 2GB container image from S3, or uploaded a log file, that data traveled out through the NAT Gateway.
AWS charges $0.045 per GB for NAT Gateway data processing. That sounds small until you realize your logging agent is pushing 500GB of debug logs an hour because someone left LOG_LEVEL=trace on in production.
I caught this using the following AWS CLI command to inspect the CloudWatch metrics for the NAT Gateway:
aws cloudwatch get-metric-statistics \
--namespace AWS/NATGateway \
--metric-name BytesOutToDestination \
--dimensions Name=NatGatewayId,Value=nat-0a1b2c3d4e5f6g7h8 \
--start-time 2024-09-11T00:00:00Z \
--end-time 2024-09-11T23:59:59Z \
--period 3600 \
--statistics Sum \
--unit Bytes
The result showed terabytes per hour of data that should have stayed on the AWS internal network via a free S3 Gateway Endpoint. We were paying $45.00 per TB for the privilege of moving data three feet across the data center floor.
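The fix that should have existed from day one is a single, free resource. A minimal sketch, assuming the VPC and private route table are named roughly like this in the rest of the config:
resource "aws_vpc_endpoint" "s3" {
  # Gateway Endpoints for S3 cost nothing and keep S3 traffic off the NAT Gateway entirely.
  vpc_id            = aws_vpc.prod_v2.id
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}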
The Cost of Ignorance: gp2 Throttling and IOPS Debt
While the NAT Gateway was bleeding us dry, the database was dying a slow death. The team had provisioned 1TB gp2 volumes for the Postgres nodes. They thought “1TB is plenty of space.”
They didn’t understand the gp2 burst bucket model. On gp2, you get 3 IOPS per GB of baseline, and smaller volumes can burst to 3,000 IOPS by spending burst credits. A 1TB volume sits right at that ceiling: roughly 3,000 IOPS of baseline, a thin sliver of burst, and a hard cap at 3,000 once the credits are gone.
Our database was demanding 12,000 IOPS during the morning peak. For the first two hours, it held. Then the burst bucket hit zero. Latency went from 2ms to 200ms instantly. The application servers, waiting for the DB, started stacking threads. The health checks failed. The Auto Scaling Group (ASG) thought the instances were dead and terminated them.
The new instances started up, tried to pull the massive container images through the NAT Gateway (adding to the cost), and then immediately hit the same IOPS-throttled database. It was a circular dependency of failure.
We should have used gp3. With gp3, you get 3,000 IOPS baseline regardless of volume size, and you can scale IOPS and throughput independently.
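A minimal sketch of what the volume definition should have been. The size and performance figures are illustrative; the point is that on gp3 you buy IOPS and throughput directly instead of buying capacity to get performance:
resource "aws_ebs_volume" "postgres_data" {
  availability_zone = "us-east-1a"
  size              = 1000  # GiB; no longer tied to IOPS
  type              = "gp3"
  iops              = 12000 # gp3 supports up to 16,000 IOPS regardless of size
  throughput        = 500   # MiB/s, provisioned independently of IOPS and size
}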
I used this command to check the BurstBalance on the primary database volume, then repeated it for every volume we suspected of throttling:
aws cloudwatch get-metric-data \
--metric-data-queries '[{"Id":"m1","MetricStat":{"Metric":{"Namespace":"AWS/EBS","MetricName":"BurstBalance","Dimensions":[{"Name":"VolumeId","Value":"vol-0987654321fedcba"}]},"Period":300,"Stat":"Average"}}]' \
--start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ)
The BurstBalance was at 0%. We were running a production database on the performance equivalent of a 5,400 RPM laptop drive from 2004.
The State File Disaster: Terraform in the Dark
This is the part that still makes my hands shake. Terraform is a powerful tool, but in the hands of someone who treats it like a bash script, it’s a suicide machine.
The team had configured the S3 backend for Terraform like this:
terraform {
backend "s3" {
bucket = "company-tf-state-prod"
key = "network/terraform.tfstate"
region = "us-east-1"
# dynamodb_table = "terraform-lock" # COMMENTED OUT BECAUSE "IT WAS SLOW"
}
}
No DynamoDB table for state locking. No S3 bucket versioning. No MFA delete.
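For the record, here is roughly what that backend should have looked like. The lock table name is illustrative, and the versioning resource would live in whatever bootstrap configuration manages the state bucket itself:
terraform {
  backend "s3" {
    bucket         = "company-tf-state-prod"
    key            = "network/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-lock" # state locking; the extra milliseconds are not "slow"
  }
}

# In the bootstrap stack: never run a production backend on an unversioned bucket.
resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = "company-tf-state-prod"
  versioning_configuration {
    status = "Enabled"
  }
}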
On the second day of the outage, two engineers were trying to fix the NAT Gateway issue simultaneously. Engineer A ran a terraform apply to add the VPC Endpoint. Engineer B, unaware of A’s work, was manually editing the state file because of a “provider drift” issue.
Engineer B’s manual upload corrupted the JSON structure of the state file. When Engineer A’s apply finished, it attempted to write back to the corrupted file. The S3 object was overwritten with a 0-byte file.
Because versioning was disabled, the state of our entire infrastructure—450 resources, including the RDS instances, the TGW, and the IAM roles—was gone. Terraform now thought the world was empty.
The next time someone ran terraform plan, the output was a nightmare:
Plan: 450 to add, 0 to change, 0 to destroy.
If they had typed “yes,” Terraform would have tried to recreate resources that already existed, failing on “Name already in use” errors for the next six hours while the site stayed down.
I had to use the aws resourcegroupstaggingapi to try and map existing ARNs back to Terraform resource addresses. It was a manual, grueling process of terraform import commands that took 14 hours of straight work.
# One of 450 imports. I did this until my eyes bled.
terraform import aws_instance.web_server i-0123456789abcdef0
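In hindsight, since we were already on Terraform 1.7, declarative import blocks would have let us batch this and have Terraform draft the configuration for review; a sketch using the same illustrative instance:
# Run `terraform plan -generate-config-out=generated.tf` to review everything before applying,
# instead of hand-typing one CLI import at a time.
import {
  to = aws_instance.web_server
  id = "i-0123456789abcdef0"
}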
The S3 Leak That Wasn’t a Leak—Until It Was
In the middle of the recovery, we discovered that the “fix” for a permissions error earlier in the month was to make an S3 bucket public. The bucket contained “static assets.”
The problem is that “static assets” to a junior dev included config.json files that contained database credentials and API keys for our payment processor.
They didn’t use S3 Block Public Access at the account level. They didn’t use Bucket Policies to restrict access to the VPC. They just flipped the switch.
I found it using this:
aws s3api get-public-access-block --bucket company-secrets-prod --output json
The response was:
{
"PublicAccessBlockConfiguration": {
"BlockPublicAcls": false,
"IgnorePublicAcls": false,
"BlockPublicPolicy": false,
"RestrictPublicBuckets": false
}
}
This is a direct violation of every AWS best practice in the book. We had to rotate every single credential in the company. Every database password, every Stripe key, every SendGrid token. The operational overhead of rotating 200+ secrets while the network was already failing is why I have grey hair.
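The guardrail that prevents this entire class of mistake is one account-level resource; a minimal sketch:
resource "aws_s3_account_public_access_block" "account" {
  # With these four flags set, no individual bucket can be flipped public on a whim.
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}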
The Silent Killer: CloudWatch Logs and Retention Policies
The final insult to our bank account was the CloudWatch Logs bill. When you are in a “crisis,” everyone turns on “Debug” logging.
We had 400 microservices running in EKS. Each one was spitting out tens of megabytes of logs per minute. The team had set the retention policy to “Never Expire.”
CloudWatch Logs ingestion costs $0.50 per GB. Storage costs $0.03 per GB-month. By day three, we had ingested 80TB of logs. Most of those logs were just the same “Connection Timeout” error repeated billions of times.
I had to write a script to truncate the retention on every log group in the account because doing it through the console was too slow:
for group in $(aws logs describe-log-groups --query 'logGroups[*].logGroupName' --output text); do
echo "Setting retention for $group"
aws logs put-retention-policy --log-group-name "$group" --retention-in-days 7
done
We were paying for the storage of errors that occurred in a version of the software that didn’t even exist anymore. It was digital hoarding at a corporate scale.
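The durable fix was to stop letting log groups default to “Never Expire” at creation time. A minimal sketch of the pattern we now enforce (the group name and retention value are illustrative):
resource "aws_cloudwatch_log_group" "order_processor" {
  # Explicit retention on every log group; a week is plenty for debug noise.
  name              = "/eks/prod-v2/order-processor"
  retention_in_days = 7
}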
Conclusion: The Price of “Moving Fast”
We didn’t almost bankrupt the company because AWS is expensive. We almost bankrupted the company because we treated the cloud like someone else’s data center where the resources are infinite and the configuration doesn’t matter.
We ignored the AWS best practices for VPC design, opting for the “easy” path that led to a $50k NAT Gateway bill. We ignored IAM scoping, leading to a crypto-jacking incident that cost us $120k in compute. We ignored Terraform state management, which nearly made our infrastructure unrecoverable.
The “Cloud-Native” transition isn’t about moving your VMs to EC2. It’s about understanding the underlying mechanics of the platform. If you don’t understand how IOPS credits work, if you don’t understand how VPC Endpoints save you money, and if you don’t understand that IAM is your only real firewall, you aren’t “innovating.” You are just waiting for a PagerDuty alert that you can’t fix.
I’m going back to sleep. Don’t touch the terraform.tfstate file, or I’ll revoke your AssumeRole permissions before you can finish your next git push.