This is not a “retrospective.” It is not a “learning opportunity.” It is a post-mortem of a preventable disaster that cost this company six figures in lost revenue and cost me ten years of my life expectancy. If I see one more “move fast and break things” sticker on a laptop in this office, I am going to lose what little remains of my sanity.
We didn’t “break things.” We incinerated them. We took a decade of industry-standard reliability engineering and threw it into a woodchipper because someone thought Terraform modules were “too restrictive” and that “manual tweaks in the console” were faster for a Friday afternoon deployment.
The following is a technical breakdown of the Great Infrastructure Meltdown of 2023. Read it. Internalize it. Because if we don’t start following AWS best practices immediately, the next time the pager goes off at 3:00 AM, I’m not logging in; I’m deleting my Slack account and moving to a farm in Vermont.
H2: The 3:00 AM Wake-up Call: A Timeline of Failure
It started with a single 503 error. Then ten. Then ten thousand. My PagerDuty alert didn’t just beep; it screamed. By 03:05 UTC, the entire us-east-1 footprint was a graveyard of timed-out requests and “Connection Refused” errors.
Incident Timeline (All times UTC):
- 02:47:05: A junior developer, working on a “hotfix” for the legacy billing service, executes a local script using a compromised IAM credential that had full AdministratorAccess.
- 02:50:12: The script, intended to clear a “temp” bucket, instead targets the production S3 bucket containing our centralized Terraform state files and the primary application assets.
- 02:55:00: CloudWatch alarms trigger for the api-gateway-prod service. Latency spikes from 45ms to 12,000ms.
- 03:00:15: My pager goes off. I attempt to log into the AWS Console. I am met with an “Access Denied” error because the script also modified the IAM role I use for emergency access.
- 03:10:45: The Auto-Scaling Group (ASG) for the core microservices begins a “death spiral.” It detects unhealthy instances and attempts to replace them, but the launch templates are referencing AMI IDs that no longer exist in the registry.
- 03:22:30: The RDS Primary instance in us-east-1a suffers a storage failure. Because we were running a Single-AZ deployment to “save on data transfer costs,” there is no standby to fail over to.
- 04:00:00: Total blackout. Internal VPN is down. Public-facing API is down. The status page is down because it was hosted on the same infrastructure it was supposed to monitor.
The logs from the initial failure were a mess of AccessDenied and ResourceNotFoundException. Here is what I saw when I finally regained read-only access to the CloudTrail logs:
{
"eventVersion": "1.08",
"userIdentity": {
"type": "AssumedRole",
"principalId": "AROAEXAMPLE:dev-session-123",
"arn": "arn:aws:sts::123456789012:assumed-role/FullAdminDev/dev-session-123"
},
"eventTime": "2023-10-14T02:50:12Z",
"eventSource": "s3.amazonaws.com",
"eventName": "DeleteBucket",
"requestParameters": {
"bucketName": "prod-terraform-state-us-east-1"
},
"responseElements": null,
"userAgent": "aws-cli/2.13.5 Python/3.11.4 Linux/5.15.0-76-generic exe/x86_64.ubuntu.22",
"errorCode": "AccessDenied",
"errorMessage": "Access Denied"
}
Wait, look at that log. It says AccessDenied for the bucket deletion, but the script didn’t stop there. It proceeded to purge the objects within the bucket because the IAM policy was so poorly constructed that while it couldn’t delete the bucket itself, it had s3:DeleteObject on *.
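If you ever have to triage a pile of CloudTrail records like this one, a small filter helps. What follows is a minimal sketch, not the tooling we actually used: the prefix list and the function name summarize_destructive are my own assumptions, and it only reads fields that appear in the record above. Keep in mind that object-level calls such as DeleteObject are S3 data events, so they only show up in CloudTrail if data event logging was enabled for the bucket.

```python
# Illustrative prefixes only; extend to match the API calls you care about.
DESTRUCTIVE_PREFIXES = ("Delete", "Terminate", "Remove")


def summarize_destructive(records):
    """Return one summary row per destructive-looking CloudTrail record."""
    rows = []
    for record in records:
        name = record.get("eventName", "")
        if not name.startswith(DESTRUCTIVE_PREFIXES):
            continue
        rows.append({
            "time": record.get("eventTime"),
            "actor": record.get("userIdentity", {}).get("arn"),
            "event": name,
            # CloudTrail only sets errorCode when the call failed.
            "outcome": record.get("errorCode", "Succeeded"),
        })
    return rows


# The record from the log above, trimmed to the fields this sketch reads.
records = [{
    "eventTime": "2023-10-14T02:50:12Z",
    "eventName": "DeleteBucket",
    "userIdentity": {
        "arn": "arn:aws:sts::123456789012:assumed-role/FullAdminDev/dev-session-123"
    },
    "errorCode": "AccessDenied",
}]

for row in summarize_destructive(records):
    print(row["time"], row["event"], row["outcome"], row["actor"])
```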
H2: The IAM Policy That Ate the Production Environment
We talk about the “Principle of Least Privilege” like it’s some optional suggestion, like “floss every day” or “don’t eat raw cookie dough.” It’s not. It’s the only thing standing between a typo and a bankruptcy filing.
The “FullAdminDev” role used in the incident was a relic of the “we need to move fast” era. It was a JSON monstrosity that allowed anyone in the engineering org to do anything. We weren’t following AWS security best practices; we were running a digital Wild West.
Here is the policy that allowed the destruction of our state files:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:*",
"ec2:*",
"rds:*",
"iam:*"
],
"Resource": "*"
}
]
}
This is not a policy. This is a suicide note. By using Resource: "*", we gave a single compromised key the power to wipe out the entire VPC, delete the RDS snapshots, and—most cruelly—modify the very IAM roles we would need to fix the mess.
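A check like the following could have flagged that policy in code review. It is a rough sketch under my own assumptions: the function name find_wildcard_statements is hypothetical, and it only catches the bluntest case (an Allow statement combining a wildcard action with Resource "*"). It does not evaluate conditions, NotAction, or resource-level wildcards.

```python
import json


def find_wildcard_statements(policy: dict) -> list[dict]:
    """Return Allow statements that pair a wildcard action with Resource "*"."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM also accepts a single statement object
        statements = [statements]

    def as_list(value):
        return value if isinstance(value, list) else [value]

    risky = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = as_list(stmt.get("Action", []))
        resources = as_list(stmt.get("Resource", []))
        wildcard_actions = [a for a in actions if a == "*" or a.endswith(":*")]
        if wildcard_actions and "*" in resources:
            risky.append({"actions": wildcard_actions, "resources": resources})
    return risky


# The FullAdminDev policy shown above.
FULL_ADMIN_DEV = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:*", "ec2:*", "rds:*", "iam:*"],
      "Resource": "*"
    }
  ]
}
""")

findings = find_wildcard_statements(FULL_ADMIN_DEV)
for finding in findings:
    print("wildcard grant:", ", ".join(finding["actions"]), "on", finding["resources"])
```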
When the script ran, it didn’t just delete the Terraform state. It “drifted” the entire environment. When I tried to run a terraform plan from my local machine (using a backup state file I had to pull from a physical drive like it was 1998), the output was a horror show.
Terraform v1.5.7
on linux_amd64
Configuring remote state backend...
Initializing modules...
Terraform used the selected providers to generate the following execution plan.
Resource actions are indicated with the following symbols:
~ update in-place
- destroy
Terraform will perform the following actions:
# module.vpc.aws_vpc.main will be updated in-place
~ resource "aws_vpc" "main" {
id = "vpc-0a1b2c3d4e5f6g7h8"
~ enable_dns_support = true -> false
- tags = {
"Environment" = "Production"
"ManagedBy" = "Terraform"
} -> null
}
# module.rds.aws_db_instance.primary will be destroyed
# (because module.rds.aws_db_instance.primary is not in configuration)
- resource "aws_db_instance" "primary" {
- id = "prod-db-instance" -> null
}
Plan: 0 to add, 1 to change, 45 to destroy.
Forty-five resources to destroy. The state was gone, the tags were gone, and the infrastructure was “orphaned.” We had no way to know what was actually running versus what Terraform thought was running. This is what happens when you treat IAM as an afterthought.
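Nobody should be able to apply a plan like that without a human noticing. One way to enforce it is to inspect the machine-readable plan (the output of terraform show -json on a saved plan file) before apply. This is a sketch, assuming the documented resource_changes structure of that JSON; the function names, the zero-destroy default, and the two-entry sample plan are illustrative, not our real pipeline.

```python
def planned_destroys(plan: dict) -> list[str]:
    """Addresses of resources the plan would delete (including replacements)."""
    doomed = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        if "delete" in actions:
            doomed.append(change["address"])
    return doomed


def plan_is_safe(plan: dict, max_destroys: int = 0) -> bool:
    """True when the plan deletes no more than max_destroys resources."""
    return len(planned_destroys(plan)) <= max_destroys


# A hand-written stand-in for parsed `terraform show -json` output,
# using the two resource addresses from the plan above.
sample_plan = {
    "resource_changes": [
        {"address": "module.vpc.aws_vpc.main",
         "change": {"actions": ["update"]}},
        {"address": "module.rds.aws_db_instance.primary",
         "change": {"actions": ["delete"]}},
    ]
}

print("would destroy:", planned_destroys(sample_plan))
print("safe to apply:", plan_is_safe(sample_plan))
```

In a CI job, a False result would fail the build and force someone to approve the destroy explicitly.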
H2: Networking Spaghetti and the VPC Peering Nightmare
If the IAM failure was the spark, the networking configuration was the gasoline. For reasons that defy logic, our VPC was designed with a /16 CIDR block that overlapped with our legacy on-premises data center. To “fix” this, someone had implemented a series of VPC peering connections and static routes that looked like a bowl of spaghetti thrown against a wall.
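Overlap is cheap to detect before anyone allocates a VPC. The sketch below uses Python's standard ipaddress module. The VPC range 10.0.0.0/16 matches the local route in the route table output later in this post; the two on-premises ranges are made up for illustration, since I am not publishing our real data center addressing.

```python
import ipaddress


def overlapping_ranges(vpc_cidr: str, other_cidrs: list[str]) -> list[str]:
    """Return every CIDR in other_cidrs that overlaps vpc_cidr."""
    vpc = ipaddress.ip_network(vpc_cidr)
    return [c for c in other_cidrs if vpc.overlaps(ipaddress.ip_network(c))]


VPC_CIDR = "10.0.0.0/16"
# Hypothetical on-premises ranges, for illustration only.
ON_PREM_CIDRS = ["10.0.128.0/17", "10.1.0.0/16"]

print(overlapping_ranges(VPC_CIDR, ON_PREM_CIDRS))
```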
During the meltdown, the routing tables were corrupted. The “clever” script had iterated through all route tables and removed any route it didn’t recognize. Because our routing tables were modified manually over the last two years and never checked back into Git, the script saw them as “drift” and nuked them.
I spent two hours trying to figure out why I couldn’t SSH into the bastion host. The reason? The route to the Internet Gateway (IGW) was gone.
$ aws ec2 describe-route-tables --route-table-ids rtb-0492834092834 --profile prod
{
"RouteTables": [
{
"Associations": [],
"PropagatingVgws": [],
"RouteTableId": "rtb-0492834092834",
"Routes": [
{
"DestinationCidrBlock": "10.0.0.0/16",
"GatewayId": "local",
"Origin": "CreateRouteTable",
"State": "active"
}
],
"Tags": [],
"VpcId": "vpc-0a1b2c3d4e5f6g7h8"
}
]
}
Notice anything missing? The 0.0.0.0/0 route to the IGW is gone. The subnet was effectively blackholed. We had no ingress, no egress, and no hope. We were running in a dark room with no doors.
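I found this by squinting at JSON at 4:00 AM. A script can squint for you. Here is a minimal sketch that takes one entry from the RouteTables list above and reports whether it still has an active default route to an Internet Gateway. The function name and the igw-0example ID are placeholders of mine.

```python
def has_default_igw_route(route_table: dict) -> bool:
    """True if the table has an active 0.0.0.0/0 route via an Internet Gateway."""
    for route in route_table.get("Routes", []):
        if (
            route.get("DestinationCidrBlock") == "0.0.0.0/0"
            and route.get("GatewayId", "").startswith("igw-")
            and route.get("State") == "active"
        ):
            return True
    return False


# The route table from the output above: only the local route survived.
broken_table = {
    "RouteTableId": "rtb-0492834092834",
    "Routes": [
        {
            "DestinationCidrBlock": "10.0.0.0/16",
            "GatewayId": "local",
            "Origin": "CreateRouteTable",
            "State": "active",
        }
    ],
}

# What a healthy public route table would contain (placeholder gateway ID).
healthy_table = {
    "Routes": broken_table["Routes"] + [
        {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0example", "State": "active"}
    ]
}

print(has_default_igw_route(broken_table))
print(has_default_igw_route(healthy_table))
```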
The IP exhaustion was the final nail. Because we hadn’t planned our subnets correctly, the “death spiral” of the ASG (which I’ll get to in a moment) consumed every available IP in the private subnets. New instances couldn’t start because there were no addresses left in the pool. We were hit by a cascading failure where the networking layer was actively preventing the compute layer from recovering.
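The arithmetic behind IP exhaustion is not complicated, which makes it worse. AWS reserves five addresses in every subnet (the first four and the last one), so the usable pool is smaller than the CIDR suggests. The sketch below computes it; the subnet sizes are examples I picked, not our actual layout.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # first four addresses plus the last one


def usable_addresses(subnet_cidr: str) -> int:
    """Number of addresses AWS will actually hand out in a subnet."""
    total = ipaddress.ip_network(subnet_cidr).num_addresses
    return max(total - AWS_RESERVED_PER_SUBNET, 0)


# Example subnet sizes, for illustration only.
for cidr in ("10.0.1.0/24", "10.0.2.0/26", "10.0.3.0/28"):
    print(cidr, usable_addresses(cidr))
```

An ASG that launches replacements faster than terminated instances release their addresses can burn through a small subnet quickly, which is how the networking layer ended up blocking the compute layer from recovering.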
H2: Why Your Auto-Scaling Group is a Lie
Everyone loves Auto-Scaling Groups until they actually have to scale. Our ASG configuration was a masterclass in how not to build resilient systems.
First, the health checks. We were using EC2 health checks instead of ELB health checks. For the uninitiated, an EC2 health check only cares whether the instance is running and passing its EC2 status checks. It doesn’t care if the Java application inside is stuck in a garbage collection loop or if the disk is 100% full. The instances were “healthy” according to AWS, but they were returning 500 errors to every single user.
When we finally switched to ELB health checks mid-crisis, the ASG realized every single instance was failing. It did what it was programmed to do: it terminated all of them at once.
This is called “flapping.” The ASG kills an instance, starts a new one, the new one fails the health check because the database is down, the ASG kills it again. Repeat until your AWS bill is the size of a small nation’s GDP.
I ran this command to see the carnage:
$ aws autoscaling describe-scaling-activities --auto-scaling-group-name prod-api-asg --max-items 5
{
"Activities": [
{
"ActivityId": "82374-2342-234-234234",
"AutoScalingGroupName": "prod-api-asg",
"Cause": "At 2023-10-14T03:45:12Z an instance was started in response to a difference between desired and actual capacity.",
"StartTime": "2023-10-14T03:45:12.123Z",
"EndTime": "2023-10-14T03:46:01.000Z",
"StatusCode": "Failed",
"StatusMessage": "Instance i-0abcd1234efgh5678 was terminated because it failed ELB health checks."
}
]
}
The “Cause” was always the same. The instances couldn’t connect to the RDS instance because the RDS instance was in a different circle of hell. But the ASG didn’t know that. It just kept throwing wood into the fire, hoping the fire would eventually put itself out.
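A loop like this is detectable from the scaling activity history alone. Below is a rough sketch that flags flapping when several launches fail inside a short window. The threshold, the window, and the second and third sample activities are my own assumptions; only the first timestamp comes from the output above.

```python
from datetime import datetime, timedelta


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing Z for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_flapping(activities, window=timedelta(minutes=10), threshold=3) -> bool:
    """True if at least `threshold` failed activities start within `window`."""
    failures = sorted(
        parse_timestamp(a["StartTime"])
        for a in activities
        if a.get("StatusCode") == "Failed"
    )
    for i, start in enumerate(failures):
        in_window = [t for t in failures[i:] if t - start <= window]
        if len(in_window) >= threshold:
            return True
    return False


activities = [
    {"StartTime": "2023-10-14T03:45:12.123Z", "StatusCode": "Failed"},
    # The next two entries are invented to illustrate a repeating failure.
    {"StartTime": "2023-10-14T03:47:30.000Z", "StatusCode": "Failed"},
    {"StartTime": "2023-10-14T03:50:02.000Z", "StatusCode": "Failed"},
]

print(is_flapping(activities))
```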
We also had no “Cooldown” period defined. As soon as one instance died, another was spun up. We hit our account limits for EC2 instances within twenty minutes. When I finally tried to manually scale up a fleet of “safe” instances, I was met with: You have reached your quota of Max Instances.
We were locked out of our own account by our own incompetence.
H2: The Database Deadlock: RDS and the Missing Multi-AZ
Now we get to the heart of the data loss. Someone—and I have the Jira ticket saved for my legal defense—decided that we didn’t need Multi-AZ for the production RDS instance. “It’s twice the price,” they said. “The SLA is good enough,” they said.
The SLA is a refund, not a time machine. When the underlying hardware in us-east-1a failed at 03:22 UTC, the database went offline. In a Multi-AZ setup, AWS would have detected this and flipped the DNS record to the standby in us-east-1b, typically within one to two minutes.
Instead, we had a “Single-AZ” instance that was now a brick.
Because the script earlier had also “cleaned up” old snapshots to save on storage costs (again, “cost-saving” will be the death of this company), our latest usable snapshot was from 24 hours ago. We didn’t just lose uptime; we lost a full day of customer transactions.
I had to manually initiate a restore from a snapshot. Do you know how long it takes to restore a 2TB Postgres database from an S3 snapshot during a regional brownout? It takes four hours and twelve minutes. Four hours of sitting in a Zoom room with executives who are asking “Is it done yet?” every thirty seconds.
The AWS best practice here isn’t just “turn on Multi-AZ.” It’s “implement a multi-region failover strategy with automated point-in-time recovery (PITR).” We had neither. We had a single point of failure that we had intentionally weakened to save $400 a month.
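For anyone who still thinks $400 a month was a bargain, here is the arithmetic. It assumes the smallest possible "six figures" ($100,000) as the loss, which is a floor, not our actual number.

```python
MONTHLY_SAVINGS_USD = 400          # from the Single-AZ "cost saving"
MIN_SIX_FIGURE_LOSS_USD = 100_000  # assumed lower bound for "six figures"

months_to_break_even = MIN_SIX_FIGURE_LOSS_USD / MONTHLY_SAVINGS_USD
years_to_break_even = months_to_break_even / 12

print(f"{months_to_break_even:.0f} months (~{years_to_break_even:.1f} years)")
```

That is 250 months, or roughly 21 years of "savings," to pay for one bad night.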
While the restore was running, I checked the logs for the RDS instance. The IOPS were flatlined. We were using gp2 volumes instead of gp3 or io2, and we had exhausted our burst balance. The database wasn’t just down; it was suffocating.
$ aws rds describe-db-instances --db-instance-identifier prod-db-primary
{
"DBInstances": [
{
"DBInstanceIdentifier": "prod-db-primary",
"DBInstanceStatus": "failed",
"Engine": "postgres",
"EngineVersion": "15.3",
"MultiAZ": false,
"StorageType": "gp2",
"AllocatedStorage": 2000,
"PendingModifiedValues": {}
}
]
}
The status “failed” is a very lonely word to see at 4:00 AM.
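Every problem in that output was visible long before the outage. A trivial audit over the same describe-db-instances JSON would have listed them. The following is a sketch with hypothetical function and message names; it checks only the three fields discussed in this section.

```python
def rds_findings(instance: dict) -> list[str]:
    """List resilience problems visible in one DBInstances entry."""
    findings = []
    status = instance.get("DBInstanceStatus")
    if status != "available":
        findings.append(f"status is {status!r}, expected 'available'")
    if not instance.get("MultiAZ", False):
        findings.append("MultiAZ is disabled: no standby to fail over to")
    if instance.get("StorageType") == "gp2":
        findings.append("StorageType is gp2: consider gp3 or io2")
    return findings


# The instance from the output above.
prod_db = {
    "DBInstanceIdentifier": "prod-db-primary",
    "DBInstanceStatus": "failed",
    "Engine": "postgres",
    "EngineVersion": "15.3",
    "MultiAZ": False,
    "StorageType": "gp2",
    "AllocatedStorage": 2000,
}

for finding in rds_findings(prod_db):
    print(f"{prod_db['DBInstanceIdentifier']}: {finding}")
```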
H2: Hard Lessons and the Path to Infrastructure Nihilism
We are currently back online, but we are running on “hope.” The infrastructure is a patchwork of manual fixes, temporary security groups, and a Terraform state file that I had to manually reconstruct using terraform import for over three hundred resources.
I am tired. I am cynical. And I am finished with “clever” solutions.
If we are to survive as an engineering organization, we must stop treating our infrastructure like a playground. We are not “embarking” on a journey; we are digging ourselves out of a hole.
Here are my non-negotiable demands:
- Total Terraform Rewrite: Every single resource must be defined in Terraform v1.5.7 or higher. No more manual changes. If it’s not in Git, it doesn’t exist. We will use prevent_destroy on all critical resources, including RDS and S3 state buckets.
- IAM Overhaul: We will implement Service Control Policies (SCPs) at the AWS Organizations level to prevent anyone, including me, from deleting core infrastructure. We will move to short-lived credentials using AWS IAM Identity Center. No more long-lived access keys on developer laptops.
- Observability as a Requirement: If a service doesn’t have a structured JSON log output, a health check endpoint that actually checks dependencies, and a CloudWatch dashboard, it doesn’t get deployed.
- Redundancy is Not Optional: Multi-AZ is the bare minimum. We will begin testing cross-region failover for our core API. If the business thinks it’s too expensive, they can calculate the cost of 8 hours of total downtime and get back to me.
- Blast Radius Reduction: We will split our single “Production” account into multiple accounts using AWS Control Tower. Billing, Logging, Security, and Application workloads will live in separate sandboxes. If a dev key is compromised in the future, it should only be able to burn down a small shed, not the entire skyscraper.
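On the observability demand: "a health check endpoint that actually checks dependencies" can be very little code. The sketch below shows only the decision logic, with no web framework attached; the function name, the check names, and the simulated database failure are all assumptions for illustration.

```python
from typing import Callable


def health_status(checks: dict[str, Callable[[], bool]]) -> tuple[int, dict[str, str]]:
    """Run each dependency check and map the results to an HTTP status code."""
    results = {}
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            # A check that raises is treated the same as one that returns False.
            ok = False
        results[name] = "ok" if ok else "failing"
    status_code = 200 if all(v == "ok" for v in results.values()) else 503
    return status_code, results


def database_reachable() -> bool:
    # Stand-in for a real connectivity probe.
    raise ConnectionError("simulated: database is down")


code, detail = health_status({"database": database_reachable, "cache": lambda: True})
print(code, detail)
```

One caveat we learned the hard way: if the load balancer uses a check like this to decide which instances the ASG should replace, a shared dependency outage marks every instance unhealthy at once. It may be safer to expose the deep check for dashboards and alerting and keep the load balancer on a shallower one.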
The “Great Infrastructure Meltdown of 2023” was not an act of God. It was not a “glitch” in AWS. It was a failure of engineering discipline. We prioritized speed over stability, and we paid the price in reputation and stress.
I am going to sleep now. When I wake up, I expect to see a pull request for the new IAM permission boundaries. If I see a Resource: "*" in that PR, I’m retiring to Vermont to milk cows. Cows don’t have APIs. Cows don’t have overlapping CIDR blocks. And cows certainly don’t page you at 3:00 AM because someone deleted the state file.
Fix it. All of it. Now.