Master AWS Best Practices: Optimize Your Cloud Performance

INTERNAL DOCUMENT: POST-MORTEM REPORT – PROJECT “SILVER LINING” (MIGRATION FAILURE)
FROM: Senior Systems Architect (Infrastructure & Physical Security)
TO: The C-Suite and the “Cloud Native” Evangelists who broke the bank.
DATE: 2024-05-22
SUBJECT: Why we are broke and why my pager didn’t stop buzzing for 72 hours.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowDevsToBreakEverything",
            "Effect": "Allow",
            "Principal": "*",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::prod-customer-data-sensitive/*",
                "arn:aws:s3:::prod-customer-data-sensitive"
            ],
            "Condition": {
                "StringLike": {
                    "aws:Referer": [
                        "http://localhost:3000",
                        "*"
                    ]
                }
            }
        }
    ]
}

Look at that. Look at the JSON above. That was found in our production environment three days ago. Some “Full Stack Architect” thought that adding a wildcard to the Referer condition while allowing Principal: "*" was a “quick fix” for a CORS issue during a late-night deployment. This is what happens when you trade physical firewalls and air-gapped subnets for a web console that looks like a toy store. You wanted “agility.” Well, you got it. You’re agile enough to leap right off a cliff and take the company’s valuation with you.
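If you want to catch this anti-pattern before it reaches production, you don't need a managed service, you need twenty lines of review-time code. Here's a sketch of a check that flags exactly the combination above: a wildcard Principal paired with a Referer condition (the Referer header is attacker-controlled, so it is never a security boundary). The function name and policy dict are illustrative, not part of any real tooling.

```python
# Hypothetical review-time check: flag Allow statements that pair a wildcard
# Principal with an aws:Referer condition. Referer is client-supplied and
# trivially spoofable, so "protecting" an open bucket with it protects nothing.

def risky_statements(policy: dict) -> list[str]:
    flagged = []
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        open_principal = stmt.get("Principal") == "*"
        referers = (stmt.get("Condition", {})
                        .get("StringLike", {})
                        .get("aws:Referer", []))
        if isinstance(referers, str):
            referers = [referers]
        if open_principal and referers:
            flagged.append(stmt.get("Sid", "<no Sid>"))
    return flagged

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowDevsToBreakEverything",
        "Effect": "Allow",
        "Principal": "*",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
        "Resource": ["arn:aws:s3:::prod-customer-data-sensitive/*"],
        "Condition": {"StringLike": {"aws:Referer": ["http://localhost:3000", "*"]}},
    }],
}

print(risky_statements(policy))  # ['AllowDevsToBreakEverything']
```

Run that in CI against every bucket policy in the repo and the "quick fix" never ships.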

I’ve spent 25 years in climate-controlled rooms where I could actually touch the hardware. I knew where the packets went because I laid the CAT6 myself. Now? Now I’m chasing ghosts in “Availability Zones” that are just warehouses in Northern Virginia that I’m not allowed to enter. The cloud isn’t a revolution; it’s a landlord-tenant dispute where the landlord raises the rent every time you turn on a lightbulb.

TICKET-404-WHERE-DID-THE-MONEY-GO

We blew the quarterly budget in twenty-one days. Twenty-one. I remember when a $50,000 CAPEX request for a new SAN would get me grilled for three hours by the CFO. Now, a junior dev with a credit card and a lack of sleep can spin up a p4d.24xlarge instance because they wanted to “test some ML models” and forgot to shut it down over the weekend.

For those of you who don’t read the billing console—because it’s designed to be as confusing as a tax code—a p4d.24xlarge costs roughly $32.77 per hour. That’s $786 a day. For one instance. We found six of them. They were zombie instances, sitting there, idling at 1% CPU utilization, burning money just to keep the NVIDIA A100s warm.
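The arithmetic is simple enough that you don't need Cost Explorer, just a calculator. Using the $32.77/hour on-demand rate quoted above (list price varies by region), the burn rate for six forgotten instances works out like this:

```python
# Back-of-envelope burn rate for the zombie GPU fleet, using the $32.77/hour
# on-demand rate quoted above (region-dependent list price).
HOURLY_RATE = 32.77
INSTANCES = 6

daily_per_instance = HOURLY_RATE * 24
daily_fleet = daily_per_instance * INSTANCES
weekend = daily_fleet * 3  # Friday night through Monday morning

print(f"${daily_per_instance:,.2f}/day per instance")  # $786.48/day per instance
print(f"${daily_fleet:,.2f}/day for all six")          # $4,718.88/day for all six
print(f"${weekend:,.2f} for one forgotten weekend")    # $14,156.64 for one forgotten weekend
```

Fourteen grand for a weekend of idle A100s. That used to be a whole server.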

# AWS CLI v2.15.30 - Hunting for the gold-plated paperweights
aws ec2 describe-instances \
    --filters "Name=instance-type,Values=p4d.24xlarge" \
    --query "Reservations[*].Instances[*].{ID:InstanceId, LaunchTime:LaunchTime, State:State.Name}" \
    --output table

---------------------------------------------------------------------------
|                            DescribeInstances                            |
+----------------------+---------------------------+----------------------+
|          ID          |        LaunchTime         |        State         |
+----------------------+---------------------------+----------------------+
|  i-0abcd1234efgh5678 |  2024-05-10T03:14:22+00:00|  running             |
|  i-09876fedcba54321  |  2024-05-10T03:15:45+00:00|  running             |
|  i-0123456789abcdef  |  2024-05-10T04:00:12+00:00|  running             |
+----------------------+---------------------------+----------------------+

When I asked why we needed 400Gbps networking for a CRUD app that serves maybe fifty concurrent users, I was told it was for “future-proofing.” In my day, future-proofing meant buying a chassis with two extra blade slots. In the cloud, it means paying for a Ferrari to drive to the mailbox.

And don’t get me started on the egress fees. We moved 40TB of legacy logs from S3 to an on-prem archival server because someone finally realized that keeping “debug_log_final_v2_OLD.txt” in Standard Tier storage was costing us $900 a month. The bill for just moving that data out? $3,600. AWS charges you to leave the party. It’s a digital Hotel California. You can check out any time you like, but your data is held for ransom by the byte.
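For the skeptics in Finance, here's the exit-fee math, assuming a flat $0.09/GB internet data-transfer-out rate and S3 Standard at roughly $0.023/GB-month (real pricing is tiered and region-dependent, so treat these as round numbers):

```python
# Rough egress math for pulling 40 TB of logs out of S3 to on-prem, assuming
# a flat $0.09/GB transfer-out rate (actual pricing is tiered by volume).
TB = 1000  # decimal terabytes, matching AWS's GB-based billing
gb_moved = 40 * TB

egress_cost = gb_moved * 0.09        # one-time fee to leave
monthly_storage = gb_moved * 0.023   # S3 Standard, ~$0.023/GB-month

print(f"One-time exit fee: ${egress_cost:,.2f}")             # One-time exit fee: $3,600.00
print(f"Monthly Standard storage: ${monthly_storage:,.2f}")  # Monthly Standard storage: $920.00
```

So yes, leaving cost four months of rent up front. That's the ransom.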

#ops-nightmares-vpc-peering-hell

People talk about AWS best practices like they’re some holy scripture, but they usually just ignore the basics of CIDR planning and wonder why their packets are disappearing into the ether. We have seventeen VPCs. Why? Because every time a new team starts a project, they create a VPC and copy-paste the default VPC’s 172.31.0.0/16 range.

Do you know what happens when you try to peer two VPCs with overlapping CIDR blocks? Nothing. AWS rejects the peering request outright, no traffic ever flows, and you spend four hours debugging a routing table only to realize you’ve built a logical paradox.
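You can see the paradox without burning a single AWS dollar. The Python standard library will demonstrate it in five lines (the third range is a made-up example of what planning ahead looks like):

```python
# Two VPCs that both grabbed 172.31.0.0/16 can never be peered: the address
# ranges collide, so there is no unambiguous route between them.
from ipaddress import ip_network

vpc_a = ip_network("172.31.0.0/16")
vpc_b = ip_network("172.31.0.0/16")  # the copy-pasted default
vpc_c = ip_network("10.42.0.0/16")   # a range somebody actually planned

print(vpc_a.overlaps(vpc_b))  # True  -> peering request will fail
print(vpc_a.overlaps(vpc_c))  # False -> this pair can be peered
```

Thirty seconds of this before clicking "Create VPC" would have saved us the Transit Gateway entirely.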

Then came the “solution”: Transit Gateway.

# Terraform 1.7.4 - The "Solution" that costs $36/day just to exist
resource "aws_ec2_transit_gateway" "main" {
  description = "The expensive hub for our overlapping mess"
  amazon_side_asn = 64512
  auto_accept_shared_attachments = "enable"
  default_route_table_association = "enable"

  tags = {
    Name = "Money-Pit-Gateway"
    Environment = "Production"
  }
}

resource "aws_ec2_transit_gateway_vpc_attachment" "attachment" {
  subnet_ids         = [aws_subnet.private_a.id, aws_subnet.private_b.id]
  transit_gateway_id = aws_ec2_transit_gateway.main.id
  vpc_id             = aws_vpc.app_vpc.id
}

Transit Gateway is great if you love paying $0.05 per hour per attachment, plus data processing fees. We’re paying for the privilege of routing our own internal traffic. On-prem, I had a Cisco Nexus 9k. I bought it once. It stayed in the rack. It didn’t charge me every time a packet went from VLAN 10 to VLAN 20. In the cloud, every hop is a micro-transaction. It’s like the entire infrastructure was designed by someone who used to make mobile games with “energy systems.”
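Here's what "just existing" costs, using us-east-1-style list prices at the time of writing ($0.05/hour per attachment, $0.02/GB processed); the traffic figure is a hypothetical, because nobody here actually measured it:

```python
# What the Transit Gateway costs before it does anything useful, assuming
# $0.05/hour per VPC attachment and $0.02/GB data processing (list prices
# vary by region; the traffic volume below is a guess, not a measurement).
ATTACHMENT_HOURLY = 0.05
DATA_PROCESSING_PER_GB = 0.02
attachments = 17       # one per VPC in our sprawl
gb_per_month = 10_000  # hypothetical internal east-west traffic

attachment_monthly = attachments * ATTACHMENT_HOURLY * 24 * 30
data_monthly = gb_per_month * DATA_PROCESSING_PER_GB

print(f"Attachments: ${attachment_monthly:,.2f}/month")  # Attachments: $612.00/month
print(f"Data processing: ${data_monthly:,.2f}/month")    # Data processing: $200.00/month
```

Over $800 a month to route traffic between networks we own, through hardware we can't see.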

The latency? Don’t even get me started. We’re seeing 15ms spikes because someone decided to put the database in us-east-1a and the application servers in us-east-1b to “ensure high availability.” Great, now every DB query has to traverse the inter-AZ fiber, and we’re getting billed for “Inter-AZ Data Transfer.” We are literally paying for the speed of light.

TICKET-882-S3-LEAK-AND-THE-FALSE-SENSE-OF-SECURITY

S3 Block Public Access should not be a software setting. It should be a physical, red, locking toggle switch on a wall in the data center. If the switch is UP, the data stays inside. If the switch is DOWN, you’re fired.

Instead, we have a “comprehensive” (sorry, I meant “over-complicated”) set of layers: IAM policies, Bucket policies, Access Control Lists (ACLs), and then the Account-Level Block Public Access. It’s four locks on a door that’s made of glass.

The breach we had last week wasn’t a “sophisticated state-sponsored attack.” It was a dev who wanted to see if an image rendered correctly in a browser. They disabled “Block Public Access” because they didn’t want to deal with Pre-signed URLs. They thought, “I’ll just turn it off for five minutes.”

Five minutes is all it takes for a crawler to find an open bucket.

# Checking for public buckets before the auditors do
aws s3api get-public-access-block --bucket prod-customer-data-sensitive

{
    "PublicAccessBlockConfiguration": {
        "BlockPublicAcls": false,
        "IgnorePublicAcls": false,
        "BlockPublicPolicy": false,
        "RestrictPublicBuckets": false
    }
}
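This is the audit I now run in a loop. A sketch, not a product: it takes a dict in the same shape the CLI prints above and names every guardrail that has been switched off.

```python
# Evaluate a PublicAccessBlockConfiguration (the same JSON shape the CLI
# prints above) and list every guardrail that is disabled or missing.
GUARDRAILS = ("BlockPublicAcls", "IgnorePublicAcls",
              "BlockPublicPolicy", "RestrictPublicBuckets")

def disabled_guardrails(config: dict) -> list[str]:
    pab = config.get("PublicAccessBlockConfiguration", {})
    return [g for g in GUARDRAILS if not pab.get(g, False)]

cli_output = {
    "PublicAccessBlockConfiguration": {
        "BlockPublicAcls": False,
        "IgnorePublicAcls": False,
        "BlockPublicPolicy": False,
        "RestrictPublicBuckets": False,
    }
}

print(disabled_guardrails(cli_output))
# ['BlockPublicAcls', 'IgnorePublicAcls', 'BlockPublicPolicy', 'RestrictPublicBuckets']
```

Four out of four disabled. That's the glass door with all four locks removed.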

When I saw that output, I didn’t even get angry. I just felt a profound sense of exhaustion. We have Macie running, right? That’s what the brochure said. “Macie uses machine learning to protect your data.”

Do you know what Macie actually did? It generated a 400-page report telling us that we have “Sensitive Data” in our “Sensitive Data Bucket.” Thank you, Macie. That’ll be $2,000 for the discovery job. It’s like hiring a security guard who watches a thief walk out with your TV and then sends you a Slack message three hours later saying, “Hey, I noticed a high probability of TV-shaped objects leaving the premises.”

THE-IOPS-THROTTLING-SILENT-KILLER

In the old world, if my disk was slow, I checked the controller. I checked the cables. I looked at the actual spinning rust or the flash cells. In the cloud, your disk performance is a “credit balance.”

We had a production outage on Tuesday. Why? Because the gp2 volumes on our database ran out of “Burst Credits.” The application didn’t crash; it just slowed down to a crawl. 100 IOPS. Do you know what 100 IOPS feels like in 2024? It feels like trying to run a marathon through a vat of cold molasses.

# Checking the volume status while the site is down
aws ec2 describe-volumes --volume-ids vol-0123456789abcdef0 --query "Volumes[*].Iops"
[
    100
]
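For anyone who still doesn't believe the disk is a credit balance, here's the documented gp2 arithmetic: baseline is 3 IOPS per GiB with a 100 IOPS floor, and volumes under 1 TiB can burst to 3,000 IOPS from a bucket of 5.4 million I/O credits. The 33 GiB volume size is an illustration of a volume sitting at the floor, not our actual database.

```python
# gp2 burst-credit math (documented gp2 behavior: baseline 3 IOPS/GiB with a
# 100 IOPS floor; volumes under 1 TiB burst to 3,000 IOPS from a 5.4M bucket).
CREDIT_BUCKET = 5_400_000
BURST_IOPS = 3_000

def gp2_baseline(size_gib: int) -> int:
    return max(100, 3 * size_gib)

def burst_minutes(size_gib: int) -> float:
    # How long a full credit bucket lasts at a sustained 3,000 IOPS:
    # credits drain at the burst rate minus the baseline refill rate.
    drain_rate = BURST_IOPS - gp2_baseline(size_gib)
    return CREDIT_BUCKET / drain_rate / 60

size = 33  # GiB: small enough to sit at the 100 IOPS floor
print(gp2_baseline(size))          # 100
print(round(burst_minutes(size)))  # 31
```

Thirty-one minutes of real performance, then back to 100 IOPS until the bucket refills. That's the "burst" they sold us.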

The fix, according to the “Cloud Architects,” was to migrate to gp3 and provision the IOPS. More money. Always more money. We’re paying for “Provisioned IOPS” now, which means we’re paying for performance that we might not even use, just so we don’t get throttled during a cron job. It’s a protection racket. “Nice database you got there. Shame if its throughput dropped to 1990s levels during your peak sales window.”

And let’s talk about cold storage. We moved our backups to Glacier Deep Archive. $0.00099 per GB. Sounds great on a spreadsheet. But have you tried to actually restore from it? It takes 12 hours just to “rehydrate” the data. If our primary site goes down, the business is dead for half a day while we wait for AWS to find our virtual tapes in their virtual basement. On-prem, I could have a tape in the drive and data flowing in ten minutes. Here, I’m at the mercy of a “retrieval tier.”

ROUTE53-IS-NOT-A-LOAD-BALANCER

I’ve had enough of people treating DNS like it’s a global traffic manager. Route53 is a fine DNS service, but the way we’ve implemented it is a disaster. We have “Health Checks” that cost $0.50 per endpoint per month. We have 500 microservices. Do the math. We are paying $250 a month just to have AWS ask our servers “Are you alive?” every 30 seconds.
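Since I was told to "do the math," fine, here it is, at the $0.50/month rate for health checks of AWS endpoints (checks against non-AWS endpoints cost more):

```python
# Route53 health-check line item: $0.50 per AWS-endpoint health check per
# month (non-AWS endpoints are billed at a higher rate).
PER_CHECK_MONTHLY = 0.50
endpoints = 500  # one health check per microservice

monthly = endpoints * PER_CHECK_MONTHLY
print(f"${monthly:,.2f}/month just to ask 'are you alive?'")
# $250.00/month just to ask 'are you alive?'
```

Three grand a year for a distributed ping.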

And the blast radius of a Route53 mistake is terrifying. Last month, someone updated a weighted routing policy and accidentally sent 100% of our traffic to a “Coming Soon” bucket in us-west-2. Because of TTL (Time To Live) settings, that mistake stayed “live” for an hour after we fixed it. On-prem, I could clear the cache on the local resolvers. In the cloud, I just have to sit there and watch the “404 Not Found” errors spike on the dashboard while I contemplate my life choices.

# The Terraform change that killed the weekend
resource "aws_route53_record" "www" {
  zone_id = aws_route53_zone.primary.zone_id
  name    = "api.company.com"
  type    = "A"

  weighted_routing_policy {
    weight = 0 # Someone thought this meant "Primary"
  }
  set_identifier = "primary"
  alias {
    name                   = aws_lb.prod_lb.dns_name
    zone_id                = aws_lb.prod_lb.zone_id
    evaluate_target_health = true
  }
}

The complexity is the point. The more complex it is, the more “managed services” they can sell you to fix the complexity they created. It’s a self-licking ice cream cone.

THE-TERRAFORM-STATE-OF-DESPAIR

We were told that Infrastructure as Code would make everything “repeatable” and “safe.” Instead, it has just made it possible to delete the entire staging environment with a single terraform apply.

I spent four hours yesterday fixing a corrupted state file because two people tried to run a plan at the same time and the DynamoDB state lock failed. We’re using Terraform 1.7.x, and while the removed blocks are nice, they don’t fix the fundamental problem: we are building a house of cards out of text files.

# Terraform 1.7.x - Trying to fix the mess without destroying the world
removed {
  from = aws_instance.unnecessary_gpu_beast

  lifecycle {
    destroy = false # PLEASE DO NOT ACTUALLY DELETE THE DATA
  }
}

Every time I run a terraform plan, my heart rate goes up to 110. I’m looking at 45 resources to be changed, 12 to be destroyed, and 3 to be added. Why is it destroying the database? “Oh, because you changed the name of the subnet, and that forces a replacement.”

A replacement. In the real world, if I rename a room, the furniture doesn’t spontaneously combust. In the cloud, if you change a tag or a name, sometimes the entire resource is vaporized and recreated. If you don’t have your “Deletion Protection” flags set—and let’s be honest, the devs never do—you’re one keystroke away from a resume-generating event.


RESIGNATION LETTER (DRAFT)

To: Human Resources / VP of Engineering
From: The Guy Who Remembers What a Serial Cable Looks Like
Subject: Moving to a cabin in the woods (where there is no Wi-Fi)

Effective two weeks from today, I am resigning from my position as Senior Systems Architect.

I’ve spent the last six months watching this company set fire to its capital in the name of “Cloud Transformation.” I have tried to explain that “Serverless” still uses servers, that “Elasticity” is just a fancy word for “Variable Billing,” and that “The Cloud” is just a way to outsource your competence to a third party that doesn’t care if your business fails.

Lessons Learned (The Hard Way):

  1. The Bill is the Only Metric: We stopped caring about uptime and started caring about “Cost Optimization.” When your engineers spend 40% of their time looking at AWS Cost Explorer instead of writing code, you aren’t a tech company anymore; you’re an accounting firm with a hobby.
  2. Hardware was Honest: If a drive failed in my rack, a red light turned on. I replaced the drive. Now, if a “Volume” fails, I have to open a support ticket and wait for a “Cloud Support Associate” to tell me that there was an “increased error rate in the underlying hardware.” Just say the disk died, Kevin. I know the disk died.
  3. The Magic Wand is a Pipe: AWS is not a magic wand. It is a series of pipes, and those pipes are leaking money. We ignored the AWS best practices of “least privilege” and “cost allocation tags” because they were “too slow.” Now we’re fast, but we’re broke.
  4. Identity is a Nightmare: IAM is more complex than the actual applications we run. I shouldn’t need a PhD in Boolean logic to allow a Lambda function to write to a log group.
  5. Networking is a Lost Art: Nobody knows what a subnet mask is anymore. Nobody understands BGP. They just click “Create Transit Gateway” and hope the “Cloud Magic” handles the routing. It doesn’t. It just charges you for the failure.

I’m going back to a world where “The Cloud” is something that brings rain, not something that brings a $250,000 invoice for “Unused Elastic IPs” and “NAT Gateway Data Processing.”

I’ve left the rack keys on my desk. Oh wait, we don’t have racks anymore. I’ve left my login credentials for the AWS Console in a secure vault. I suggest you delete the p4d.24xlarge instances before you can’t afford to pay my final paycheck.

Goodbye, and may your egress fees be low and your IOPS be plentiful. You’re going to need them.

Regards,

The Grumpy On-Prem Refugee
