10 Essential AWS Best Practices for Cloud Optimization

The smell of ozone. That’s what I miss. You don’t get that in the us-east-1 console. You get a loading spinner and a “Service Health Dashboard” that lies to your face while the world burns. Back in February of 2009, I was working in a colo facility in the basement of a converted textile mill in Chicago. It was negative twenty degrees outside, and the HVAC system for the server room decided that was the perfect moment to seize up. I was standing there in a Carhartt parka, my breath visible in the glow of the status LEDs, trying to figure out why a Dell PowerEdge 2950 was screaming like a banshee. It wasn’t the fans; it was the PERC 6/i RAID controller. The battery-backed cache had failed, and the write-through mode was dragging the entire SQL cluster into the dirt. Then the smell hit me: the unmistakable scent of a capacitor popping on the backplane. I had to pull that two-unit beast out of the rack by myself, my fingers numb, while the blizzard rattled the industrial windows upstairs. I spent fourteen hours rebuilding that array from LTO-4 tapes that I’d hand-carried from a fireproof safe. There was no “auto-scaling.” There was no “self-healing.” There was just me, a crimping tool, a spare controller I scavenged from a decommissioned web server, and the cold reality of physical hardware.

That night taught me that everything eventually breaks, and it usually breaks when it’s most inconvenient for you. Now, they’ve dragged me into this “cloud” era, and everyone acts like the hardware doesn’t exist anymore. They call it “serverless,” which is the biggest load of marketing jank I’ve ever heard. It’s still a server; it’s just someone else’s server, and you’re paying a steep markup for the privilege of not being able to touch it. When I look at a migration project for a legacy monolith (some spaghetti code mess written in Java 8 that expects a local mount point and a persistent IP), I don’t see “innovation.” I see a disaster waiting to happen in a multi-tenant environment. My scars from 2009 are why I look at these shiny new services with a squint and a sneer. I know that beneath the “AWS best practices” marketing fluff, there’s a hypervisor somewhere that’s oversubscribed and a network switch that’s dropping packets. I’m writing this at 3:15 AM because the “elastic” load balancer decided it didn’t feel like being elastic today, and my coffee is the only thing keeping me from throwing this laptop through the window.


The Myth of the Infinite Cloud

They tell the juniors that the cloud is infinite. They say you can just scale horizontally until the sun goes down. That is a lie. The cloud is a series of very specific, very rigid boxes called Service Quotas, and if you don’t know where the walls are, you’re going to crack your skull against them. I’ve seen teams try to follow some sanitized version of AWS best practices by spinning up thousands of tiny containers, only to realize they’ve hit the API rate limit for the EC2 DescribeInstances call. Suddenly, their entire deployment pipeline grinds to a halt because they’re being throttled by the very provider they’re paying six figures a month to. You aren’t just fighting your own bugs; you’re fighting the “noisy neighbor” on the physical rack three states away and the arbitrary limits set by a bean counter in Seattle.

If you’re using the AWS CLI v2.15, you better get real comfortable with checking your limits before you start dreaming of “infinite” scale. You need to know exactly how many VPCs, EIPs, and running instances your account is allowed to have in a specific region, and you need to know that the On-Demand instance quota is counted in vCPUs, not in instances. If you don’t, your “automated” infrastructure-as-code is going to barf a bunch of JSON errors at you right when you’re trying to push a critical patch. Use the following command to actually see what you’re up against before you start building your “vibrant” architecture that will inevitably fail:

# Checking service quotas for EC2 instances to avoid the "infinite" trap
aws service-quotas list-service-quotas \
    --service-code ec2 \
    --query "Quotas[?QuotaName=='Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances'].{Name:QuotaName, Value:Value, Code:QuotaCode}" \
    --output table \
    --region us-east-1

The “jank” here is that these quotas aren’t always updated in real-time. You might think you have room to grow, but then you hit a “ResourceLimitExceeded” error because the internal accounting hasn’t caught up to your recent deletions. It’s the same bit rot, just moved to a different layer of the stack. You have to treat the cloud like a crowded data center where you’ve only rented half a rack. You have to be stingy. You have to be skeptical. If you don’t account for throttling and quotas, your “highly available” system is just a very expensive way to show a 503 error to your users.
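
Accounting for throttling mostly means retrying politely. The AWS SDKs and the CLI already retry throttled calls on their own, so treat the following as a minimal sketch of the idea (exponential backoff with full jitter) for your own wrappers, not as what any AWS library does internally. `ThrottledError`, `backoff_delays`, and `call_with_backoff` are hypothetical names made up for this illustration.

```python
import random
import time


class ThrottledError(Exception):
    """Stand-in for a provider throttling error (hypothetical name)."""


def backoff_delays(retries, base=0.5, cap=20.0, rng=random.random):
    """Yield one 'full jitter' delay per retry: random in [0, min(cap, base * 2**n))."""
    for n in range(retries):
        yield rng() * min(cap, base * (2 ** n))


def call_with_backoff(fn, attempts=5, sleep=time.sleep, **backoff_kwargs):
    """Call fn(); on ThrottledError wait and retry, re-raising after the last attempt."""
    delays = backoff_delays(attempts - 1, **backoff_kwargs)
    for attempt in range(attempts):
        try:
            return fn()
        except ThrottledError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the throttle
            sleep(next(delays))
```

The jitter matters more than the exponent: a thousand containers that all retry on the same schedule just get throttled again in lockstep.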

IAM: The Art of Saying ‘No’ Until it Works

Back in the day, I had a physical key to the cage. If you didn’t have the key, you didn’t touch the server. Now, we have IAM, a labyrinth of JSON policies that are so complex they practically require a PhD in Boolean logic to understand. People get lazy. They see the complexity and they just slap AdministratorAccess on everything because they want the “seamless” experience of things just working. That is how you end up with a crypto-miner running on a p4d.24xlarge instance that costs more than your mortgage. I hate writing IAM policies. It’s a tedious, soul-crushing exercise in trial and error, but it’s the only thing standing between you and a total account takeover.

You have to adopt a “deny by default” mindset. If a service doesn’t absolutely need to talk to another service, you shut that door and you bolt it. I don’t care if it makes the developers cry. I’ve seen what happens when a “robust” application has an SSRF vulnerability and the EC2 instance profile has s3:* permissions. It’s not pretty. You end up with your entire customer database on a public pastebin. When you’re trying to implement what the whitepapers call AWS best practices, you start with an empty policy and you add permissions one by one until the errors stop. It’s slow, it’s painful, and it’s the only way to sleep at night. Here is a snippet of what a “least privilege” policy actually looks like for a standard app server, with none of that Resource: * garbage that the tutorials tell you to use:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RestrictiveS3Access",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::my-legacy-monolith-data-prod/*",
            "Condition": {
                "StringEquals": {
                    "aws:PrincipalTag/Environment": "production"
                }
            }
        },
        {
            "Sid": "CloudWatchLogAccess",
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:*:*:*"
        }
    ]
}

Notice the Condition block. If you aren’t using conditions, you aren’t doing IAM right. You’re just pretending. Notice, too, that the CloudWatch Logs statement still carries a wildcard ARN; in a real deployment, pin it to the region, account, and log group the app actually writes to. And don’t get me started on the “Confused Deputy” problem. If you’re not validating the ExternalId when you’re letting third-party SaaS tools into your account, you’re basically leaving the back door unlocked and putting a “Welcome” mat out for hackers. It’s all just cruft and complexity designed to hide the fact that security is hard and people are lazy.
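
Both checks are easy to automate before a policy ever reaches an account. What follows is a rough sketch, assuming the policy has already been loaded into a Python dict; `find_wildcards` and `trust_requires_external_id` are hypothetical helper names, and the wildcard test is a crude heuristic, not a substitute for a real policy analyzer. The ExternalId check only makes sense for roles assumed by another account, not for service principals such as EC2.

```python
def _as_list(value):
    """IAM allows a bare string or a list in most fields; normalise to a list."""
    return value if isinstance(value, list) else [value]


def find_wildcards(policy):
    """Return (sid, problem) pairs for Allow statements with wildcard actions or resources."""
    findings = []
    for stmt in _as_list(policy["Statement"]):
        if stmt.get("Effect") != "Allow":
            continue
        sid = stmt.get("Sid", "<no Sid>")
        for action in _as_list(stmt.get("Action", [])):
            if action == "*" or action.endswith(":*"):
                findings.append((sid, f"wildcard action {action}"))
        for resource in _as_list(stmt.get("Resource", [])):
            # Heuristic: flag "*" and ARNs whose region, account and resource are all "*".
            if resource == "*" or resource.endswith(":*:*:*"):
                findings.append((sid, f"wildcard resource {resource}"))
    return findings


def trust_requires_external_id(trust_policy):
    """True if every Allow statement granting sts:AssumeRole checks sts:ExternalId."""
    for stmt in _as_list(trust_policy["Statement"]):
        if stmt.get("Effect") != "Allow":
            continue
        if "sts:AssumeRole" not in _as_list(stmt.get("Action", [])):
            continue
        equals = stmt.get("Condition", {}).get("StringEquals", {})
        if "sts:ExternalId" not in equals:
            return False
    return True
```

Run against the policy above, `find_wildcards` flags only the CloudWatchLogAccess statement, which is exactly the one that deserves a second look.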

VPCs and the Ghost of Subnets Past

I remember when a “network” was a bunch of Cat6 cables I crimped myself and a Cisco switch that smelled like warm electronics. Now, it’s a “Virtual Private Cloud,” which is just a fancy way of saying “software-defined networking that will charge you for every gigabyte that crosses an arbitrary boundary.” The CIDR math is the same, but the stakes are higher because every mistake has a dollar sign attached to it. The biggest scam in the modern cloud is the NAT Gateway. It’s a tax on the soul. You pay for the gateway to exist, and then you pay for every bit of data that passes through it. If you’re pulling a 50GB container image from a public registry through a NAT Gateway, you’re basically burning money to stay warm.
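
To put a number on the tax, here is a back-of-the-envelope sketch. The two prices are assumptions (the us-east-1 list prices as I remember them), so check the current pricing page before quoting the result to anyone; `nat_monthly_cost` is a hypothetical helper name.

```python
HOURLY_USD = 0.045  # assumed hourly charge per NAT Gateway
PER_GB_USD = 0.045  # assumed data processing charge per GB


def nat_monthly_cost(gb_processed, gateways=1, hours=730):
    """Estimate one month of NAT Gateway charges: fixed hourly fee plus per-GB processing."""
    fixed = gateways * hours * HOURLY_USD
    traffic = gb_processed * PER_GB_USD
    return round(fixed + traffic, 2)


# A 50 GB image pulled 20 times a day for 30 days is 30,000 GB through the gateway.
monthly = nat_monthly_cost(50 * 20 * 30)
```

Under those assumed prices, the hourly fee is pocket change and the data processing charge is the bill: roughly $1,350 of the month’s total comes from the image pulls alone.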

I’ve seen “cloud architects” who don’t know the difference between a public and a private subnet. They just put everything in a public subnet and use Security Groups to “secure” it. That’s like putting your server on the sidewalk and hoping nobody tries the door handle. If you’re ignoring the so-called AWS best practices on VPC design, you’re going to end up with a flat network that is a playground for lateral movement. You need VPC Endpoints. You need to keep your traffic on the AWS backbone and off the public internet. It’s more “jank” to configure, but it cuts the NAT Gateway tax: gateway endpoints for S3 and DynamoDB are free, while interface endpoints carry their own hourly and per-gigabyte charges, so do the math before you deploy them everywhere. Start by using the CLI to list the gateways that are sucking your budget dry:

# Listing active NAT Gateways; the API reports no traffic figures,
# so confirm which ones are idle against CloudWatch metrics before deleting
aws ec2 describe-nat-gateways \
    --filter "Name=state,Values=available" \
    --query "NatGateways[].{ID:NatGatewayId, Subnet:SubnetId, Created:CreateTime}" \
    --output table
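
Whether a gateway is actually idle is a question about traffic, and traffic lives in CloudWatch (the AWS/NATGateway namespace, with metrics such as BytesOutToDestination), not in the describe-nat-gateways output. The sketch below assumes you have already fetched one get-metric-statistics response per gateway with the Sum statistic and stored them in a dict keyed by gateway ID; `total_bytes`, `idle_gateways`, and the 1 MB threshold are hypothetical choices for illustration.

```python
def total_bytes(response):
    """Add up the 'Sum' statistic across all datapoints in one metric response."""
    return sum(dp.get("Sum", 0.0) for dp in response.get("Datapoints", []))


def idle_gateways(metrics_by_gateway, threshold_bytes=1_000_000):
    """Return the IDs whose total traffic over the sampled window is below the threshold."""
    return sorted(
        gateway_id
        for gateway_id, response in metrics_by_gateway.items()
        if total_bytes(response) < threshold_bytes
    )
```

A gateway with no datapoints at all counts as idle here, which is usually what you want, but look at the sampling window before you delete anything.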

And don’t even get me started on IPv6. Amazon Linux 2023 handles it better, but the legacy monolith we’re moving still thinks the world ends at 255.255.255.255. Trying to bridge that gap is like trying to teach a dog to play the violin. You end up with a mess of dual-stack configurations and routing table entries that look like spaghetti code. It’s all just layers of abstraction built on top of the same old Ethernet frames, and it’s getting harder to see the actual wire through all the “cloud-native” fog.

S3: It’s Not a Trash Can, It’s a Liability

S3 is the one thing I’ll grudgingly admit is impressive, but it’s also the most dangerous tool in the shed. People treat it like an infinite trash can for their “big data” (which is usually just 400GB of uncompressed logs that nobody will ever read). The problem is that a bucket is only private until somebody clicks the wrong button. I’ve seen more data breaches caused by a “vibrant” developer making a bucket public “just for a second” than I have by actual sophisticated hacking. If you aren’t using “Block Public Access” at the account level, you’re asking for a PagerDuty alert at 4:00 AM.

The bit rot in S3 comes from the lack of lifecycle policies. People upload files and forget them. Ten years later, you’re paying for petabytes of data that hasn’t been accessed since the Obama administration. If you actually want to follow AWS best practices, you need to enforce encryption at rest and use Object Lock for anything that needs to be immutable. And for the love of all that is holy, use versioning. I once saw a junior run a Python 3.11 script with a bug that deleted the wrong prefix in a production bucket. If we hadn’t had versioning enabled, I’d still be in that data center trying to find the tapes.
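
Versioning saves you because a delete on a versioned bucket only stacks a delete marker on top of the object; remove the marker and the previous version is current again. As a rough sketch, assuming you already have a list-object-versions response as a dict, this picks out the markers to remove for a given prefix. `delete_markers_to_remove` is a hypothetical helper, and it handles a single page of results only.

```python
def delete_markers_to_remove(listing, prefix):
    """Return (key, version_id) for each current delete marker under the prefix."""
    return [
        (marker["Key"], marker["VersionId"])
        for marker in listing.get("DeleteMarkers", [])
        if marker.get("IsLatest") and marker["Key"].startswith(prefix)
    ]
```

Each pair then goes to a delete-object call with that version ID. Older, non-current markers are left alone on purpose, since removing them changes nothing about what readers see.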

# Enforcing a lifecycle policy to move old cruft to Glacier Deep Archive
aws s3api put-bucket-lifecycle-configuration \
    --bucket my-legacy-monolith-backups \
    --lifecycle-configuration '{
        "Rules": [
            {
                "ID": "MoveOldLogsToArchive",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER_IR"},
                    {"Days": 180, "StorageClass": "DEEP_ARCHIVE"}
                ],
                "Expiration": {"Days": 3650}
            }
        ]
    }'
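
If the rule is hard to picture, this small sketch walks an object’s age through it and reports where the object should end up. It assumes the rule has been loaded as a dict in the same shape as the JSON above; `storage_class_at` is a hypothetical helper, and S3 applies lifecycle actions asynchronously, so real transitions can lag the nominal day.

```python
def storage_class_at(age_days, rule, initial="STANDARD"):
    """Return the storage class an object of this age should be in, or None if expired."""
    expiration = rule.get("Expiration", {}).get("Days")
    if expiration is not None and age_days >= expiration:
        return None  # past the expiration day: the object is due for deletion
    current = initial
    for transition in sorted(rule.get("Transitions", []), key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


rule = {
    "Transitions": [
        {"Days": 90, "StorageClass": "GLACIER_IR"},
        {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
    ],
    "Expiration": {"Days": 3650},
}
```

With the rule above, a 30-day-old log is still in STANDARD, a 200-day-old one sits in DEEP_ARCHIVE, and anything past ten years is gone.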

Encryption is another point of pain. KMS is great until you hit the request limits because your app is calling Decrypt every time it reads a file. Then you’re throttled, your app hangs, and the “cloud” doesn’t look so “seamless” anymore. You have to cache your data keys. You have to understand the envelope encryption model. It’s not just “upload and forget.” It’s a constant battle against entropy and the rising cost of storage.
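
Caching data keys comes down to reusing one key for a bounded time and a bounded number of operations instead of asking KMS for every file. The AWS Encryption SDK ships its own data key caching, so the class below is only a minimal sketch of the idea, not that library’s API. `DataKeyCache` is a hypothetical name, and `generate_key` stands in for whatever function of yours wraps the KMS call.

```python
import time


class DataKeyCache:
    """Reuse one data key for a bounded age and number of uses, then fetch a new one."""

    def __init__(self, generate_key, max_age_seconds=300, max_uses=1000, clock=time.monotonic):
        self._generate_key = generate_key  # callable that would wrap the KMS request
        self._max_age = max_age_seconds
        self._max_uses = max_uses
        self._clock = clock
        self._key = None
        self._created = 0.0
        self._uses = 0

    def get(self):
        """Return the cached key, refreshing it when it is missing, too old, or overused."""
        now = self._clock()
        stale = (
            self._key is None
            or now - self._created >= self._max_age
            or self._uses >= self._max_uses
        )
        if stale:
            self._key = self._generate_key()
            self._created = now
            self._uses = 0
        self._uses += 1
        return self._key
```

The two limits are the trade-off in one place: raise them and you call KMS less, lower them and a leaked key exposes less data.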

Monitoring: If it Doesn’t Wake You Up, It’s Useless

CloudWatch is a cruel joke. It’s a logging system designed by people who love looking at graphs but hate actually fixing things. The latency is the real killer. By the time a CloudWatch alarm triggers and sends a notification to SNS, which then triggers a Lambda, which then pings your Slack channel, your database has already been a smoking crater for five minutes. I miss the days of Nagios—at least I knew that if the light turned red, something was actually broken. Now, I have “AI-powered insights” telling me that my CPU usage is “anomalous” because I ran a cron job.

The “jank” in modern monitoring is the sheer volume of noise. We have metrics for everything, but visibility into nothing. You need to focus on the “Four Golden Signals,” but even then, you’re just guessing because you can’t see the underlying hardware. Is the disk slow because of your code, or because the EBS volume is being throttled on the IOPS you were too cheap to provision? You have to dig through the logs. Real logs. Not the summarized “insights” garbage. I use the CLI to tail logs because the web console is too slow to be useful during an actual incident.

# Tailing logs for a specific function to see the actual errors, not the marketing version
aws logs tail /aws/lambda/legacy-monolith-processor \
    --follow \
    --format short \
    --since 10m

If your monitoring doesn’t include a “dead man’s switch,” you don’t have monitoring. You have a historical record of your failures. I want to know when the heartbeat stops, not when the “vibrant” dashboard shows a 5% dip in throughput. And don’t get me started on the cost of custom metrics. You start pushing a few high-cardinality metrics and suddenly your CloudWatch bill is higher than your EC2 bill. It’s a racket. They charge you to tell you that the service you’re paying for is broken.
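
A dead man’s switch is simple enough to sketch in a few lines: the job checks in on a schedule, and the alert fires when the check-ins stop. In CloudWatch the nearest equivalent is an alarm configured to treat missing data as breaching. The class below is a minimal illustration with a hypothetical name, and the injectable clock exists only so it can be tested without waiting.

```python
import time


class DeadMansSwitch:
    """Report trouble when heartbeats stop arriving, not when a metric crosses a line."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_beat = clock()

    def beat(self):
        """Called by the monitored job each time it completes a cycle."""
        self._last_beat = self._clock()

    def is_dead(self):
        """True once no heartbeat has arrived within the timeout."""
        return self._clock() - self._last_beat > self._timeout
```

The checker has to run somewhere other than the thing it watches; a switch that dies with the server it is monitoring is just another historical record.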

Cost Optimization: Paying for Air

The cloud is the only place where you pay for things you aren’t using. In my data center, if a server was off, it was just a hunk of metal taking up space. In AWS, if you leave an unattached EBS volume sitting there, or an idle NAT Gateway, or a bunch of “zombie” Elastic IPs, the meter keeps running. It’s “paying for air.” I’ve spent the last three weeks cleaning up the mess left behind by a “cloud-native” consultant who thought that “elasticity” meant “never delete anything.”

We’re running Amazon Linux 2023 now, which is fine, I guess, but it doesn’t change the fact that the underlying instances are often oversized. People pick an m5.xlarge because they’re too lazy to profile their code, and they end up using 5% of the CPU. That’s 95% waste. That’s money that could be going into my whiskey fund. You have to be ruthless. You have to use Spot instances for anything that isn’t mission-critical, and you have to use Savings Plans for the stuff that is. But even then, you’re just playing a game of “guess the capacity” with a provider that has all the cards.

The “cruft” accumulates faster than you think. You spin up a sandbox to test a new feature, you forget to delete the RDS snapshot, and six months later you’re wondering why your “storage” costs have doubled. There is no “seamless” way to manage this. It’s manual labor. It’s checking the billing dashboard every morning like a hawk. It’s writing scripts to find and kill the resources that are sucking the life out of your budget.
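
The finding half of those scripts is mostly filtering. As a sketch, assuming you already hold the describe-volumes and describe-addresses responses as dicts, these two hypothetical helpers pick out the usual suspects: EBS volumes in the available state (attached to nothing) and Elastic IPs with no association. The killing half I leave to a human with a second cup of coffee.

```python
def unattached_volumes(volumes_response):
    """Return IDs of EBS volumes in the 'available' state, i.e. attached to nothing."""
    return [
        volume["VolumeId"]
        for volume in volumes_response.get("Volumes", [])
        if volume.get("State") == "available"
    ]


def unassociated_addresses(addresses_response):
    """Return allocation IDs (or public IPs) of Elastic IPs with no association."""
    return [
        address.get("AllocationId", address.get("PublicIp"))
        for address in addresses_response.get("Addresses", [])
        if "AssociationId" not in address
    ]
```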

I just got a notification. The RDS instance in us-east-1 is reporting “Storage Full.” Of course it is. It’s 3:45 AM and the legacy monolith just decided to dump 200GB of “vibrant” debug logs into the database because a developer left a flag on. The “self-healing” storage hasn’t kicked in because we hit the maximum autoscaling limit I set to keep the bean counters happy. I have to go. The “cloud” is calling, and it sounds a lot like a server screaming in a blizzard. My coffee is cold. My eyes hurt. This is the life we chose.

Wait, the PagerDuty is going off again. It’s the “elastic” load balancer. It’s failing its health checks. I’m out.
