AWS Best Practices: Why Your Infrastructure is a Money Pit and How to Stop the Bleeding

I once worked at a fintech startup where we “followed the docs” to the letter. We set up a multi-AZ deployment for a high-traffic microservice, thinking we were being smart about availability. Three days later, the Head of Engineering walked into the room with a face the color of a CloudWatch alarm. Our NAT Gateway bill for the weekend was $14,000. We weren’t even doing anything complex; we were just pulling 5GB Docker images from an external registry every time a pod scaled, and those bits were traveling over the public internet through a NAT Gateway. It was a $14k lesson in why “default” settings are a trap.

The problem is that AWS documentation is written by people who want you to use more AWS. They want you to click the “Enable” button on every managed service because it’s “seamless.” In reality, every “seamless” integration is a hidden line item on your bill or a new way for your system to fail at 3 AM. This isn’t about “unlocking potential.” This is about survival in a cloud environment that is designed to over-provision your wallet. The following best practices are born from scars, not slide decks.

1. IAM: Stop Using Managed Policies Immediately

AWS provides “Managed Policies” like AdministratorAccess or AmazonS3FullAccess. They are a security nightmare. If you attach AmazonS3FullAccess to a Lambda function, and that function has a remote code execution vulnerability, the attacker doesn’t just have your data; they have the ability to delete every bucket in your account. I’ve seen a disgruntled contractor wipe a staging environment because the “Dev” role had iam:* permissions “just in case.”

The right way to handle IAM is to write your own policies and use Condition blocks. If a service only needs to talk to one bucket, name that bucket. If it only needs to talk to that bucket from a specific VPC, enforce it.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowS3AccessToSpecificBucket",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::production-customer-data-12345/*",
            "Condition": {
                "StringEquals": {
                    "aws:SourceVpc": "vpc-0a1b2c3d4e5f67890"
                }
            }
        }
    ]
}
  • Pro-tip: Use iam:PassRole restrictions. Without them, anyone with ec2:RunInstances can create a machine with an Admin role and escalate their privileges.
  • Note to self: Use aws-vault or AWS IAM Identity Center (formerly SSO). Never, ever let a developer put a .csv file with AKIA... keys on their local machine. It will end up in a public GitHub repo. It’s not a matter of if, but when.
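A minimal PassRole lock-down looks like this (the account ID, role name, and target service are placeholders; tailor them to your setup):

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowPassOnlyAppRole",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::123456789012:role/app-runtime-role",
            "Condition": {
                "StringEquals": {
                    "iam:PassedToService": "ec2.amazonaws.com"
                }
            }
        }
    ]
}
```

With no other PassRole grant in scope, this caps privilege escalation: a developer can launch instances, but only with the one role you named, and only for EC2.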

2. The NAT Gateway Scam and VPC Endpoints

If you are using a NAT Gateway, you are likely being overcharged. AWS charges $0.045 per GB of data processed. This sounds small until you realize that your internal logs, your container image pulls, and your database backups are all hitting that gateway if you haven’t configured your routing correctly. I’ve seen companies spend more on NAT Gateways than on their actual EC2 compute.
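The damage is easy to estimate before the bill arrives. A back-of-the-envelope sketch, assuming us-east-1 list prices at the time of writing (~$0.045 per gateway-hour plus $0.045 per GB processed; verify against the current AWS price list):

```python
# NAT Gateway cost estimate -- rates are us-east-1 list prices as of writing
# and should be verified against the current AWS price list.
HOURLY_RATE = 0.045  # USD per gateway-hour
PER_GB_RATE = 0.045  # USD per GB processed

def nat_monthly_cost(gb_processed: float, gateways: int = 1, hours: int = 730) -> float:
    """Estimated monthly NAT Gateway cost in USD."""
    return gateways * hours * HOURLY_RATE + gb_processed * PER_GB_RATE

# Pulling a 5 GB container image 1,000 times through one gateway:
print(round(nat_monthly_cost(5 * 1000), 2))
```

Five thousand gigabytes of image pulls alone is over $200 a month in processing fees, before a single byte of real application traffic.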

The fix is VPC Endpoints. Specifically, Gateway Endpoints for S3 and DynamoDB. They are free. They don’t charge for data transfer. If you aren’t using them, you are literally throwing money at Amazon for no reason.

# Terraform snippet for an S3 Gateway Endpoint
resource "aws_vpc_endpoint" "s3" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.us-east-1.s3"
  route_table_ids = [aws_route_table.private.id]
}

For other services like Secrets Manager or EC2 APIs, use Interface Endpoints (PrivateLink). They aren’t free—they have an hourly cost—but the data processing fee is usually lower than a NAT Gateway for high-volume internal traffic. Stop letting your traffic loop out to the public internet just to talk to another AWS service in the same region.
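An Interface Endpoint is a few lines of Terraform; a sketch for Secrets Manager, with illustrative resource names:

```hcl
# Interface Endpoint (PrivateLink) for Secrets Manager. Costs per AZ-hour,
# but keeps the traffic off the NAT Gateway. Resource names are placeholders.
resource "aws_vpc_endpoint" "secretsmanager" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.secretsmanager"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = [aws_subnet.private_a.id, aws_subnet.private_b.id]
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}
```

With private_dns_enabled, your application keeps calling the normal service hostname; resolution just lands on the endpoint's private IPs instead of routing out through NAT.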

3. Compute: Graviton and the gp3 Tax

If you are still running t3 or m5 instances, you are paying a legacy tax. AWS Graviton (the 6g and 7g instance families) is cheaper and faster for almost every workload. We migrated a fleet of Python-based microservices from m5.large to m6g.large and saw a 20% drop in latency and a 15% drop in cost. The only hurdle is that your CI/CD pipeline needs to build for arm64.

Speaking of taxes, look at your EBS volumes. If they say gp2, change them to gp3 right now. gp2 ties your IOPS to the size of the disk. If you want 3,000 IOPS on gp2, you have to buy a 1TB disk. On gp3, you get 3,000 IOPS for free regardless of disk size, and the baseline price is 20% lower. It’s a literal “click to save money” button in the console.
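The gp2-to-gp3 math is worth doing explicitly. A sketch using list prices at the time of writing ($0.10/GB-month for gp2, $0.08/GB-month for gp3; check current rates):

```python
# gp2 vs gp3 monthly cost for a given IOPS target. Prices are us-east-1
# list prices as of writing -- verify before relying on them.
GP2_PER_GB = 0.10
GP3_PER_GB = 0.08
GP3_BASELINE_IOPS = 3000  # included free on every gp3 volume

def gp2_cost_for_iops(iops: int) -> float:
    """gp2 ties IOPS to size at 3 IOPS/GB, so the disk must be iops/3 GB."""
    required_gb = iops / 3
    return required_gb * GP2_PER_GB

def gp3_cost(size_gb: int) -> float:
    """gp3 cost for a volume at or below the free 3,000 IOPS baseline."""
    return size_gb * GP3_PER_GB

# 3,000 IOPS on gp2 forces a 1 TB disk; on gp3 you size for the data you have.
print(round(gp2_cost_for_iops(3000), 2))  # the 1,000 GB gp2 volume
print(round(gp3_cost(200), 2))            # a 200 GB gp3 volume
```

If you only need 200 GB of actual data, gp2 makes you pay for a terabyte just to hit the IOPS number; gp3 gives you the same IOPS at a fraction of the cost.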

Pro-tip: When moving to Graviton, you will hit the exec format error in your Docker containers. Use Docker Buildx to create multi-arch images. Don’t be the person who pushes an x86 image to an ARM node and wonders why the Kubelet is screaming.

# Build for both architectures
docker buildx build --platform linux/amd64,linux/arm64 -t 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:v1.0.2 --push .

4. S3: Intelligent Tiering is Not a Silver Bullet

The official best-practice guides often scream “Use S3 Intelligent-Tiering!” This is dangerous advice for people with millions of small objects. Intelligent-Tiering has a monitoring fee per object ($0.0025 per 1,000 objects). If you have 100 million 10KB files, you will pay $250 a month just for AWS to “monitor” files that would have cost you roughly $20 to store in the Standard tier.
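The arithmetic is worth spelling out; a sketch assuming the fee above plus roughly $0.023/GB-month for Standard storage (check current pricing):

```python
# Intelligent-Tiering monitoring fee vs Standard storage for many tiny
# objects. Constants are list prices as of writing -- verify before use.
MONITORING_PER_1000 = 0.0025  # USD per 1,000 monitored objects per month
STANDARD_PER_GB = 0.023       # USD per GB-month, S3 Standard

def monitoring_fee(object_count: int) -> float:
    """Monthly Intelligent-Tiering monitoring charge."""
    return object_count / 1000 * MONITORING_PER_1000

def standard_storage_cost(object_count: int, object_kb: float) -> float:
    """Monthly cost of just storing the same objects in Standard."""
    total_gb = object_count * object_kb / (1024 * 1024)
    return total_gb * STANDARD_PER_GB

objects = 100_000_000  # 100 million 10 KB files
print(round(monitoring_fee(objects), 2))             # monitoring alone
print(round(standard_storage_cost(objects, 10), 2))  # storing the data itself
```

For this workload, the monitoring fee is more than ten times the cost of simply leaving the data in Standard.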

If your objects are smaller than 128KB, Intelligent-Tiering will never move them to the Archive Instant Access tier anyway. You’re paying for a service that does nothing. For small objects, bundle them into larger archives before uploading; for larger objects, use Lifecycle Policies to transition them to ONEZONE_IA or GLACIER_IR after 30 days. (Keep in mind the IA and Glacier Instant Retrieval classes bill a 128KB minimum per object, so transitioning tiny files there costs more, not less.)

  • Rule 1: Objects < 128KB stay in Standard.
  • Rule 2: Disable ACLs and set Object Ownership to “Bucket owner enforced.” It is the only way to stay sane.
  • Rule 3: Versioning is not a backup. If you delete a bucket, the versions go with it. Use Cross-Region Replication (CRR) if you actually care about the data.
  • Rule 4: Set an AbortIncompleteMultipartUpload lifecycle rule. I once found 4TB of “ghost” data in a bucket from failed uploads that we were being billed for.
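Rule 4 is one small Terraform block; a sketch with an illustrative bucket reference:

```hcl
# Reap failed multipart uploads before they become invisible billed storage.
# Bucket reference is a placeholder; seven days is a reasonable default.
resource "aws_s3_bucket_lifecycle_configuration" "cleanup" {
  bucket = aws_s3_bucket.data.id

  rule {
    id     = "abort-incomplete-multipart"
    status = "Enabled"
    filter {}

    abort_incomplete_multipart_upload {
      days_after_initiation = 7
    }
  }
}
```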

5. Networking: The Cross-AZ Data Transfer Trap

AWS charges $0.01 per GB for data transferred between Availability Zones (AZs). This sounds negligible. It isn’t. If you have a Kafka cluster in us-east-1a and your consumers are in us-east-1b, you are paying that fee for every single message. If you move 100TB a month, that’s $1,000 just for the privilege of crossing a fiber optic cable in Northern Virginia.

To fix this, use “AZ-aware” routing. In Kubernetes, use Topology Aware Routing (formerly Topology Aware Hints). For RDS, try to keep your application servers in the same AZ as the primary writer. Yes, you need multi-AZ for failover, but your “happy path” traffic should stay within the same zone whenever possible.

# Example Kubernetes Service with Topology Hints
apiVersion: v1
kind: Service
metadata:
  name: my-app-service
  annotations:
    service.kubernetes.io/topology-mode: Auto
spec:
  selector:
    app: my-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080

6. Observability: CloudWatch is a Money Pit

CloudWatch Logs is one of the most expensive ways to store text. At $0.50 per GB ingested, it’s a racket. I’ve seen systems where the logging cost exceeded the compute cost because some developer left DEBUG logging on in production. The standard best-practice guides suggest using CloudWatch for everything. I suggest you don’t.

Ingest your logs, but set a retention policy immediately. The default is “Never Expire.” That is a recipe for a bill that grows linearly until you die. Set it to 7 days, or 14 days. If you need long-term storage for compliance, use a Kinesis Firehose to stream those logs to S3 and query them with Athena. It is orders of magnitude cheaper.
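Retention is one attribute on the log group; a sketch with an illustrative group name:

```hcl
# Explicit retention on every log group -- the default is "Never Expire",
# which means the bill only ever goes up. Group name is a placeholder.
resource "aws_cloudwatch_log_group" "app" {
  name              = "/ecs/my-app"
  retention_in_days = 14
}
```

If you manage log groups outside Terraform, the same setting exists in the console and the CLI; the point is that no group should ever be left on the default.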

Also, stop using CloudWatch Custom Metrics if you have high cardinality. If you start tracking customer_id as a dimension in CloudWatch, your bill will explode. Each unique combination of dimensions is a new metric, and AWS charges per metric. Use OpenTelemetry and send that data to a Prometheus instance or a specialized vendor who doesn’t charge per-metric-line-item.

7. The “Default VPC” Disaster

Every AWS account comes with a default VPC in every region. It has public subnets and an Internet Gateway. It is a playground for security breaches. The first thing I do in any new account is delete the default VPC. It forces you to actually think about your networking. If you don’t specify a subnet in your Terraform code, AWS will try to put it in the default VPC. If that VPC has an IGW, your “private” database might suddenly have a public IP address.

I once saw a “hidden” RDS instance that had been running in a default VPC for two years. It was wide open to 0.0.0.0/0 on port 5432. The only reason it wasn’t breached was that the password was 32 characters of random gibberish. We got lucky. Don’t rely on luck.

8. RDS: The Proxy and the Connection Limit

If you are using Lambda with RDS, you are going to have a bad time. Lambda scales horizontally so fast that it will exhaust the Postgres or MySQL connection pool in seconds. Each concurrent Lambda execution is its own environment, and if you aren’t careful, its own database connection. Postgres on a db.t3.medium can’t handle 500 concurrent connections.

Use RDS Proxy. It sits between your Lambda and your DB, pooling connections and handling the “zombie” connections that Lambda leaves behind. It also makes failovers faster because the proxy handles the DNS switch, not your application code.
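The proxy itself is a small amount of Terraform; a sketch with placeholder names (you still need an aws_db_proxy_target to attach it to the actual DB instance, omitted here):

```hcl
# Minimal RDS Proxy in front of a Postgres instance. Role, subnets, and
# secret references are placeholders for your own resources.
resource "aws_db_proxy" "app" {
  name           = "my-app-proxy"
  engine_family  = "POSTGRESQL"
  role_arn       = aws_iam_role.rds_proxy.arn
  vpc_subnet_ids = [aws_subnet.private_a.id, aws_subnet.private_b.id]
  require_tls    = true

  auth {
    auth_scheme = "SECRETS"
    secret_arn  = aws_secretsmanager_secret.db_credentials.arn
  }
}
```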

-- Pro-tip: Check your DB connections with this SQL
SELECT count(*), state FROM pg_stat_activity GROUP BY state;

If you see hundreds of “idle” connections, your application is leaking them. RDS Proxy will mitigate this, but it won’t fix your bad code. Fix the code, then add the proxy as a safety net.

9. Terraform: The “State” of Chaos

Infrastructure as Code (IaC) is not optional. But best-practice guides often ignore how to manage the state file. If you are keeping your terraform.tfstate on your local machine, you are one rm -rf away from a very bad week. Use an S3 backend with DynamoDB locking. This is non-negotiable for teams.

terraform {
  backend "s3" {
    bucket         = "my-company-terraform-state"
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-lock"
    encrypt        = true
  }
}

And for the love of all that is holy, use lifecycle { prevent_destroy = true } on your VPCs, Databases, and S3 buckets. I’ve seen a terraform destroy meant for a dev environment accidentally pointed at production because of an incorrectly set AWS_PROFILE. The prevent_destroy flag is the only thing that will save your job in that scenario.
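The guard rail looks like this (resource body elided; the lifecycle block is the point):

```hcl
# With prevent_destroy, terraform destroy hard-fails on this resource
# instead of deleting the database.
resource "aws_db_instance" "prod" {
  # ... engine, instance_class, storage, etc. ...

  lifecycle {
    prevent_destroy = true
  }
}
```

Note this only protects against Terraform-driven deletion; it does nothing if someone removes the resource block from the code entirely, so pair it with deletion protection on the RDS instance itself.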

10. The Reality of EKS

Everyone wants Kubernetes. Most people don’t need it. EKS is a “managed” service that still requires you to manage the VPC CNI, the CoreDNS configuration, the storage classes, and the upgrade path for the control plane. It is not “set and forget.”

If you must use EKS, use Karpenter instead of the standard Cluster Autoscaler. The Cluster Autoscaler is slow; it waits for the AWS Auto Scaling Group to realize it needs a node, which then waits for the EC2 API. Karpenter talks directly to the EC2 Fleet API and can provision a node in sub-60 seconds. It also handles “bin-packing” better, moving your pods to the cheapest possible instance types automatically.

Pro-tip: EKS Fargate is a trap for high-throughput apps. You pay a premium for the “serverless” nature, but you lose control over networking (no DaemonSets) and it’s significantly more expensive than running managed node groups with Spot instances.

11. Cost Allocation: Tags or Death

If you don’t have a tagging policy, you don’t have a budget. You need to enforce tags at the IAM level. If a resource doesn’t have a Project, Environment, and Owner tag, the RunInstances call should fail. This is the only way to track down who is running that p3.16xlarge instance that’s been idle for three weeks.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EnforceTagging",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "Null": {
          "aws:RequestTag/Project": "true"
        }
      }
    }
  ]
}

Once you have tags, use the AWS Cost and Usage Report (CUR). Cost Explorer is for managers. CUR + Athena is for SREs. It allows you to write SQL queries to find out exactly which S3 bucket is responsible for that $500 “Data Transfer” charge.
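As a starting point, a hedged sketch of that kind of query, assuming the legacy CUR column names and an Athena table named cur (adjust both to your setup):

```sql
-- Which resources are behind the "Data Transfer" line item?
-- Assumes standard legacy-CUR column names and an Athena table named "cur".
SELECT line_item_resource_id,
       line_item_usage_type,
       SUM(line_item_unblended_cost) AS cost
FROM cur
WHERE line_item_usage_type LIKE '%DataTransfer%'
GROUP BY line_item_resource_id, line_item_usage_type
ORDER BY cost DESC
LIMIT 20;
```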

The “Gotcha” Only Experts Know: The us-east-1 Curse

Everyone defaults to us-east-1. It’s the oldest region. It has the most services. It also fails the most. When AWS has a global outage, it almost always starts in us-east-1. If you are building something new, go to us-east-2 (Ohio) or us-west-2 (Oregon). They are newer, the hardware is generally fresher, and they are significantly more stable. If you must stay in us-east-1, ensure your architecture is truly multi-AZ, because “us-east-1a” in one account is not the same physical building as “us-east-1a” in another account; AWS shuffles the AZ letters per account. Use AZ IDs (e.g., use1-az1) to map your infrastructure if you are doing cross-account networking.

AWS is a collection of primitives, not a finished product. The “best” way to use it is to treat every managed service with suspicion, every default setting as a potential bill spike, and every IAM permission as a potential security hole. Stop chasing the hype of new service releases and start focusing on the boring, fundamental work of locking down your network and optimizing your data transfer. Your on-call rotation and your company’s bank account will thank you.

Stop clicking buttons in the console. Write the code. Lock the state. Turn off the NAT Gateway. Go home.
