# AWS Best Practices: The Ultimate Guide to Cloud Success

```text
$ kubectl get pods -n prod-core
NAME                            READY   STATUS                       RESTARTS   AGE
api-gateway-7f5d8b9-x2k9l       0/1     CrashLoopBackOff             42         14m
order-processor-0               0/1     Pending                      0          72h
order-processor-1               0/1     Pending                      0          72h
payment-service-5d4f9c8-m1n2b   0/1     CreateContainerConfigError   0          12m

$ kubectl describe pod order-processor-0
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  3m (x450 over 72h)  default-scheduler  0/3 nodes are available: 3 node(s) had volume node affinity conflict.
  Normal   NotSpecified      12m                 ebs.csi.aws.com    Failed to attach volume "vol-0a1b2c3d4e5f6g7h8": rpc error: code = Internal desc = Could not attach volume "vol-0a1b2c3d4e5f6g7h8" to node "i-0987654321fedcba": AccessDenied: User: arn:aws:sts::123456789012:assumed-role/eks-node-role/i-0987654321fedcba is not authorized to perform: ec2:AttachVolume on resource: arn:aws:ec2:us-east-1:123456789012:volume/vol-0a1b2c3d4e5f6g7h8

$ aws sts get-caller-identity --query "Arn" --output text
arn:aws:iam::123456789012:user/exhausted-sre-who-wants-to-go-home
```

I’m writing this because if I don’t, the "Cloud Architect" who caused this will probably get promoted for "proactive incident response" while I’m still picking the shrapnel of our production environment out of my teeth. We just spent 72 hours on a bridge call because someone thought a Medium article from 2019 was a valid replacement for reading the actual AWS documentation. 

Our "architecture" was a house of cards built on burstable instances, a misunderstanding of how physics works in a multi-AZ environment, and an IAM policy that was basically a "Please Hack Me" sign. If you’re reading this and you’re the one who committed that Terraform code, consider this your formal invitation to stay the hell away from my VPCs.

## The T3/T2 Burst Credit Trap: Why Your Production Cluster is a Zombie

Let’s start with the most basic, amateur-hour mistake: running production EKS worker nodes on `t3.medium` instances. I don’t care what the "aws best" practices guide for cost-optimization says in the "Getting Started" section; if you are running a stateful workload on a burstable instance, you are gambling with the stability of the entire stack.

Here is what happened: Kevin (our architect) saw that `t3.mediums` were cheaper than `m5.large` instances. He figured, "Hey, our average CPU utilization is only 15%, so we’ll save thousands." What he didn't account for—because he doesn't live in the real world—is the CPU Credit Balance. When the order-processor service hit a minor spike, the T3 instances exhausted their credits. In AWS-land, when a T3 runs out of credits, it doesn't just "slow down." It throttles you to a baseline performance that is so abysmal the Kubelet stops responding to the control plane.

The control plane, thinking the node is dead, marks it as `NotReady` and tries to reschedule 400 pods onto the remaining two nodes. Those nodes, already sweating, immediately burn through their own credit balances and die. It’s a cascading failure that no amount of "cloud-native" magic can fix.

If you’re running production, you use M5, C5, or R5 instances. Period. You want predictable performance. You don’t want your infrastructure to turn into a pumpkin at 3:00 AM because of a "credit exhaustion" event.

```bash
# How to check if your "architect" is sabotaging you with burstable instances
aws ec2 describe-instances \
    --filters "Name=instance-type,Values=t2.*,t3.*" \
    --query "Reservations[*].Instances[*].{ID:InstanceId,Type:InstanceType,State:State.Name}" \
    --output table
```

If that command returns anything in your production account, you have a ticking time bomb. Switch to `m5.large` or higher. The $20 you save on the instance bill will be dwarfed by the $50,000 in lost revenue when the cluster enters a death spiral.
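
And while you're waiting for the Terraform apply, you can watch the fuse burn down in CloudWatch. A minimal sketch, reusing the instance ID from the event log above as a placeholder, that pulls the `CPUCreditBalance` metric so you can see how close a node is to being throttled into uselessness (the `date` flags are GNU-style and won't work on macOS):

```bash
# Pull the last 3 hours of CPU credit balance for a burstable node
# (instance ID is a placeholder; requires GNU date)
aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUCreditBalance \
    --dimensions Name=InstanceId,Value=i-0987654321fedcba \
    --start-time "$(date -u -d '3 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 300 \
    --statistics Minimum \
    --output table
```

If the minimum keeps trending toward zero during normal traffic, the node is living on borrowed time.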

## The IAM Policy That Set the House on Fire

We spent six hours of the outage just trying to figure out why the EBS CSI driver couldn’t attach volumes. The error was AccessDenied. Simple, right? Just check the IAM role.

Well, Kevin decided that instead of using IAM Roles for Service Accounts (IRSA), he would just attach a “broad” policy to the EC2 instance profile. But he didn’t want to give it AdministratorAccess (thank god for small favors), so he tried to be “clever” and manually curated a list of permissions. He missed ec2:ModifyVolume and ec2:AttachVolume on specific resource ARNs because he didn’t understand that the CSI driver needs to talk to the EC2 API on behalf of the pod.

Here is the “battle-hardened” reality: Stop using instance profiles for EKS. Use IRSA. It maps an OIDC provider to a Kubernetes service account. It’s the only way to ensure that if one pod gets compromised, the attacker doesn’t have the keys to the entire kingdom.
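
If you're setting this up from scratch, eksctl will do the OIDC-provider and service-account wiring for you. This is a rough sketch with placeholder names (cluster name, policy ARN), not a copy of our actual config:

```bash
# One-time per cluster: register the cluster's OIDC provider with IAM
eksctl utils associate-iam-oidc-provider \
    --cluster my-cluster \
    --approve

# Create an IAM role bound to the EBS CSI controller's service account (IRSA).
# The policy ARN is a placeholder for the scoped policy shown below.
eksctl create iamserviceaccount \
    --cluster my-cluster \
    --namespace kube-system \
    --name ebs-csi-controller-sa \
    --attach-policy-arn arn:aws:iam::123456789012:policy/EbsCsiScoped \
    --approve
```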

But even with IRSA, people screw up the JSON. They use Resource: "*" because they’re lazy. In our case, the policy was so restrictive it was useless, then so broad it was dangerous. Here is what a real, functional, and secure EBS CSI policy looks like—not the garbage you find in a “Hello World” repo:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:CreateSnapshot",
        "ec2:AttachVolume",
        "ec2:DetachVolume",
        "ec2:ModifyVolume",
        "ec2:DescribeAvailabilityZones",
        "ec2:DescribeInstances",
        "ec2:DescribeSnapshots",
        "ec2:DescribeTags",
        "ec2:DescribeVolumes",
        "ec2:DescribeVolumesModifications"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:RequestedRegion": "us-east-1"
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2:CreateTags"
      ],
      "Resource": [
        "arn:aws:ec2:*:*:volume/*",
        "arn:aws:ec2:*:*:snapshot/*"
      ],
      "Condition": {
        "StringEquals": {
          "ec2:CreateAction": [
            "CreateVolume",
            "CreateSnapshot"
          ]
        }
      }
    }
  ]
}
```

Notice the Condition block. If you aren’t using Condition keys in your IAM policies, you aren’t doing “aws best” practices; you’re just playing with matches. We restricted it to us-east-1 because we don’t operate in other regions. If an API call comes from us-west-2, I want it to fail hard.
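
Wiring it up is two calls: create the managed policy from that JSON, then attach it to the IRSA role, not the instance profile. A sketch, assuming the JSON above is saved as `ebs-csi-policy.json` and the role name is whatever your IRSA tooling generated:

```bash
# Create a managed policy from the JSON above (file name is an assumption)
aws iam create-policy \
    --policy-name EbsCsiScoped \
    --policy-document file://ebs-csi-policy.json

# Attach it to the IRSA role used by the EBS CSI controller (role name is a placeholder)
aws iam attach-role-policy \
    --role-name ebs-csi-irsa-role \
    --policy-arn arn:aws:iam::123456789012:policy/EbsCsiScoped
```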

## Why Your Multi-AZ Strategy is a Lie

“We’re highly available! We’re across three Availability Zones!”

That’s what the slide deck said. But here’s the thing about EBS volumes: they are locked to a single Availability Zone. If your pod is in us-east-1a and it creates a Persistent Volume (PV), that PV lives in us-east-1a. If that AZ goes down, or if the node in that AZ dies and the scheduler tries to move your pod to us-east-1b, the pod will stay in Pending forever.

Why? Because a pod in 1b cannot mount a disk in 1a. Physics is a bitch.

Kevin’s “architecture” didn’t use `volumeBindingMode: WaitForFirstConsumer` in the StorageClass. It used `Immediate`. So the PVCs were being provisioned in random AZs before the pods were even scheduled. We had pods in us-east-1a trying to reach across the data center to mount disks in us-east-1c.

The fix? You need to define your StorageClass properly. You need to tell Kubernetes to wait until it knows where the pod is going to live before it carves out the storage.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-sc-wait
provisioner: ebs.csi.aws.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
  throughput: "125"
  iops: "3000"
```

And for the love of all that is holy, use gp3. If I see another gp2 volume with its archaic IOPS-to-size ratio, I’m going to lose it. gp3 gives you 3,000 IOPS baseline regardless of size. Kevin had us on gp2 with 10GB volumes, which meant we were getting a pathetic 100 IOPS. No wonder the database was crawling.
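
The gp2-to-gp3 migration is an online `modify-volume` call; no detach, no downtime. A sketch, reusing the volume ID from the event log as a placeholder:

```bash
# Convert an existing gp2 volume to gp3 in place (volume ID is a placeholder)
aws ec2 modify-volume \
    --volume-id vol-0a1b2c3d4e5f6g7h8 \
    --volume-type gp3 \
    --iops 3000 \
    --throughput 125

# Watch the modification progress
aws ec2 describe-volumes-modifications \
    --volume-ids vol-0a1b2c3d4e5f6g7h8 \
    --query "VolumesModifications[*].{State:ModificationState,Progress:Progress}" \
    --output table
```

Then fix the type in the StorageClass and the Terraform so new volumes don't regress to gp2.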

## The Cross-Region Data Transfer Tax: A $50,000 Lesson

During the bridge call, someone noticed the “Estimated Charges” widget in the Billing Dashboard was climbing faster than the heart rate of a junior dev who just ran rm -rf / on the production DB. We were hemorrhaging money.

The culprit? Cross-region data transfer. Kevin had set up a “Global” S3 bucket in us-west-2 but all our compute was in us-east-1. He also set up a VPC Peering connection to a legacy environment in eu-central-1 and was running a “sync” job every ten minutes.

AWS charges you for every byte that leaves a region. It also charges you for every byte that crosses an AZ boundary, and a NAT Gateway adds its own per-GB processing fee on top of that. We were pushing terabytes of logs through a NAT Gateway in us-east-1a to an S3 bucket in another region.

The “aws best” way to handle this isn’t to just “be careful.” It’s to use VPC Endpoints. If you are hitting S3 or DynamoDB from within a VPC, you use a Gateway Endpoint. It’s free. It keeps the traffic on the AWS internal network. If you are hitting other services (like EC2, ELB, or Kinesis), you use an Interface Endpoint (PrivateLink). Yes, Interface Endpoints cost money per hour, but they are significantly cheaper than the $0.045 per GB you pay to shove that data through a NAT Gateway.

```bash
# Check if you're actually using VPC Endpoints or if you're burning money
aws ec2 describe-vpc-endpoints \
    --query "VpcEndpoints[*].{ID:VpcEndpointId,Service:ServiceName,Type:VpcEndpointType}" \
    --output table
```

If you don’t see com.amazonaws.us-east-1.s3 in that list, you are literally setting money on fire. We weren’t using them. Kevin said they were “too complex to manage via Terraform.” I’ve seen more complex things in a Lego set.
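
For the record, the “too complex” Gateway Endpoint is one API call (or a handful of lines of Terraform). A sketch with placeholder VPC and route table IDs:

```bash
# Create the free S3 Gateway Endpoint and wire it into the private route tables
# (VPC ID and route table IDs are placeholders)
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0123456789abcdef0 \
    --vpc-endpoint-type Gateway \
    --service-name com.amazonaws.us-east-1.s3 \
    --route-table-ids rtb-0aaa1111bbb2222cc rtb-0ddd3333eee4444ff
```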

## S3 Bucket Policies: Public by Default, Secure by Accident

While we were debugging the storage issues, we found out that the “backup” bucket Kevin created was wide open. Not “public” in the sense that the whole world could see it (thankfully the account-level “Block Public Access” was on), but the bucket policy was so poorly written that any IAM user in the entire organization could read our production database dumps.

This is the “standard” way people write S3 policies when they’re in a hurry: `"Principal": { "AWS": "*" }` with a `Condition` that checks the VPC ID.

That is garbage. If someone creates a rogue EC2 instance in your VPC, they have access. If someone misconfigures a Lambda, they have access.

You need to use Deny statements. In IAM, an explicit Deny always wins. You should have a policy that denies everything unless it comes from a specific IAM role or a specific VPC Endpoint ID.

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowSpecificRoleOnly",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789012:role/backup-worker-role"
            },
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::prod-backups-deadly-serious/*"
        },
        {
            "Sid": "DenyNonVPCEndpointAccess",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::prod-backups-deadly-serious",
                "arn:aws:s3:::prod-backups-deadly-serious/*"
            ],
            "Condition": {
                "StringNotEquals": {
                    "aws:sourceVpce": "vpce-0a1b2c3d4e5f6g7h8"
                }
            }
        }
    ]
}
```

This policy says: “Only the backup-worker-role can touch this, and ONLY if the request comes through our specific VPC Endpoint.” That is how you sleep at night. Kevin didn’t do this. Kevin used a tutorial that told him to “just use the console to make it work.”
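
Applying it is one call. A sketch, assuming the policy above is saved locally as `bucket-policy.json`:

```bash
# Apply the deny-by-default policy to the backups bucket
aws s3api put-bucket-policy \
    --bucket prod-backups-deadly-serious \
    --policy file://bucket-policy.json
```

One caveat: once that Deny lands, API calls that don’t come through the VPC Endpoint get blocked too, including yours, so either apply it from somewhere behind the endpoint or carve out an exception for your admin role before you lock yourself out.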

## Service Quotas and the “Soft Limit” Myth

The final blow to our 72-hour marathon was when we tried to scale up the cluster to handle the backlog of 50 million stuck orders. We hit the L-12160769 quota. For those who don’t speak fluent AWS Quota-ese, that’s the “Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances” limit.

We needed 50 more nodes. AWS said “No.”

Kevin assumed that because we are a “big customer,” the limits would magically expand. They don’t. You have to request them. And during a regional outage or a high-demand period, those requests can take hours to process.

The “aws best” practice here is to use the Service Quotas API to monitor your usage against your limits before you need them. You should have CloudWatch Alarms on your quotas. If you’re at 80% of your instance limit, you should be alerted.

```bash
# How to see how close you are to failing
aws service-quotas get-service-quota \
    --service-code ec2 \
    --quota-code L-12160769 \
    --query "Quota.{Value:Value,Name:QuotaName}"
```

We were at 98/100. We needed 150. We sat there for four hours waiting for a support engineer in Dublin to click “Approve” while the business lost $10,000 a minute.
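
Here is a rough sketch of the kind of check you can cron or drop into CI: compare the running instance count against the quota value and yell before you hit the wall. The quota code is the one from above; adjust it if your account is on the newer vCPU-based limits.

```bash
#!/usr/bin/env bash
# Warn when the running On-Demand instance count crosses 80% of the service quota.
set -euo pipefail

LIMIT=$(aws service-quotas get-service-quota \
    --service-code ec2 \
    --quota-code L-12160769 \
    --query "Quota.Value" --output text)
LIMIT=${LIMIT%.*}   # strip the trailing ".0"

RUNNING=$(aws ec2 describe-instances \
    --filters "Name=instance-state-name,Values=running" \
    --query "length(Reservations[].Instances[])" --output text)

echo "Running instances: ${RUNNING} / ${LIMIT}"

if (( RUNNING * 100 >= LIMIT * 80 )); then
    echo "WARNING: over 80% of the instance quota -- file the increase request now"
fi
```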

## The Reality of StatefulSets in EKS

Let’s talk about the order-processor service. It’s a StatefulSet. Kevin followed a tutorial that used emptyDir for “testing” and then just… left it that way for production.

When the nodes started cycling because of the CPU credit exhaustion, the order-processor pods died. Because they were using emptyDir, all the local state—the buffer of orders that hadn’t been committed to the DB yet—was wiped out.

If you are running a StatefulSet, you use volumeClaimTemplates. You don’t use hostPath, and you certainly don’t use emptyDir for anything you care about. You also need to set a podAntiAffinity rule. Kevin had all three replicas of the order-processor running on the same t3.medium node. When that node died, the entire service went dark.

A battle-hardened StatefulSet looks like this:

```yaml
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - order-processor
            topologyKey: "kubernetes.io/hostname"
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "ebs-sc-wait"
      resources:
        requests:
          storage: 100Gi
```

This ensures that Kubernetes spreads your pods across different nodes. If one node catches fire, the others survive. It’s not rocket science; it’s just basic operational competence.
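
Trust, but verify. A quick check, assuming the `app=order-processor` label from the manifest above, that the replicas actually landed on different nodes and zones:

```bash
# Where did the replicas actually land?
kubectl get pods -n prod-core -l app=order-processor \
    -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,STATUS:.status.phase

# Map nodes to Availability Zones
kubectl get nodes -L topology.kubernetes.io/zone
```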

## Checklist for the Cynical

I’m going to go sleep for 14 hours. When I come back, if I see any of these issues in the dev environment, I’m revoking everyone’s IAM permissions and we’re going back to deploying via FTP on a single VPS.

Here is what you check before you even think about tagging me in a PR:

  1. Instance Types: Are you using `t2` or `t3` in production? If yes, delete it. Use `m5` or `c5`. No exceptions.
  2. EBS CSI Driver: Is your StorageClass set to `volumeBindingMode: WaitForFirstConsumer`? If it’s `Immediate`, you’re going to have AZ affinity nightmares.
  3. VPC Endpoints: Do you have a Gateway Endpoint for S3? Check the routing table. If the traffic to S3 is going through a NAT Gateway, you are failing.
  4. IAM IRSA: Are you using instance profiles for pods? Stop. Use OIDC and IRSA. Map the role to the ServiceAccount.
  5. S3 Deny Policies: Does your bucket policy have an explicit `Deny` for any traffic not originating from your VPC Endpoint? If not, your data is one misconfiguration away from being leaked.
  6. Service Quotas: Have you checked your limits lately? Use the CLI. Don’t wait for the `InstanceLimitExceeded` error to find out you’re capped.
  7. GP3 Volumes: Are you still using `gp2`? Why? `gp3` is cheaper and faster. Change the type in your Terraform and move on.
  8. NAT Gateway Costs: Check your Cost Explorer. Group by “Usage Type” (see the sketch after this list). If “DataTransfer-Regional-Bytes” is in your top 5, your architecture is inefficient and you’re wasting the company’s money.
  9. Anti-Affinity: Are your critical pods spread across nodes and AZs? If `kubectl get pods -o wide` shows all your replicas on the same node, you don’t have HA; you have a suicide pact.
  10. Logging: Are you sending logs to CloudWatch via the public internet? Use a VPC Endpoint for `logs.us-east-1.amazonaws.com`.
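
For item 8, you don’t even need to open the console. A sketch (the dates are placeholders) that pulls a month of spend grouped by usage type, sorted so the NAT Gateway and data transfer line items have nowhere to hide:

```bash
# Top 5 usage types by cost for a given month (dates are placeholders)
aws ce get-cost-and-usage \
    --time-period Start=2024-05-01,End=2024-06-01 \
    --granularity MONTHLY \
    --metrics UnblendedCost \
    --group-by Type=DIMENSION,Key=USAGE_TYPE \
    --query "ResultsByTime[0].Groups | sort_by(@, &to_number(Metrics.UnblendedCost.Amount)) | reverse(@) | [:5].{UsageType:Keys[0],USD:Metrics.UnblendedCost.Amount}" \
    --output table
```

If anything with NatGateway or DataTransfer in the name is near the top, go re-read the VPC Endpoint section.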

The “aws best” way to build things isn’t the way that looks good in a marketing brochure. It’s the way that survives when the us-east-1 API starts throwing 500 errors and the network latency spikes to 500ms. It’s ugly, it’s verbose, and it requires you to actually understand the underlying infrastructure.

Now, if you’ll excuse me, I have to go explain to the CTO why we spent $50k on “data transfer” because Kevin didn’t know what a VPC Endpoint was.

Post-Mortem Status: Closed.
System Health: Fragile.
SRE Mood: Resigned.
