{"id":4496,"date":"2026-02-08T21:05:53","date_gmt":"2026-02-08T15:35:53","guid":{"rendered":"https:\/\/itsupportwale.com\/blog\/aws-best-practices-the-ultimate-guide-to-cloud-success\/"},"modified":"2026-02-08T21:05:53","modified_gmt":"2026-02-08T15:35:53","slug":"aws-best-practices-the-ultimate-guide-to-cloud-success","status":"publish","type":"post","link":"https:\/\/itsupportwale.com\/blog\/aws-best-practices-the-ultimate-guide-to-cloud-success\/","title":{"rendered":"AWS Best Practices: The Ultimate Guide to Cloud Success"},"content":{"rendered":"<pre class=\"codehilite\"><code class=\"language-text\">$ kubectl get pods -n prod-core\nNAME                                     READY   STATUS                            RESTARTS   AGE\napi-gateway-7f5d8b9-x2k9l                0\/1     CrashLoopBackOff                  42         14m\norder-processor-0                        0\/1     Pending                           0          72h\norder-processor-1                        0\/1     Pending                           0          72h\npayment-service-5d4f9c8-m1n2b            0\/1     CreateContainerConfigError        0          12m\n\n$ kubectl describe pod order-processor-0\nEvents:\n  Type     Reason            Age                  From               Message\n  ----     ------            ----                 ----               -------\n  Warning  FailedScheduling  3m (x450 over 72h)   default-scheduler  0\/3 nodes are available: 3 node(s) had volume node affinity conflict.\n  Normal   NotSpecified      12m                  ebs.csi.aws.com    Failed to attach volume &quot;vol-0a1b2c3d4e5f6g7h8&quot;: rpc error: code = Internal desc = Could not attach volume &quot;vol-0a1b2c3d4e5f6g7h8&quot; to node &quot;i-0987654321fedcba&quot;: AccessDenied: User: arn:aws:sts::123456789012:assumed-role\/eks-node-role\/i-0987654321fedcba is not authorized to perform: ec2:AttachVolume on resource: arn:aws:ec2:us-east-1:123456789012:volume\/vol-0a1b2c3d4e5f6g7h8\n\n$ aws sts get-caller-identity --query &quot;Arn&quot; --output text\narn:aws:iam::123456789012:user\/exhausted-sre-who-wants-to-go-home\n<\/code><\/pre>\n<p>I\u2019m writing this because if I don\u2019t, the &#8220;Cloud Architect&#8221; who caused this will probably get promoted for &#8220;proactive incident response&#8221; while I\u2019m still picking the shrapnel of our production environment out of my teeth. We just spent 72 hours on a bridge call because someone thought a Medium article from 2019 was a valid replacement for reading the actual AWS documentation.<\/p>\n<p>Our &#8220;architecture&#8221; was a house of cards built on burstable instances, a misunderstanding of how physics works in a multi-AZ environment, and an IAM policy that was basically a &#8220;Please Hack Me&#8221; sign. If you\u2019re reading this and you\u2019re the one who committed that Terraform code, consider this your formal invitation to stay the hell away from my VPCs.<\/p>\n<h2>The T3\/T2 Burst Credit Trap: Why Your Production Cluster is a Zombie<\/h2>\n<p>Let\u2019s start with the most basic, amateur-hour mistake: running production EKS worker nodes on <code>t3.medium<\/code> instances. I don\u2019t care what the &#8220;aws best&#8221; practices guide for cost-optimization says in the &#8220;Getting Started&#8221; section; if you are running a stateful workload on a burstable instance, you are gambling with the stability of the entire stack.<\/p>\n<p>Here is what happened: Kevin (our architect) saw that <code>t3.medium<\/code> instances were cheaper than <code>m5.large<\/code> instances. He figured, &#8220;Hey, our average CPU utilization is only 15%, so we\u2019ll save thousands.&#8221; What he didn&#8217;t account for, because he doesn&#8217;t live in the real world, is the CPU Credit Balance. When the order-processor service hit a minor spike, the T3 instances exhausted their credits. 
In AWS-land, when a T3 in standard credit mode runs out of credits, it doesn&#8217;t just &#8220;slow down.&#8221; It throttles you to a baseline performance that is so abysmal the Kubelet stops responding to the control plane. (T3 launches in unlimited mode by default, which keeps bursting and bills you for surplus credits instead of throttling; the throttling described here is what you get in standard mode.)<\/p>\n<p>The control plane, thinking the node is dead, marks it as <code>NotReady<\/code> and tries to reschedule 400 pods onto the remaining two nodes. Those nodes, already sweating, immediately burn through their own credit balances and die. It\u2019s a cascading failure that no amount of &#8220;cloud-native&#8221; magic can fix.<\/p>\n<p>If you\u2019re running production, you use M5, C5, or R5 instances. Period. You want predictable performance. You don\u2019t want your infrastructure to turn into a pumpkin at 3:00 AM because of a &#8220;credit exhaustion&#8221; event.<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\"># How to check if your &quot;architect&quot; is sabotaging you with burstable instances\naws ec2 describe-instances \\\n    --filters &quot;Name=instance-type,Values=t2.*,t3.*&quot; \\\n    --query &quot;Reservations[*].Instances[*].{ID:InstanceId,Type:InstanceType,State:State.Name}&quot; \\\n    --output table\n<\/code><\/pre>\n<p>If that command returns anything in your production account, you have a ticking time bomb. Switch to <code>m5.large<\/code> or higher. 
The $20 you save on the instance bill will be dwarfed by the $50,000 in lost revenue when the cluster enters a death spiral.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_IAM_Policy_That_Set_the_House_on_Fire\"><\/span>The IAM Policy That Set the House on Fire<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>We spent six hours of the outage just 
trying to figure out why the EBS CSI driver couldn&#8217;t attach volumes. The error was <code>AccessDenied<\/code>. Simple, right? Just check the IAM role. <\/p>\n<p>Well, Kevin decided that instead of using IAM Roles for Service Accounts (IRSA), he would just attach a &#8220;broad&#8221; policy to the EC2 instance profile. But he didn&#8217;t want to give it <code>AdministratorAccess<\/code> (thank god for small favors), so he tried to be &#8220;clever&#8221; and manually curated a list of permissions. He missed <code>ec2:ModifyVolume<\/code> and <code>ec2:AttachVolume<\/code> on specific resource ARNs because he didn&#8217;t understand that the CSI driver calls the EC2 API itself to attach the pod&#8217;s volume to the node.<\/p>\n<p>Here is the &#8220;battle-hardened&#8221; reality: Stop using instance profiles for pod permissions in EKS. Use IRSA. It binds an IAM role to a Kubernetes service account through the cluster&#8217;s OIDC provider, so each workload gets only its own role. It\u2019s how you make sure that if one pod gets compromised, the attacker doesn&#8217;t have the keys to the entire kingdom.<\/p>\n<p>But even with IRSA, people screw up the JSON. They use <code>Resource: \"*\"<\/code> with no conditions because they\u2019re lazy. In our case, the policy was so restrictive it was useless, then so broad it was dangerous. 
Here is what a real, functional, and secure EBS CSI policy looks like, not the garbage you find in a &#8220;Hello World&#8221; repo:<\/p>\n<pre class=\"codehilite\"><code class=\"language-json\">{\n  &quot;Version&quot;: &quot;2012-10-17&quot;,\n  &quot;Statement&quot;: [\n    {\n      &quot;Effect&quot;: &quot;Allow&quot;,\n      &quot;Action&quot;: [\n        &quot;ec2:CreateVolume&quot;,\n        &quot;ec2:DeleteVolume&quot;,\n        &quot;ec2:CreateSnapshot&quot;,\n        &quot;ec2:DeleteSnapshot&quot;,\n        &quot;ec2:AttachVolume&quot;,\n        &quot;ec2:DetachVolume&quot;,\n        &quot;ec2:ModifyVolume&quot;,\n        &quot;ec2:DescribeAvailabilityZones&quot;,\n        &quot;ec2:DescribeInstances&quot;,\n        &quot;ec2:DescribeSnapshots&quot;,\n        &quot;ec2:DescribeTags&quot;,\n        &quot;ec2:DescribeVolumes&quot;,\n        &quot;ec2:DescribeVolumesModifications&quot;\n      ],\n      &quot;Resource&quot;: &quot;*&quot;,\n      &quot;Condition&quot;: {\n        &quot;StringEquals&quot;: {\n          &quot;aws:RequestedRegion&quot;: &quot;us-east-1&quot;\n        }\n      }\n    },\n    {\n      &quot;Effect&quot;: &quot;Allow&quot;,\n      &quot;Action&quot;: [\n        &quot;ec2:CreateTags&quot;\n      ],\n      &quot;Resource&quot;: [\n        &quot;arn:aws:ec2:*:*:volume\/*&quot;,\n        &quot;arn:aws:ec2:*:*:snapshot\/*&quot;\n      ],\n      &quot;Condition&quot;: {\n        &quot;StringEquals&quot;: {\n          &quot;ec2:CreateAction&quot;: [\n            &quot;CreateVolume&quot;,\n            &quot;CreateSnapshot&quot;\n          ]\n        }\n      }\n    }\n  ]\n}\n<\/code><\/pre>\n<p>Notice the <code>Condition<\/code> block. The <code>Describe*<\/code> calls force <code>Resource: \"*\"<\/code>, so the conditions are what keep this policy from being a blank check. If you aren&#8217;t using <code>Condition<\/code> keys in your IAM policies, you aren&#8217;t doing &#8220;aws best&#8221; practices; you&#8217;re just playing with matches. We restricted it to <code>us-east-1<\/code> because we don&#8217;t operate in other regions. 
If an API call targets <code>us-west-2<\/code>, I want it to fail hard.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Why_Your_Multi-AZ_Strategy_is_a_Lie\"><\/span>Why Your Multi-AZ Strategy is a Lie<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>&#8220;We\u2019re highly available! We\u2019re across three Availability Zones!&#8221; <\/p>\n<p>That\u2019s what the slide deck said. But here\u2019s the thing about EBS volumes: they are locked to a single Availability Zone. If your pod is in <code>us-east-1a<\/code> and it creates a Persistent Volume (PV), that PV lives in <code>us-east-1a<\/code>. If that AZ goes down, or if the node in that AZ dies and the scheduler tries to move your pod to <code>us-east-1b<\/code>, the pod will stay in <code>Pending<\/code> forever. <\/p>\n<p>Why? Because a pod in <code>1b<\/code> cannot mount a disk in <code>1a<\/code>. Physics is a bitch.<\/p>\n<p>Kevin\u2019s &#8220;architecture&#8221; didn&#8217;t use <code>volumeBindingMode: WaitForFirstConsumer<\/code> in the StorageClass. It used <code>Immediate<\/code>. So the PVs were being provisioned in random AZs before the pods were even scheduled. We had pods in <code>us-east-1a<\/code> trying to reach across the data center to mount disks in <code>us-east-1c<\/code>. <\/p>\n<p>The fix? You need to define your StorageClass properly. You need to tell Kubernetes to wait until it knows where the pod is going to live before it carves out the storage.<\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\">apiVersion: storage.k8s.io\/v1\nkind: StorageClass\nmetadata:\n  name: ebs-sc-wait\nprovisioner: ebs.csi.aws.com\nreclaimPolicy: Delete\nvolumeBindingMode: WaitForFirstConsumer\nparameters:\n  type: gp3\n  throughput: &quot;125&quot;\n  iops: &quot;3000&quot;\n<\/code><\/pre>\n<p>And for the love of all that is holy, use <code>gp3<\/code>. If I see another <code>gp2<\/code> volume with its archaic IOPS-to-size ratio, I\u2019m going to lose it. 
<code>gp3<\/code> gives you 3,000 IOPS baseline regardless of size. Kevin had us on <code>gp2<\/code> with 10 GiB volumes, which meant we were getting a pathetic baseline of 100 IOPS. No wonder the database was crawling.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_Cross-Region_Data_Transfer_Tax_A_50000_Lesson\"><\/span>The Cross-Region Data Transfer Tax: A $50,000 Lesson<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>During the bridge call, someone noticed the &#8220;Estimated Charges&#8221; widget in the Billing Dashboard was climbing faster than the heart rate of a junior dev who just ran <code>rm -rf \/<\/code> on the production DB. We were hemorrhaging money.<\/p>\n<p>The culprit? Cross-region data transfer. Kevin had set up a &#8220;Global&#8221; S3 bucket in <code>us-west-2<\/code> but all our compute was in <code>us-east-1<\/code>. He also set up a VPC Peering connection to a legacy environment in <code>eu-central-1<\/code> and was running a &#8220;sync&#8221; job every ten minutes.<\/p>\n<p>AWS charges you for every byte that leaves a region. It also charges a per-GB processing fee for every byte that goes through a NAT Gateway, plus inter-AZ transfer when the traffic has to cross an AZ boundary to reach that gateway. We were pushing terabytes of logs through a NAT Gateway in <code>us-east-1a<\/code> to an S3 bucket in another region. <\/p>\n<p>The &#8220;aws best&#8221; way to handle this isn&#8217;t to just &#8220;be careful.&#8221; It\u2019s to use VPC Endpoints. If you are hitting S3 or DynamoDB from within a VPC, you use a <strong>Gateway Endpoint<\/strong>. It\u2019s free. It keeps the traffic on the AWS internal network. (Gateway Endpoints only cover buckets in the same region as the VPC, so the bucket has to live next to the compute too.) If you are hitting other services (like EC2, ELB, or Kinesis), you use an <strong>Interface Endpoint<\/strong> (PrivateLink). 
Yes, Interface Endpoints cost money per hour plus a smaller per-GB fee, but that is still significantly cheaper than the $0.045 per GB you pay to shove that data through a NAT Gateway.<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\"># Check if you're actually using VPC Endpoints or if you're burning money\naws ec2 describe-vpc-endpoints \\\n    --query &quot;VpcEndpoints[*].{ID:VpcEndpointId,Service:ServiceName,Type:VpcEndpointType}&quot; \\\n    --output table\n<\/code><\/pre>\n<p>If you don&#8217;t see <code>com.amazonaws.us-east-1.s3<\/code> in that list, you are literally setting money on fire. We weren&#8217;t using them. Kevin said they were &#8220;too complex to manage via Terraform.&#8221; I\u2019ve seen more complex things in a Lego set.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"S3_Bucket_Policies_Public_by_Default_Secure_by_Accident\"><\/span>S3 Bucket Policies: Public by Default, Secure by Accident<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>While we were debugging the storage issues, we found out that the &#8220;backup&#8221; bucket Kevin created was wide open. Not &#8220;public&#8221; in the sense that the whole world could see it (thankfully the account-level &#8220;Block Public Access&#8221; was on), but the bucket policy was so poorly written that any IAM user in the entire organization could read our production database dumps.<\/p>\n<p>This is the &#8220;standard&#8221; way people write S3 policies when they&#8217;re in a hurry:<br \/>\n<code>\"Principal\": { \"AWS\": \"*\" }<\/code> with a <code>Condition<\/code> that checks <code>aws:SourceVpc<\/code>. <\/p>\n<p>That is garbage. If someone creates a rogue EC2 instance in your VPC, they have access. If someone misconfigures a Lambda, they have access. <\/p>\n<p>You need to use <code>Deny<\/code> statements. In IAM, an explicit <code>Deny<\/code> always wins. 
You should have a policy that allows a specific IAM role and denies everything that does not come through a specific VPC Endpoint ID.<\/p>\n<pre class=\"codehilite\"><code class=\"language-json\">{\n    &quot;Version&quot;: &quot;2012-10-17&quot;,\n    &quot;Statement&quot;: [\n        {\n            &quot;Sid&quot;: &quot;AllowSpecificRoleOnly&quot;,\n            &quot;Effect&quot;: &quot;Allow&quot;,\n            &quot;Principal&quot;: {\n                &quot;AWS&quot;: &quot;arn:aws:iam::123456789012:role\/backup-worker-role&quot;\n            },\n            &quot;Action&quot;: [\n                &quot;s3:GetObject&quot;,\n                &quot;s3:PutObject&quot;\n            ],\n            &quot;Resource&quot;: &quot;arn:aws:s3:::prod-backups-deadly-serious\/*&quot;\n        },\n        {\n            &quot;Sid&quot;: &quot;DenyNonVPCEndpointAccess&quot;,\n            &quot;Effect&quot;: &quot;Deny&quot;,\n            &quot;Principal&quot;: &quot;*&quot;,\n            &quot;Action&quot;: &quot;s3:*&quot;,\n            &quot;Resource&quot;: [\n                &quot;arn:aws:s3:::prod-backups-deadly-serious&quot;,\n                &quot;arn:aws:s3:::prod-backups-deadly-serious\/*&quot;\n            ],\n            &quot;Condition&quot;: {\n                &quot;StringNotEquals&quot;: {\n                    &quot;aws:SourceVpce&quot;: &quot;vpce-0a1b2c3d4e5f6g7h8&quot;\n                }\n            }\n        }\n    ]\n}\n<\/code><\/pre>\n<p>This policy says: &#8220;The backup-worker-role gets in, and NOTHING gets in unless the request comes through our specific VPC Endpoint.&#8221; Be aware that the <code>Deny<\/code> also locks out the console and any admin who is not behind that endpoint, and that other principals in the same account with their own S3 permissions can still get in through it, so keep the identity policies tight as well. That is how you sleep at night. Kevin didn&#8217;t do this. 
Kevin used a tutorial that told him to &#8220;just use the console to make it work.&#8221;<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Service_Quotas_and_the_%E2%80%9CSoft_Limit%E2%80%9D_Myth\"><\/span>Service Quotas and the &#8220;Soft Limit&#8221; Myth<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The final blow to our 72-hour marathon was when we tried to scale up the cluster to handle the backlog of 50 million stuck orders. We hit the <code>L-1216C47A<\/code> quota. For those who don&#8217;t speak fluent AWS Quota-ese, that\u2019s the &#8220;Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances&#8221; limit, and it is counted in vCPUs, not instances.<\/p>\n<p>We needed 50 more nodes. AWS said &#8220;No.&#8221; <\/p>\n<p>Kevin assumed that because we are a &#8220;big customer,&#8221; the limits would magically expand. They don&#8217;t. You have to request them. And during a regional outage or a high-demand period, those requests can take hours to process.<\/p>\n<p>The &#8220;aws best&#8221; practice here is to use the Service Quotas API to monitor your usage against your limits <em>before<\/em> you need them. You should have CloudWatch Alarms on your quotas. If you\u2019re at 80% of your instance limit, you should be alerted.<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\"># Show the current limit (this returns the quota value only, not your usage;\n# compare it against the vCPUs you are actually running)\naws service-quotas get-service-quota \\\n    --service-code ec2 \\\n    --quota-code L-1216C47A \\\n    --query &quot;Quota.{Value:Value,Name:QuotaName}&quot;\n<\/code><\/pre>\n<p>We were at 98\/100. We needed 150. We sat there for four hours waiting for a support engineer in Dublin to click &#8220;Approve&#8221; while the business lost $10,000 a minute. <\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_Reality_of_StatefulSets_in_EKS\"><\/span>The Reality of StatefulSets in EKS<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Let\u2019s talk about the <code>order-processor<\/code> service. It\u2019s a StatefulSet. 
Kevin followed a tutorial that used <code>emptyDir<\/code> for &#8220;testing&#8221; and then just&#8230; left it that way for production. <\/p>\n<p>When the nodes started cycling because of the CPU credit exhaustion, the <code>order-processor<\/code> pods died. Because they were using <code>emptyDir<\/code>, all the local state (the buffer of orders that hadn&#8217;t been committed to the DB yet) was wiped out. <\/p>\n<p>If you are running a StatefulSet, you use <code>volumeClaimTemplates<\/code>. You don&#8217;t use <code>hostPath<\/code>, and you certainly don&#8217;t use <code>emptyDir<\/code> for anything you care about. You also need to set a <code>podAntiAffinity<\/code> rule. Kevin had all three replicas of the <code>order-processor<\/code> running on the same <code>t3.medium<\/code> node. When that node died, the entire service went dark.<\/p>\n<p>A battle-hardened StatefulSet looks like this:<\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\">spec:\n  template:\n    spec:\n      # affinity belongs to the pod template, not the StatefulSet spec itself\n      affinity:\n        podAntiAffinity:\n          requiredDuringSchedulingIgnoredDuringExecution:\n          - labelSelector:\n              matchExpressions:\n              - key: app\n                operator: In\n                values:\n                - order-processor\n            topologyKey: &quot;kubernetes.io\/hostname&quot;\n  # volumeClaimTemplates sits at the StatefulSet spec level\n  volumeClaimTemplates:\n  - metadata:\n      name: data\n    spec:\n      accessModes: [ &quot;ReadWriteOnce&quot; ]\n      storageClassName: &quot;ebs-sc-wait&quot;\n      resources:\n        requests:\n          storage: 100Gi\n<\/code><\/pre>\n<p>This ensures that Kubernetes never schedules two replicas on the same node, and that each replica gets its own persistent volume. If one node catches fire, the others survive. It\u2019s not rocket science; it\u2019s just basic operational competence.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Checklist_for_the_Cynical\"><\/span>Checklist for the Cynical<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>I\u2019m going to go sleep for 14 hours. 
When I come back, if I see any of these issues in the <code>dev<\/code> environment, I\u2019m revoking everyone\u2019s IAM permissions and we\u2019re going back to deploying via FTP on a single VPS. <\/p>\n<p>Here is what you check before you even think about tagging me in a PR:<\/p>\n<ol>\n<li><strong>Instance Types:<\/strong> Are you using <code>t2<\/code> or <code>t3<\/code> in production? If yes, replace them. Use <code>m5<\/code> or <code>c5<\/code>. No exceptions.<\/li>\n<li><strong>EBS CSI Driver:<\/strong> Is your StorageClass set to <code>volumeBindingMode: WaitForFirstConsumer<\/code>? If it\u2019s <code>Immediate<\/code>, you\u2019re going to have AZ affinity nightmares.<\/li>\n<li><strong>VPC Endpoints:<\/strong> Do you have a Gateway Endpoint for S3? Check the routing table. If the traffic to S3 is going through a NAT Gateway, you are failing.<\/li>\n<li><strong>IAM IRSA:<\/strong> Are you using instance profiles for pods? Stop. Use OIDC and IRSA. Map the role to the ServiceAccount.<\/li>\n<li><strong>S3 Deny Policies:<\/strong> Does your bucket policy have an explicit <code>Deny<\/code> for any traffic not originating from your VPC Endpoint? If not, your data is one misconfiguration away from being leaked.<\/li>\n<li><strong>Service Quotas:<\/strong> Have you checked your limits lately? Use the CLI. Don&#8217;t wait for the <code>VcpuLimitExceeded<\/code> error to find out you&#8217;re capped.<\/li>\n<li><strong>GP3 Volumes:<\/strong> Are you still using <code>gp2<\/code>? Why? <code>gp3<\/code> is cheaper and faster. Change the <code>type<\/code> in your Terraform and move on.<\/li>\n<li><strong>NAT Gateway Costs:<\/strong> Check your Cost Explorer. Group by &#8220;Usage Type.&#8221; If &#8220;NatGateway-Bytes&#8221; or &#8220;DataTransfer-Regional-Bytes&#8221; is in your top 5, your architecture is inefficient and you\u2019re wasting the company\u2019s money.<\/li>\n<li><strong>Anti-Affinity:<\/strong> Are your critical pods spread across nodes and AZs? 
If <code>kubectl get pods -o wide<\/code> shows all your replicas on the same node, you don&#8217;t have HA; you have a suicide pact.<\/li>\n<li><strong>Logging:<\/strong> Are you sending logs to CloudWatch via the public internet? Use an Interface Endpoint for CloudWatch Logs (<code>com.amazonaws.us-east-1.logs<\/code>). <\/li>\n<\/ol>\n<p>The &#8220;aws best&#8221; way to build things isn&#8217;t the way that looks good in a marketing brochure. It\u2019s the way that survives when the us-east-1 API starts throwing 500 errors and the network latency spikes to 500ms. It\u2019s ugly, it\u2019s verbose, and it requires you to actually understand the underlying infrastructure. <\/p>\n<p>Now, if you\u2019ll excuse me, I have to go explain to the CTO why we spent $50k on &#8220;data transfer&#8221; because Kevin didn&#8217;t know what a VPC Endpoint was. <\/p>\n<p><strong>Post-Mortem Status: Closed.<\/strong><br \/>\n<strong>System Health: Fragile.<\/strong><br \/>\n<strong>SRE Mood: Resigned.<\/strong><\/p>\n","protected":false},"excerpt":{"rendered":"<p>text $ kubectl get pods -n prod-core NAME READY STATUS RESTARTS AGE api-gateway-7f5d8b9-x2k9l 0\/1 CrashLoopBackOff 42 14m order-processor-0 0\/1 Pending 0 72h order-processor-1 0\/1 Pending 0 
72h payment-service-5d4f9c8-m1n2b 0\/1 CreateContainerConfigError 0 12m $ kubectl describe pod order-processor-0 Events: Type Reason Age From Message &#8212;- &#8212;&#8212; &#8212;- &#8212;- &#8212;&#8212;- Warning FailedScheduling 3m (x450 over 72h) default-scheduler &#8230; <a title=\"AWS Best Practices: The Ultimate Guide to Cloud Success\" class=\"read-more\" href=\"https:\/\/itsupportwale.com\/blog\/aws-best-practices-the-ultimate-guide-to-cloud-success\/\" aria-label=\"Read more  on AWS Best Practices: The Ultimate Guide to Cloud Success\">Read more<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-4496","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.0 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>AWS Best Practices: The Ultimate Guide to Cloud Success - ITSupportWale<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/itsupportwale.com\/blog\/aws-best-practices-the-ultimate-guide-to-cloud-success\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"AWS Best Practices: The Ultimate Guide to Cloud Success - ITSupportWale\" \/>\n<meta property=\"og:description\" content=\"text $ kubectl get pods -n prod-core NAME READY STATUS RESTARTS AGE api-gateway-7f5d8b9-x2k9l 0\/1 CrashLoopBackOff 42 14m order-processor-0 0\/1 Pending 0 72h order-processor-1 0\/1 Pending 0 72h payment-service-5d4f9c8-m1n2b 0\/1 CreateContainerConfigError 0 12m $ kubectl describe pod order-processor-0 Events: Type Reason Age From Message &#8212;- &#8212;&#8212; &#8212;- &#8212;- 
&#8212;&#8212;- Warning FailedScheduling 3m (x450 over 72h) default-scheduler ... Read more\" \/>\n<meta property=\"og:url\" content=\"https:\/\/itsupportwale.com\/blog\/aws-best-practices-the-ultimate-guide-to-cloud-success\/\" \/>\n<meta property=\"og:site_name\" content=\"ITSupportWale\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\" \/>\n<meta property=\"article:published_time\" content=\"2026-02-08T15:35:53+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2021\/05\/android-chrome-512x512-1.png\" \/>\n\t<meta property=\"og:image:width\" content=\"512\" \/>\n\t<meta property=\"og:image:height\" content=\"512\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Techie\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Techie\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"14 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/aws-best-practices-the-ultimate-guide-to-cloud-success\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/aws-best-practices-the-ultimate-guide-to-cloud-success\/\"},\"author\":{\"name\":\"Techie\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\"},\"headline\":\"AWS Best Practices: The Ultimate Guide to Cloud Success\",\"datePublished\":\"2026-02-08T15:35:53+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/aws-best-practices-the-ultimate-guide-to-cloud-success\/\"},\"wordCount\":1910,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/aws-best-practices-the-ultimate-guide-to-cloud-success\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/aws-best-practices-the-ultimate-guide-to-cloud-success\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/aws-best-practices-the-ultimate-guide-to-cloud-success\/\",\"name\":\"AWS Best Practices: The Ultimate Guide to Cloud Success - 
ITSupportWale\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\"},\"datePublished\":\"2026-02-08T15:35:53+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/aws-best-practices-the-ultimate-guide-to-cloud-success\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/aws-best-practices-the-ultimate-guide-to-cloud-success\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/aws-best-practices-the-ultimate-guide-to-cloud-success\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/itsupportwale.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"AWS Best Practices: The Ultimate Guide to Cloud Success\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"name\":\"ITSupportWale\",\"description\":\"Tips, Tricks, Fixed-Errors, Tutorials &amp; 
Guides\",\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/itsupportwale.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\",\"name\":\"itsupportwale\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png\",\"contentUrl\":\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png\",\"width\":1119,\"height\":144,\"caption\":\"itsupportwale\"},\"image\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\",\"name\":\"Techie\",\"sameAs\":[\"https:\/\/itsupportwale.com\",\"iswblogadmin\"],\"url\":\"https:\/\/itsupportwale.com\/blog\/author\/iswblogadmin\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"AWS Best Practices: The Ultimate Guide to Cloud Success - ITSupportWale","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/itsupportwale.com\/blog\/aws-best-practices-the-ultimate-guide-to-cloud-success\/","og_locale":"en_US","og_type":"article","og_title":"AWS Best Practices: The Ultimate Guide to Cloud Success - ITSupportWale","og_description":"text $ kubectl get pods -n prod-core NAME READY STATUS RESTARTS AGE api-gateway-7f5d8b9-x2k9l 0\/1 CrashLoopBackOff 42 14m order-processor-0 0\/1 Pending 0 72h order-processor-1 0\/1 Pending 0 72h payment-service-5d4f9c8-m1n2b 0\/1 CreateContainerConfigError 0 12m $ kubectl describe pod order-processor-0 Events: Type Reason Age From Message &#8212;- &#8212;&#8212; &#8212;- &#8212;- &#8212;&#8212;- Warning FailedScheduling 3m (x450 over 72h) default-scheduler ... Read more","og_url":"https:\/\/itsupportwale.com\/blog\/aws-best-practices-the-ultimate-guide-to-cloud-success\/","og_site_name":"ITSupportWale","article_publisher":"https:\/\/www.facebook.com\/Itsupportwale-298547177495978","article_published_time":"2026-02-08T15:35:53+00:00","og_image":[{"width":512,"height":512,"url":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2021\/05\/android-chrome-512x512-1.png","type":"image\/png"}],"author":"Techie","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Techie","Est. 
reading time":"14 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/itsupportwale.com\/blog\/aws-best-practices-the-ultimate-guide-to-cloud-success\/#article","isPartOf":{"@id":"https:\/\/itsupportwale.com\/blog\/aws-best-practices-the-ultimate-guide-to-cloud-success\/"},"author":{"name":"Techie","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d"},"headline":"AWS Best Practices: The Ultimate Guide to Cloud Success","datePublished":"2026-02-08T15:35:53+00:00","mainEntityOfPage":{"@id":"https:\/\/itsupportwale.com\/blog\/aws-best-practices-the-ultimate-guide-to-cloud-success\/"},"wordCount":1910,"commentCount":0,"publisher":{"@id":"https:\/\/itsupportwale.com\/blog\/#organization"},"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/itsupportwale.com\/blog\/aws-best-practices-the-ultimate-guide-to-cloud-success\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/itsupportwale.com\/blog\/aws-best-practices-the-ultimate-guide-to-cloud-success\/","url":"https:\/\/itsupportwale.com\/blog\/aws-best-practices-the-ultimate-guide-to-cloud-success\/","name":"AWS Best Practices: The Ultimate Guide to Cloud Success - ITSupportWale","isPartOf":{"@id":"https:\/\/itsupportwale.com\/blog\/#website"},"datePublished":"2026-02-08T15:35:53+00:00","breadcrumb":{"@id":"https:\/\/itsupportwale.com\/blog\/aws-best-practices-the-ultimate-guide-to-cloud-success\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/itsupportwale.com\/blog\/aws-best-practices-the-ultimate-guide-to-cloud-success\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/itsupportwale.com\/blog\/aws-best-practices-the-ultimate-guide-to-cloud-success\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/itsupportwale.com\/blog\/"},{"@type":"ListItem","position":2,"name":"AWS Best Practices: The 
Ultimate Guide to Cloud Success"}]},{"@type":"WebSite","@id":"https:\/\/itsupportwale.com\/blog\/#website","url":"https:\/\/itsupportwale.com\/blog\/","name":"ITSupportWale","description":"Tips, Tricks, Fixed-Errors, Tutorials &amp; Guides","publisher":{"@id":"https:\/\/itsupportwale.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/itsupportwale.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/itsupportwale.com\/blog\/#organization","name":"itsupportwale","url":"https:\/\/itsupportwale.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png","contentUrl":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png","width":1119,"height":144,"caption":"itsupportwale"},"image":{"@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/Itsupportwale-298547177495978"]},{"@type":"Person","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d","name":"Techie","sameAs":["https:\/\/itsupportwale.com","iswblogadmin"],"url":"https:\/\/itsupportwale.com\/blog\/author\/iswblogadmin\/"}]}},"_links":{"self":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts\/4496","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/itsuppo
rtwale.com\/blog\/wp-json\/wp\/v2\/comments?post=4496"}],"version-history":[{"count":0,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts\/4496\/revisions"}],"wp:attachment":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/media?parent=4496"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/categories?post=4496"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/tags?post=4496"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}