{"id":4773,"date":"2026-04-27T22:20:03","date_gmt":"2026-04-27T16:50:03","guid":{"rendered":"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/"},"modified":"2026-04-27T22:20:03","modified_gmt":"2026-04-27T16:50:03","slug":"10-essential-aws-best-practices-for-cloud-optimization-2","status":"publish","type":"post","link":"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/","title":{"rendered":"10 Essential AWS Best Practices for Cloud Optimization"},"content":{"rendered":"<p><strong>INCIDENT SUMMARY<\/strong><\/p>\n<table>\n<thead>\n<tr>\n<th style=\"text-align: left;\">Attribute<\/th>\n<th style=\"text-align: left;\">Details<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"text-align: left;\"><strong>Incident ID<\/strong><\/td>\n<td style=\"text-align: left;\">BKR-2024-09-12-CRITICAL<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left;\"><strong>Severity<\/strong><\/td>\n<td style=\"text-align: left;\">Level 0 (Existential Threat)<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left;\"><strong>Status<\/strong><\/td>\n<td style=\"text-align: left;\">Resolved (Post-Mortem Stage)<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left;\"><strong>Duration<\/strong><\/td>\n<td style=\"text-align: left;\">74 Hours, 12 Minutes<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left;\"><strong>Impact<\/strong><\/td>\n<td style=\"text-align: left;\">$412,000 in unplanned AWS spend; 99.9% API latency increase; Total CI\/CD paralysis.<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left;\"><strong>Primary Root Cause<\/strong><\/td>\n<td style=\"text-align: left;\">Failure to implement <strong>aws best<\/strong> practices regarding VPC Endpoints, IAM scoping, and Terraform state management.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<hr \/>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_80 counter-hierarchy ez-toc-counter ez-toc-grey ez-toc-container-direction\">\n<p class=\"ez-toc-title\" 
style=\"cursor:inherit\">Table of Contents<\/p>\n<label for=\"ez-toc-cssicon-toggle-item-69efc061e58a3\" class=\"ez-toc-cssicon-toggle-label\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #999;color:#999\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewBox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #999;color:#999\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewBox=\"0 0 24 24\" version=\"1.2\" baseProfile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/label><input type=\"checkbox\"  id=\"ez-toc-cssicon-toggle-item-69efc061e58a3\"  aria-label=\"Toggle\" \/><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/#TIMELINE_OF_THE_COLLAPSE\" >TIMELINE OF THE COLLAPSE<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/#The_IAM_Policy_That_Ate_Our_Budget\" >The IAM Policy That Ate Our Budget<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" 
href=\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/#The_Networking_Nightmare_Transit_Gateway_and_CIDR_Collisions\" >The Networking Nightmare: Transit Gateway and CIDR Collisions<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/#The_Cost_of_Ignorance_gp2_Throttling_and_IOPS_Debt\" >The Cost of Ignorance: gp2 Throttling and IOPS Debt<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/#The_State_File_Disaster_Terraform_in_the_Dark\" >The State File Disaster: Terraform in the Dark<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/#The_S3_Leak_That_Wasnt_a_Leak%E2%80%94Until_It_Was\" >The S3 Leak That Wasn&#8217;t a Leak\u2014Until It Was<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/#The_Silent_Killer_CloudWatch_Logs_and_Retention_Policies\" >The Silent Killer: CloudWatch Logs and Retention Policies<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/#Conclusion_The_Price_of_%E2%80%9CMoving_Fast%E2%80%9D\" >Conclusion: The Price of &#8220;Moving Fast&#8221;<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/#Related_Articles\" 
>Related Articles<\/a><\/li><\/ul><\/nav><\/div>\n<h3><span class=\"ez-toc-section\" id=\"TIMELINE_OF_THE_COLLAPSE\"><\/span>TIMELINE OF THE COLLAPSE<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<ul>\n<li><strong>2024-09-10 09:00 EST<\/strong>: Migration of the &#8220;Legacy-Core&#8221; monolith to the new <code>prod-v2<\/code> VPC begins. The engineering lead decides to &#8220;keep it simple&#8221; by using a single Transit Gateway for all inter-region traffic.<\/li>\n<li><strong>2024-09-10 14:30 EST<\/strong>: Terraform apply (v1.7.0) finishes. The state file is stored in an S3 bucket named <code>company-tf-state-prod<\/code>. Versioning is disabled because &#8220;we wanted to save on storage costs.&#8221;<\/li>\n<li><strong>2024-09-11 02:15 EST<\/strong>: PagerDuty triggers. The <code>OrderProcessor<\/code> service is timing out. Latency has spiked from 45ms to 12,000ms.<\/li>\n<li><strong>2024-09-11 04:00 EST<\/strong>: I am woken up. I find the NAT Gateway in <code>us-east-1a<\/code> is processing 4.2 TB of data per hour.<\/li>\n<li><strong>2024-09-11 08:00 EST<\/strong>: CFO sends an urgent Slack message. The AWS Cost Explorer &#8220;Daily Spend&#8221; view shows a vertical line. We are burning $15,000 an hour.<\/li>\n<li><strong>2024-09-11 11:00 EST<\/strong>: A junior engineer attempts to &#8220;fix&#8221; the routing table and accidentally deletes the Terraform state file from S3 using an over-privileged IAM role.<\/li>\n<li><strong>2024-09-12 15:00 EST<\/strong>: Manual recovery of 450 resources begins. The &#8220;Cloud-Native&#8221; dream is officially a nightmare.<\/li>\n<\/ul>\n<hr \/>\n<h2><span class=\"ez-toc-section\" id=\"The_IAM_Policy_That_Ate_Our_Budget\"><\/span>The IAM Policy That Ate Our Budget<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The first mistake wasn&#8217;t technical; it was philosophical. The team treated IAM like a nuisance rather than a perimeter. 
They wanted &#8220;velocity,&#8221; which in this industry is usually code for &#8220;I don&#8217;t want to read the documentation.&#8221; <\/p>\n<p>We found a role named <code>FullAccessAppRole<\/code>. It was attached to every EC2 instance in the auto-scaling group. Here is the JSON policy that allowed a compromised container to start spinning up <code>p4d.24xlarge<\/code> instances in regions we don&#8217;t even operate in:<\/p>\n<pre class=\"codehilite\"><code class=\"language-json\">{\n    &quot;Version&quot;: &quot;2012-10-17&quot;,\n    &quot;Statement&quot;: [\n        {\n            &quot;Effect&quot;: &quot;Allow&quot;,\n            &quot;Action&quot;: &quot;*&quot;,\n            &quot;Resource&quot;: &quot;*&quot;\n        },\n        {\n            &quot;Effect&quot;: &quot;Allow&quot;,\n            &quot;Action&quot;: &quot;iam:PassRole&quot;,\n            &quot;Resource&quot;: &quot;arn:aws:iam::123456789012:role\/*&quot;\n        }\n    ]\n}\n<\/code><\/pre>\n<p>The <code>iam:PassRole<\/code> on <code>role\/*<\/code> is the smoking gun, even though the wildcard statement above it already grants the same permission. It allowed any service with this role to pass any other role to a new service. When the <code>OrderProcessor<\/code> was hit with a basic SSRF (Server-Side Request Forgery) attack, the attacker didn&#8217;t just steal data; they used our own infrastructure to mine Monero.<\/p>\n<p>I ran this <code>jq<\/code> filter against our IAM export to see how deep the rot went:<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">aws iam list-policies --scope Local --output json | \\\njq -r '.Policies[] | select(.AttachmentCount &gt; 0) | [.Arn, .DefaultVersionId] | @tsv' | \\\nwhile read -r arn version; do\n    # Check the default (active) version of each policy instead of assuming v1\n    aws iam get-policy-version --policy-arn &quot;$arn&quot; --version-id &quot;$version&quot; --output json | \\\n    jq --arg arn &quot;$arn&quot; '.PolicyVersion.Document.Statement | if type == &quot;array&quot; then .[] else . end | select(.Effect == &quot;Allow&quot; and ([.Action] | flatten | index(&quot;*&quot;)) != null and ([.Resource] | flatten | index(&quot;*&quot;)) != null) | {policy: $arn, statement: .}'\ndone\n<\/code><\/pre>\n<p>The output was a wall of text. 
We had 14 different policies granting administrative access to service accounts that only needed to read from a single S3 bucket. By ignoring AWS best practices for &#8220;least privilege,&#8221; we handed the keys to the kingdom to anyone who could find a bug in our Node.js runtime.<\/p>\n<hr \/>\n<h2><span class=\"ez-toc-section\" id=\"The_Networking_Nightmare_Transit_Gateway_and_CIDR_Collisions\"><\/span>The Networking Nightmare: Transit Gateway and CIDR Collisions<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The &#8220;architects&#8221; decided that VPC Peering was &#8220;too hard to manage.&#8221; They opted for a Transit Gateway (TGW) to connect our legacy VPC, our new VPC, and our on-prem data center. <\/p>\n<p>The problem? They didn&#8217;t plan the CIDR blocks. They used <code>10.0.0.0\/16<\/code> for everything. When you have overlapping CIDRs in a TGW environment, the routing table becomes a game of Russian roulette. <\/p>\n<p>We saw packets destined for the database (<code>10.0.5.22<\/code>) being routed back to the legacy VPN gateway because the TGW route table had a more specific prefix (<code>10.0.5.0\/24<\/code>) pointing to the wrong attachment. <\/p>\n<p>Here is the Terraform snippet that caused the loop:<\/p>\n<pre class=\"codehilite\"><code class=\"language-hcl\">resource &quot;aws_ec2_transit_gateway_route&quot; &quot;loop_of_death&quot; {\n  destination_cidr_block         = &quot;10.0.0.0\/8&quot; # Why? Just why?\n  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.legacy.id\n  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.main.id\n}\n<\/code><\/pre>\n<p>By using a <code>\/8<\/code> summary route, they steered any internal traffic without a more specific match to the legacy attachment, which effectively blackholed it. But the real financial pain came from the NAT Gateway. <\/p>\n<p>The <code>prod-v2<\/code> VPC was configured with all instances in private subnets. Standard stuff. 
However, they forgot to provision an S3 Gateway Endpoint. Every time an instance pulled a 2GB container image from S3, or uploaded a log file, that data traveled out through the NAT Gateway.<\/p>\n<p>AWS charges $0.045 per GB for NAT Gateway data processing in us-east-1. That sounds small until you realize your logging agent is pushing 500GB of debug logs an hour because someone left <code>LOG_LEVEL=trace<\/code> on in production.<\/p>\n<p>I caught this using the following AWS CLI command to inspect the CloudWatch metrics for the NAT Gateway:<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">aws cloudwatch get-metric-statistics \\\n    --namespace AWS\/NATGateway \\\n    --metric-name BytesOutToDestination \\\n    --dimensions Name=NatGatewayId,Value=nat-0a1b2c3d4e5f6g7h8 \\\n    --start-time 2024-09-11T00:00:00Z \\\n    --end-time 2024-09-11T23:59:59Z \\\n    --period 3600 \\\n    --statistics Sum \\\n    --unit Bytes\n<\/code><\/pre>\n<p>The result showed we were processing terabytes of data every hour that should have stayed on the AWS internal network via a free S3 gateway endpoint. We were paying $45.00 per TB for the privilege of moving data three feet across the data center floor.<\/p>\n<hr \/>\n<h2><span class=\"ez-toc-section\" id=\"The_Cost_of_Ignorance_gp2_Throttling_and_IOPS_Debt\"><\/span>The Cost of Ignorance: gp2 Throttling and IOPS Debt<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>While the NAT Gateway was bleeding us dry, the database was dying a slow death. The team had provisioned 1TB <code>gp2<\/code> volumes for the Postgres nodes. They thought &#8220;1TB is plenty of space.&#8221; <\/p>\n<p>They didn&#8217;t understand the <code>gp2<\/code> burst bucket model. On <code>gp2<\/code>, you get 3 IOPS per GiB. A 1TB volume gives you a baseline of 3,000 IOPS. If you need more, you consume &#8220;burst credits.&#8221; Once those credits are gone, you are hard-capped at 3,000 IOPS.<\/p>\n<p>Our database was hitting 12,000 IOPS during the morning peak. 
For the first two hours, it was fine. Then, the burst bucket hit zero. Latency went from 2ms to 200ms instantly. The application servers, waiting for the DB, started stacking threads. The health checks failed. The Auto Scaling Group (ASG) thought the instances were dead and terminated them. <\/p>\n<p>The new instances started up, tried to pull the massive container images through the NAT Gateway (adding to the cost), and then immediately hit the same IOPS-throttled database. It was a circular dependency of failure.<\/p>\n<p>We should have used <code>gp3<\/code>. With <code>gp3<\/code>, you get 3,000 IOPS baseline regardless of volume size, and you can scale IOPS and throughput independently. <\/p>\n<p>I ran this command against each volume ID in the account to find the throttled ones:<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\"># -v-1H is BSD\/macOS date syntax; on GNU\/Linux use: date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ\naws cloudwatch get-metric-data \\\n    --metric-data-queries '[{&quot;Id&quot;:&quot;m1&quot;,&quot;MetricStat&quot;:{&quot;Metric&quot;:{&quot;Namespace&quot;:&quot;AWS\/EBS&quot;,&quot;MetricName&quot;:&quot;BurstBalance&quot;,&quot;Dimensions&quot;:[{&quot;Name&quot;:&quot;VolumeId&quot;,&quot;Value&quot;:&quot;vol-0987654321fedcba&quot;}]},&quot;Period&quot;:300,&quot;Stat&quot;:&quot;Average&quot;}}]' \\\n    --start-time &quot;$(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ)&quot; \\\n    --end-time &quot;$(date -u +%Y-%m-%dT%H:%M:%SZ)&quot;\n<\/code><\/pre>\n<p>The <code>BurstBalance<\/code> was at 0%. We were running a production database on the performance equivalent of a 5,400 RPM laptop drive from 2004.<\/p>\n<hr \/>\n<h2><span class=\"ez-toc-section\" id=\"The_State_File_Disaster_Terraform_in_the_Dark\"><\/span>The State File Disaster: Terraform in the Dark<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>This is the part that still makes my hands shake. 
Terraform is a powerful tool, but in the hands of someone who treats it like a bash script, it\u2019s a suicide machine.<\/p>\n<p>The team had configured the S3 backend for Terraform like this:<\/p>\n<pre class=\"codehilite\"><code class=\"language-hcl\">terraform {\n  backend &quot;s3&quot; {\n    bucket         = &quot;company-tf-state-prod&quot;\n    key            = &quot;network\/terraform.tfstate&quot;\n    region         = &quot;us-east-1&quot;\n    # dynamodb_table = &quot;terraform-lock&quot; # COMMENTED OUT BECAUSE &quot;IT WAS SLOW&quot;\n  }\n}\n<\/code><\/pre>\n<p>No DynamoDB table for state locking. No S3 bucket versioning. No MFA delete.<\/p>\n<p>On the second day of the outage, two engineers were trying to fix the NAT Gateway issue simultaneously. Engineer A ran a <code>terraform apply<\/code> to add the VPC Endpoint. Engineer B, unaware of A\u2019s work, was manually editing the state file because of a &#8220;provider drift&#8221; issue. <\/p>\n<p>Engineer B\u2019s manual upload corrupted the JSON structure of the state file. When Engineer A\u2019s <code>apply<\/code> finished, it attempted to write back to the corrupted file. The S3 object was overwritten with a 0-byte file.<\/p>\n<p>Because versioning was disabled, the state of our entire infrastructure\u2014450 resources, including the RDS instances, the TGW, and the IAM roles\u2014was gone. Terraform now thought the world was empty. <\/p>\n<p>The next time someone ran <code>terraform plan<\/code>, the output was a nightmare:<br \/>\n<code>Plan: 450 to add, 0 to change, 0 to destroy.<\/code><\/p>\n<p>If they had clicked &#8220;yes,&#8221; Terraform would have tried to recreate resources that already existed, failing on &#8220;Name already in use&#8221; errors for the next six hours while the site stayed down.<\/p>\n<p>I had to use the <code>aws resourcegroupstaggingapi<\/code> to try and map existing ARNs back to Terraform resource addresses. 
It was a manual, grueling process of <code>terraform import<\/code> commands that took 14 hours of straight work.<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\"># One of 450 imports. I did this until my eyes bled.\nterraform import aws_instance.web_server i-0123456789abcdef0\n<\/code><\/pre>\n<hr \/>\n<h2><span class=\"ez-toc-section\" id=\"The_S3_Leak_That_Wasnt_a_Leak%E2%80%94Until_It_Was\"><\/span>The S3 Leak That Wasn&#8217;t a Leak\u2014Until It Was<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>In the middle of the recovery, we discovered that the &#8220;fix&#8221; for a permissions error earlier in the month was to make an S3 bucket public. The bucket contained &#8220;static assets.&#8221; <\/p>\n<p>The problem is that &#8220;static assets&#8221; to a junior dev included <code>config.json<\/code> files that contained database credentials and API keys for our payment processor. <\/p>\n<p>They didn&#8217;t use S3 Block Public Access at the account level. They didn&#8217;t use bucket policies to restrict access to the VPC. They just flipped the switch.<\/p>\n<p>I confirmed it by checking the bucket&#8217;s Block Public Access settings:<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">aws s3api get-public-access-block --bucket company-secrets-prod --output json\n<\/code><\/pre>\n<p>The response was:<\/p>\n<pre class=\"codehilite\"><code class=\"language-json\">{\n    &quot;PublicAccessBlockConfiguration&quot;: {\n        &quot;BlockPublicAcls&quot;: false,\n        &quot;IgnorePublicAcls&quot;: false,\n        &quot;BlockPublicPolicy&quot;: false,\n        &quot;RestrictPublicBuckets&quot;: false\n    }\n}\n<\/code><\/pre>\n<p>This is a direct violation of every AWS best practice in the book. We had to rotate every single credential in the company. Every database password, every Stripe key, every SendGrid token. 
The operational overhead of rotating 200+ secrets while the network was already failing is why I have grey hair.<\/p>\n<hr \/>\n<h2><span class=\"ez-toc-section\" id=\"The_Silent_Killer_CloudWatch_Logs_and_Retention_Policies\"><\/span>The Silent Killer: CloudWatch Logs and Retention Policies<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The final insult to our bank account was the CloudWatch Logs bill. When you are in a &#8220;crisis,&#8221; everyone turns on &#8220;Debug&#8221; logging. <\/p>\n<p>We had 400 microservices running in EKS. Each one was spitting out 10MB of logs per minute. The team had set the retention policy to &#8220;Never Expire.&#8221; <\/p>\n<p>CloudWatch Logs ingestion costs $0.50 per GB. Storage costs $0.03 per GB-month. By day three, we had ingested 80TB of logs. Most of those logs were just the same &#8220;Connection Timeout&#8221; error repeated billions of times.<\/p>\n<p>I had to write a script to truncate the retention on every log group in the account because doing it through the console was too slow:<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">for group in $(aws logs describe-log-groups --query 'logGroups[*].logGroupName' --output text); do\n    echo &quot;Setting retention for $group&quot;\n    aws logs put-retention-policy --log-group-name &quot;$group&quot; --retention-in-days 7\ndone\n<\/code><\/pre>\n<p>We were paying for the storage of errors that occurred in a version of the software that didn&#8217;t even exist anymore. It was digital hoarding at a corporate scale.<\/p>\n<hr \/>\n<h2><span class=\"ez-toc-section\" id=\"Conclusion_The_Price_of_%E2%80%9CMoving_Fast%E2%80%9D\"><\/span>Conclusion: The Price of &#8220;Moving Fast&#8221;<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>We didn&#8217;t almost bankrupt the company because AWS is expensive. 
We almost bankrupted the company because we treated the cloud like someone else&#8217;s data center where the resources are infinite and the configuration doesn&#8217;t matter.<\/p>\n<p>We ignored AWS best practices for VPC design, opting for the &#8220;easy&#8221; path that led to a $50k NAT Gateway bill. We ignored IAM scoping, leading to a cryptojacking incident that cost us $120k in compute. We ignored Terraform state management, which nearly made our infrastructure unrecoverable.<\/p>\n<p>The &#8220;Cloud-Native&#8221; transition isn&#8217;t about moving your VMs to EC2. It&#8217;s about understanding the underlying mechanics of the platform. If you don&#8217;t understand how IOPS credits work, if you don&#8217;t understand how VPC Endpoints save you money, and if you don&#8217;t understand that IAM is your only real firewall, you aren&#8217;t &#8220;innovating.&#8221; You are just waiting for a PagerDuty alert that you can&#8217;t fix.<\/p>\n<p>I&#8217;m going back to sleep. 
Don&#8217;t touch the <code>terraform.tfstate<\/code> file, or I&#8217;ll revoke your <code>AssumeRole<\/code> permissions before you can finish your next <code>git push<\/code>.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Related_Articles\"><\/span>Related Articles<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Explore more insights and best practices:<\/p>\n<ul>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/how-to-upgrade-to-python-3-13-on-ubuntu-20-04-and-22-04-lts\/\">How To Upgrade To Python 3 13 On Ubuntu 20 04 And 22 04 Lts<\/a><\/li>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/vim-commands\/\">Vim Commands<\/a><\/li>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/mastering-machine-learning-models-types-and-use-cases\/\">Mastering Machine Learning Models Types And Use Cases<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>INCIDENT SUMMARY Attribute Details Incident ID BKR-2024-09-12-CRITICAL Severity Level 0 (Existential Threat) Status Resolved (Post-Mortem Stage) Duration 74 Hours, 12 Minutes Impact $412,000 in unplanned AWS spend; 99.9% API latency increase; Total CI\/CD paralysis. Primary Root Cause Failure to implement aws best practices regarding VPC Endpoints, IAM scoping, and Terraform state management. 
TIMELINE OF THE &#8230; <a title=\"10 Essential AWS Best Practices for Cloud Optimization\" class=\"read-more\" href=\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/\" aria-label=\"Read more  on 10 Essential AWS Best Practices for Cloud Optimization\">Read more<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-4773","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.0 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>10 Essential AWS Best Practices for Cloud Optimization - ITSupportWale<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"10 Essential AWS Best Practices for Cloud Optimization - ITSupportWale\" \/>\n<meta property=\"og:description\" content=\"INCIDENT SUMMARY Attribute Details Incident ID BKR-2024-09-12-CRITICAL Severity Level 0 (Existential Threat) Status Resolved (Post-Mortem Stage) Duration 74 Hours, 12 Minutes Impact $412,000 in unplanned AWS spend; 99.9% API latency increase; Total CI\/CD paralysis. Primary Root Cause Failure to implement aws best practices regarding VPC Endpoints, IAM scoping, and Terraform state management. TIMELINE OF THE ... 
Read more\" \/>\n<meta property=\"og:url\" content=\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/\" \/>\n<meta property=\"og:site_name\" content=\"ITSupportWale\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\" \/>\n<meta property=\"article:published_time\" content=\"2026-04-27T16:50:03+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2021\/05\/android-chrome-512x512-1.png\" \/>\n\t<meta property=\"og:image:width\" content=\"512\" \/>\n\t<meta property=\"og:image:height\" content=\"512\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Techie\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Techie\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"10 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/\"},\"author\":{\"name\":\"Techie\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\"},\"headline\":\"10 Essential AWS Best Practices for Cloud Optimization\",\"datePublished\":\"2026-04-27T16:50:03+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/\"},\"wordCount\":1687,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/\",\"name\":\"10 Essential AWS Best Practices for Cloud Optimization - 
ITSupportWale\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\"},\"datePublished\":\"2026-04-27T16:50:03+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/itsupportwale.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"10 Essential AWS Best Practices for Cloud Optimization\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"name\":\"ITSupportWale\",\"description\":\"Tips, Tricks, Fixed-Errors, Tutorials &amp; 
Guides\",\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/itsupportwale.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\",\"name\":\"itsupportwale\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png\",\"contentUrl\":\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png\",\"width\":1119,\"height\":144,\"caption\":\"itsupportwale\"},\"image\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\",\"name\":\"Techie\",\"sameAs\":[\"https:\/\/itsupportwale.com\",\"iswblogadmin\"],\"url\":\"https:\/\/itsupportwale.com\/blog\/author\/iswblogadmin\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"10 Essential AWS Best Practices for Cloud Optimization - ITSupportWale","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/","og_locale":"en_US","og_type":"article","og_title":"10 Essential AWS Best Practices for Cloud Optimization - ITSupportWale","og_description":"INCIDENT SUMMARY Attribute Details Incident ID BKR-2024-09-12-CRITICAL Severity Level 0 (Existential Threat) Status Resolved (Post-Mortem Stage) Duration 74 Hours, 12 Minutes Impact $412,000 in unplanned AWS spend; 99.9% API latency increase; Total CI\/CD paralysis. Primary Root Cause Failure to implement aws best practices regarding VPC Endpoints, IAM scoping, and Terraform state management. TIMELINE OF THE ... Read more","og_url":"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/","og_site_name":"ITSupportWale","article_publisher":"https:\/\/www.facebook.com\/Itsupportwale-298547177495978","article_published_time":"2026-04-27T16:50:03+00:00","og_image":[{"width":512,"height":512,"url":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2021\/05\/android-chrome-512x512-1.png","type":"image\/png"}],"author":"Techie","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Techie","Est. 
reading time":"10 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/#article","isPartOf":{"@id":"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/"},"author":{"name":"Techie","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d"},"headline":"10 Essential AWS Best Practices for Cloud Optimization","datePublished":"2026-04-27T16:50:03+00:00","mainEntityOfPage":{"@id":"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/"},"wordCount":1687,"commentCount":0,"publisher":{"@id":"https:\/\/itsupportwale.com\/blog\/#organization"},"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/","url":"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/","name":"10 Essential AWS Best Practices for Cloud Optimization - ITSupportWale","isPartOf":{"@id":"https:\/\/itsupportwale.com\/blog\/#website"},"datePublished":"2026-04-27T16:50:03+00:00","breadcrumb":{"@id":"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization-2\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/itsupportwale.com\/blog\/"},{"@type":"ListItem","position":2,"name":"10 
Essential AWS Best Practices for Cloud Optimization"}]},{"@type":"WebSite","@id":"https:\/\/itsupportwale.com\/blog\/#website","url":"https:\/\/itsupportwale.com\/blog\/","name":"ITSupportWale","description":"Tips, Tricks, Fixed-Errors, Tutorials &amp; Guides","publisher":{"@id":"https:\/\/itsupportwale.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/itsupportwale.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/itsupportwale.com\/blog\/#organization","name":"itsupportwale","url":"https:\/\/itsupportwale.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png","contentUrl":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png","width":1119,"height":144,"caption":"itsupportwale"},"image":{"@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/Itsupportwale-298547177495978"]},{"@type":"Person","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d","name":"Techie","sameAs":["https:\/\/itsupportwale.com","iswblogadmin"],"url":"https:\/\/itsupportwale.com\/blog\/author\/iswblogadmin\/"}]}},"_links":{"self":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts\/4773","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href
":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/comments?post=4773"}],"version-history":[{"count":0,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts\/4773\/revisions"}],"wp:attachment":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/media?parent=4773"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/categories?post=4773"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/tags?post=4773"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}