{"id":4489,"date":"2026-02-02T21:15:19","date_gmt":"2026-02-02T15:45:19","guid":{"rendered":"https:\/\/www.itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization\/"},"modified":"2026-02-17T15:55:43","modified_gmt":"2026-02-17T10:25:43","slug":"10-essential-aws-best-practices-for-cloud-optimization","status":"publish","type":"post","link":"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization\/","title":{"rendered":"10 Essential AWS Best Practices for Cloud Optimization"},"content":{"rendered":"<p><strong>INTERNAL POST-MORTEM: INCIDENT #8842-BRAVO<\/strong><br \/>\n<strong>DATE:<\/strong> Monday, October 16, 2023<br \/>\n<strong>DURATION:<\/strong> 72 Hours, 14 Minutes<br \/>\n<strong>TOTAL ESTIMATED LOSS:<\/strong> $85,422.19 (Infrastructure Egress + Compute Over-provisioning + Lost Revenue)<br \/>\n<strong>STATUS:<\/strong> SEV-0 (Mitigated, Not Resolved)<br \/>\n<strong>AUTHOR:<\/strong> Senior SRE (Platform Reliability Team)<\/p>\n<hr \/>\n<h3><span class=\"ez-toc-section\" id=\"1_THE_INCIDENT_SUMMARY\"><\/span>1. THE INCIDENT SUMMARY<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p><strong>TIMESTAMP: 2023-10-13 23:43:12 UTC<\/strong><br \/>\nThe pager went off. I was three sips into a beer. By 23:45, the AWS Billing Alert hit my inbox. 
<\/p>\n<p><strong>SUBJECT: [URGENT] AWS Billing Alert: Estimated charges for the current month have exceeded your threshold of $10,000.00.<\/strong><\/p>\n<p>The &#8220;Current Estimate&#8221; wasn&#8217;t $10,001. It was <strong>$42,908.12<\/strong>. By the time I logged into the Billing Dashboard, it was $44,200. We were burning $1,200 an hour on a Friday night. This wasn&#8217;t a traffic spike. This wasn&#8217;t a successful marketing campaign. This was a catastrophic failure of engineering discipline. <\/p>\n<p>The dashboard showed a vertical line in Data Transfer costs. Someone had bypassed the staging environment and pushed a &#8220;hotfix&#8221; to the production VPC that turned our internal data sync into a global egress nightmare. We spent the next 72 hours clawing back our infrastructure from the brink of bankruptcy.<\/p>\n<hr \/>\n<h3><span class=\"ez-toc-section\" id=\"2_THE_IAM_POLICY_THAT_LEAKED_THE_KEYS_TO_THE_KINGDOM\"><\/span>2. THE IAM POLICY THAT LEAKED THE KEYS TO THE KINGDOM<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>The first point of failure wasn&#8217;t the code; it was the permissions. A junior developer, frustrated by &#8220;access denied&#8221; errors while trying to debug a Lambda function, decided to &#8220;simplify&#8221; the IAM policy. They attached a policy that effectively turned our production environment into an open playground.<\/p>\n<p>Following <strong>AWS best practices<\/strong> isn&#8217;t a suggestion; it&#8217;s a survival tactic, and it was ignored here. 
We found a policy attached to a <code>dev-temp-role<\/code> that looked like this:<\/p>\n<pre class=\"codehilite\"><code class=\"language-json\">{\n    &quot;Version&quot;: &quot;2012-10-17&quot;,\n    &quot;Statement&quot;: [\n        {\n            &quot;Effect&quot;: &quot;Allow&quot;,\n            &quot;Action&quot;: &quot;*&quot;,\n            &quot;Resource&quot;: &quot;*&quot;\n        }\n    ]\n}\n<\/code><\/pre>\n<p>This is the &#8220;God Mode&#8221; policy. It was used to &#8220;test&#8221; a script that was supposed to move logs to S3. Instead, the script had a logic loop. Because the wildcard granted every action, including <code>iam:CreateUser<\/code>, <code>iam:AttachUserPolicy<\/code>, and <code>ec2:RunInstances<\/code>, a compromised set of temporary credentials allowed an automated bot to spin up 50 <code>p3.16xlarge<\/code> instances in <code>us-east-1<\/code>, <code>us-west-2<\/code>, and <code>eu-central-1<\/code> for GPU mining.<\/p>\n<p><strong>THE FIX:<\/strong><br \/>\nI had to run a scorched-earth script to identify every principal with <code>AdministratorAccess<\/code> that wasn&#8217;t the break-glass account. <\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\"># Identifying the carnage: who has AdministratorAccess attached?\naws iam list-entities-for-policy --policy-arn arn:aws:iam::aws:policy\/AdministratorAccess\naws iam list-attached-user-policies --user-name &lt;redacted&gt;\n# Roles whose trust policy lets any AWS principal assume them\naws iam list-roles --query 'Roles[?contains(AssumeRolePolicyDocument.Statement[].Principal.AWS, `&quot;*&quot;`)].RoleName'\n<\/code><\/pre>\n<p>We revoked the sessions and implemented a Service Control Policy (SCP) at the Organization level to deny the ability to create <code>p3<\/code> or <code>g4<\/code> instances in any region except our primary. If you want to use a $24-an-hour instance, you now have to justify it to me in person.<\/p>\n<p>The new policy for the Lambda function now follows the Principle of Least Privilege. It targets specific ARNs. No wildcard actions. 
No shortcuts.<\/p>\n<pre class=\"codehilite\"><code class=\"language-json\">{\n    &quot;Version&quot;: &quot;2012-10-17&quot;,\n    &quot;Statement&quot;: [\n        {\n            &quot;Effect&quot;: &quot;Allow&quot;,\n            &quot;Action&quot;: [\n                &quot;s3:PutObject&quot;,\n                &quot;s3:PutObjectAcl&quot;\n            ],\n            &quot;Resource&quot;: &quot;arn:aws:s3:::prod-logs-app-01\/*&quot;,\n            &quot;Condition&quot;: {\n                &quot;StringEquals&quot;: {\n                    &quot;aws:SourceVpc&quot;: &quot;vpc-0a1b2c3d4e5f6a7b8&quot;\n                }\n            }\n        }\n    ]\n}\n<\/code><\/pre>\n<hr \/>\n<h3><span class=\"ez-toc-section\" id=\"3_WHY_OUR_VPC_PEERING_WAS_A_LATENCY_NIGHTMARE\"><\/span>3. WHY OUR VPC PEERING WAS A LATENCY NIGHTMARE<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>While the crypto-miners were eating our compute budget, our core application was dying because of a &#8220;flat network&#8221; design. Someone thought it would be &#8220;easier&#8221; to peer every VPC together in a full mesh. <\/p>\n<p>We had <code>10.0.0.0\/16<\/code> in Production and <code>10.0.0.0\/16<\/code> in Staging. Yes, you read that right. Overlapping CIDR blocks, which VPC peering cannot route between. To &#8220;fix&#8221; this, a previous engineer had set up a complex series of NAT Gateways and secondary IP ranges that created a routing loop.<\/p>\n<p>When the mining instances started saturating the NAT Gateway, the connection tracking table (conntrack) hit its limit. A NAT Gateway supports up to 55,000 simultaneous connections to each unique destination. We hit that in four minutes.<\/p>\n<p><strong>THE FIX:<\/strong><br \/>\nWe had to tear down the peering and move to a Transit Gateway architecture. But first, I had to identify where the traffic was actually going. 
I ran this to check the NAT Gateway metrics:<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">aws cloudwatch get-metric-statistics \\\n    --namespace AWS\/NATGateway \\\n    --metric-name ErrorPortAllocation \\\n    --dimensions Name=NatGatewayId,Value=nat-0123456789abcdef0 \\\n    --start-time 2023-10-14T00:00:00Z \\\n    --end-time 2023-10-14T01:00:00Z \\\n    --period 60 \\\n    --statistics Sum\n<\/code><\/pre>\n<p>The output was a wall of non-zero integers, each one a failed source-port allocation. We were dropping packets because the NAT Gateway was choked. We migrated the critical services to use VPC Endpoints (Gateway Endpoints) for S3 and DynamoDB. This kept the traffic on the AWS private network and off the NAT Gateway, cutting our NAT Gateway bill by 80% instantly.<\/p>\n<p>We also re-addressed the subnets. No more <code>\/16<\/code> for everything. We moved to a structured tier:<br \/>\n&#8211; <strong>Public Subnets:<\/strong> <code>\/24<\/code> (Load Balancers, Bastions)<br \/>\n&#8211; <strong>Private App Subnets:<\/strong> <code>\/22<\/code> (EC2 Fleet, EKS Nodes)<br \/>\n&#8211; <strong>Data Subnets:<\/strong> <code>\/24<\/code> (RDS, ElastiCache)<\/p>\n<hr \/>\n<h3><span class=\"ez-toc-section\" id=\"4_S3_BUCKET_NEGLIGENCE_PUBLIC_IS_NOT_A_PERMISSION\"><\/span>4. S3 BUCKET NEGLIGENCE: PUBLIC IS NOT A PERMISSION<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>At 03:00 Saturday, I discovered why the egress bill was still climbing even after I killed the mining instances. An S3 bucket named <code>company-assets-backup<\/code> had been set to public. <\/p>\n<p>Why? Because a frontend dev couldn&#8217;t get the CORS policy right for a staging site and just hit the &#8220;Make Public&#8221; button and checked &#8220;I understand the risks.&#8221; They didn&#8217;t understand the risks. <\/p>\n<p>A crawler found the bucket. The bucket contained 4TB of uncompressed database snapshots (another failure for the Storage section). 
The crawler started downloading the entire bucket from a GCP region. We were paying $0.09 per GB for someone to steal our data.<\/p>\n<p><strong>THE FIX:<\/strong><br \/>\nI didn&#8217;t just fix the bucket; I locked the entire account.<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\"># The &quot;I'm done with this&quot; command: first the bucket...\naws s3api put-public-access-block \\\n    --bucket company-assets-backup \\\n    --public-access-block-configuration &quot;BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true&quot;\n# ...then the same block at the account level\naws s3control put-public-access-block \\\n    --account-id &lt;redacted&gt; \\\n    --public-access-block-configuration &quot;BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true&quot;\n<\/code><\/pre>\n<p>Then I audited every bucket in the account:<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">for bucket in $(aws s3api list-buckets --query &quot;Buckets[].Name&quot; --output text); do\n    echo &quot;Checking $bucket&quot;\n    # The call fails when the bucket has no public access block configured\n    aws s3api get-public-access-block --bucket &quot;$bucket&quot; || echo &quot;$bucket HAS NO PUBLIC ACCESS BLOCK&quot;\ndone\n<\/code><\/pre>\n<p>We implemented S3 Object Lock and moved all &#8220;backup&#8221; data to S3 Glacier Deep Archive with a lifecycle policy. Storing 4TB of &#8220;backups&#8221; in S3 Standard is an expensive way to prove you don&#8217;t know how to use cold storage.<\/p>\n<hr \/>\n<h3><span class=\"ez-toc-section\" id=\"5_THE_DATABASE_DEADLOCK_THAT_COST_US_A_QUARTERS_PROFIT\"><\/span>5. THE DATABASE DEADLOCK THAT COST US A QUARTER&#8217;S PROFIT<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>By Saturday afternoon, the app was back up, but the RDS instance (a <code>db.m5.4xlarge<\/code>) was pinned at 99% CPU. The &#8220;architectural shortcut&#8221; here was a lack of Read Replicas. The application was hitting the primary writer for every single analytics query.<\/p>\n<p>Worse, the storage was configured as <code>gp2<\/code>. For those who don&#8217;t spend their lives in the console, <code>gp2<\/code> uses a burst credit system for IOPS. Once you run out of credits, your disk performance drops to the baseline. 
For a 100GB volume, that baseline is 300 IOPS (3 IOPS per GB). Our app needs 5,000. <\/p>\n<p>The database was in an I\/O wait death spiral.<\/p>\n<p><strong>THE FIX:<\/strong><br \/>\nWe performed a zero-downtime migration to <code>gp3<\/code>, which decouples IOPS from volume size, scaled the storage to 1TB, and provisioned 12,000 IOPS. <\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">aws rds modify-db-instance \\\n    --db-instance-identifier prod-db-master \\\n    --allocated-storage 1000 \\\n    --storage-type gp3 \\\n    --iops 12000 \\\n    --apply-immediately\n<\/code><\/pre>\n<p>But that wasn&#8217;t enough. I had to kill the long-running queries that were locking the tables. I logged into the instance and saw the horror:<\/p>\n<pre class=\"codehilite\"><code class=\"language-sql\">SELECT * FROM orders JOIN users JOIN tracking_events WHERE orders.created_at &gt; '2023-01-01';\n<\/code><\/pre>\n<p>No indexes. No join conditions. Full table scans on three joined tables. I added a Read Replica (<code>db.r5.large<\/code>) and forced the analytics engine to point there. <\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">aws rds create-db-instance-read-replica \\\n    --db-instance-identifier prod-db-replica-01 \\\n    --source-db-instance-identifier prod-db-master \\\n    --db-instance-class db.r5.large\n<\/code><\/pre>\n<hr \/>\n<h3><span class=\"ez-toc-section\" id=\"6_CLOUDWATCH_LOGS_THE_4_AM_GHOST_IN_THE_MACHINE\"><\/span>6. CLOUDWATCH LOGS: THE 4 AM GHOST IN THE MACHINE<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Sunday morning, 04:00. The billing dashboard showed a new spike: CloudWatch Logs. <\/p>\n<p>In the panic of Friday night, someone had enabled <code>DEBUG<\/code> logging across the entire EKS cluster to &#8220;see what was happening.&#8221; We were ingesting 500GB of logs per hour. CloudWatch ingestion costs $0.50 per GB. Do the math. 
That&#8217;s $250 an hour just to watch the cluster breathe.<\/p>\n<p>Most of these logs were &#8220;Connection Refused&#8221; errors from the aforementioned NAT Gateway failure, repeating 100 times a second.<\/p>\n<p><strong>THE FIX:<\/strong><br \/>\nI had to bulk-update the retention policies and the log levels. We were keeping logs &#8220;Forever.&#8221; Why? Because &#8220;storage is cheap.&#8221; No, it isn&#8217;t.<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\"># Finding the offenders\naws logs describe-log-groups --query 'logGroups[?storedBytes &gt; `1000000000`].[logGroupName, storedBytes]'\n\n# Setting a 7-day retention because we aren't a library\nfor group in $(aws logs describe-log-groups --query 'logGroups[].logGroupName' --output text); do\n    aws logs put-retention-policy --log-group-name $group --retention-in-days 7\ndone\n<\/code><\/pre>\n<p>We also implemented a FluentBit filter to drop any log entry that didn&#8217;t have a <code>level<\/code> of <code>ERROR<\/code> or <code>CRITICAL<\/code> in production. If you want to <code>DEBUG<\/code>, do it in staging.<\/p>\n<hr \/>\n<h3><span class=\"ez-toc-section\" id=\"7_FINAL_RECKONING_IMPLEMENTING_AWS_BEST_PRACTICES_BEFORE_WE_GO_UNDER\"><\/span>7. FINAL RECKONING: IMPLEMENTING AWS BEST PRACTICES BEFORE WE GO UNDER<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>It is now Monday. I have had four hours of sleep. The $85,000 is gone. It\u2019s a &#8220;learning expense&#8221; now. But if I see another <code>t2.micro<\/code> in a production environment, I\u2019m quitting.<\/p>\n<p>The root cause of this weekend wasn&#8217;t a technical glitch. It was &#8220;drift.&#8221; We allowed the infrastructure to drift away from the code. We allowed manual changes in the console (&#8220;ClickOps&#8221;) instead of enforcing Terraform.<\/p>\n<p><strong>THE NON-NEGOTIABLES:<\/strong><\/p>\n<ol>\n<li><strong>Infrastructure as Code (IaC) is Law:<\/strong> No one touches the console. 
If it\u2019s not in a Terraform file, it doesn&#8217;t exist. We will run <code>terraform plan<\/code> every 60 minutes via a cron job to detect drift. If someone manually changes a Security Group, the next <code>terraform apply<\/code> will revert it.<\/li>\n<li><strong>Egress Monitoring:<\/strong> We are implementing VPC Flow Logs and sending them to a dedicated S3 bucket for analysis with Athena. If egress exceeds $100\/hour, the circuit breaker trips.<\/li>\n<li><strong>Instance Selection:<\/strong> No more burstable instances (<code>t-series<\/code>) for production workloads. We use <code>m5<\/code> or <code>c5<\/code> instances with dedicated resources. We will use <code>gp3<\/code> for all EBS volumes to decouple IOPS from capacity.<\/li>\n<li><strong>IAM Hardening:<\/strong> All developers are losing their direct access to the <code>prod<\/code> account. You will use AWS IAM Identity Center (formerly AWS SSO) to assume short-lived roles.<\/li>\n<li><strong>Tagging Policy:<\/strong> Every resource must have a <code>CostCenter<\/code>, <code>Environment<\/code>, and <code>Owner<\/code> tag. <\/li>\n<\/ol>\n<pre class=\"codehilite\"><code class=\"language-bash\"># The new standard for launching anything (if I ever let you again)\naws ec2 run-instances \\\n    --image-id ami-0abcdef1234567890 \\\n    --instance-type m5.large \\\n    --subnet-id subnet-01234567 \\\n    --tag-specifications 'ResourceType=instance,Tags=[{Key=CostCenter,Value=Engineering},{Key=Environment,Value=Production},{Key=Owner,Value=&lt;redacted&gt;}]' \\\n    --monitoring &quot;Enabled=true&quot;\n<\/code><\/pre>\n<p><strong>THE FINAL WORD:<\/strong><\/p>\n<p>We were lucky. If the botnet had stayed active for the whole weekend, the bill would have been $250,000. We were operating on a &#8220;Trust, but Verify&#8221; model. After this, the &#8220;Trust&#8221; part is gone. <\/p>\n<p>Following <strong>AWS best practices<\/strong> is the only thing standing between us and the total liquidation of this company&#8217;s assets. 
If you think a security group rule with <code>0.0.0.0\/0<\/code> is &#8220;fine for a quick test,&#8221; please hand in your badge.<\/p>\n<p>I\u2019m going home. Do not page me unless the building is literally on fire. Even then, check the CloudWatch logs first to see if the fire is in the budget or the server room.<\/p>\n<p><strong>END OF REPORT<\/strong><\/p>\n<hr \/>\n<h3><span class=\"ez-toc-section\" id=\"APPENDIX_A_THE_%E2%80%9CSHAME%E2%80%9D_LIST_RESOURCES_DELETED\"><\/span>APPENDIX A: THE &#8220;SHAME&#8221; LIST (RESOURCES DELETED)<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<table>\n<thead>\n<tr>\n<th style=\"text-align: left;\">Resource ID<\/th>\n<th style=\"text-align: left;\">Type<\/th>\n<th style=\"text-align: left;\">Reason<\/th>\n<th style=\"text-align: left;\">Cost (72h)<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"text-align: left;\"><code>i-0992837465<\/code><\/td>\n<td style=\"text-align: left;\"><code>p3.16xlarge<\/code><\/td>\n<td style=\"text-align: left;\">Unauthorized Mining<\/td>\n<td style=\"text-align: left;\">$1,728.00<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left;\"><code>nat-012345678<\/code><\/td>\n<td style=\"text-align: left;\"><code>NAT Gateway<\/code><\/td>\n<td style=\"text-align: left;\">Egress Loop<\/td>\n<td style=\"text-align: left;\">$12,400.00<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left;\"><code>vol-088776655<\/code><\/td>\n<td style=\"text-align: left;\"><code>io2<\/code><\/td>\n<td style=\"text-align: left;\">Over-provisioned (100k IOPS)<\/td>\n<td style=\"text-align: left;\">$4,200.00<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left;\"><code>cw-logs-prod<\/code><\/td>\n<td style=\"text-align: left;\"><code>CloudWatch<\/code><\/td>\n<td style=\"text-align: left;\">Debug Log Ingestion<\/td>\n<td style=\"text-align: left;\">$18,900.00<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left;\"><code>s3-egress<\/code><\/td>\n<td style=\"text-align: left;\"><code>Data Transfer<\/code><\/td>\n<td 
style=\"text-align: left;\">Public Bucket Leak<\/td>\n<td style=\"text-align: left;\">$32,000.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<hr \/>\n<h3><span class=\"ez-toc-section\" id=\"APPENDIX_B_THE_RECOVERY_SCRIPT_SNIPPET\"><\/span>APPENDIX B: THE RECOVERY SCRIPT (SNIPPET)<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>This script was used to force-terminate the rogue instances across all regions. It\u2019s blunt, but effective.<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">#!\/bin\/bash\n# Enumerate every region enabled for the account\nREGIONS=$(aws ec2 describe-regions --query &quot;Regions[].RegionName&quot; --output text)\n\nfor region in $REGIONS; do\n    echo &quot;Checking region: $region&quot;\n    # Server-side filters: running p3.16xlarge instances only\n    INSTANCES=$(aws ec2 describe-instances --region &quot;$region&quot; \\\n        --filters &quot;Name=instance-state-name,Values=running&quot; &quot;Name=instance-type,Values=p3.16xlarge&quot; \\\n        --query &quot;Reservations[].Instances[].InstanceId&quot; --output text)\n\n    if [ -n &quot;$INSTANCES&quot; ]; then\n        echo &quot;TERMINATING ROGUE INSTANCES IN $region: $INSTANCES&quot;\n        # $INSTANCES stays unquoted so each ID is passed as its own argument\n        aws ec2 terminate-instances --region &quot;$region&quot; --instance-ids $INSTANCES\n    fi\ndone\n<\/code><\/pre>\n<p>This script is now part of our automated incident response. If an instance type we don&#8217;t use appears in our account, it is terminated within 60 seconds. No questions asked. 
No exceptions.<\/p>\n<p><strong>POST-MORTEM COMPLETE.<\/strong><\/p>\n","protected":false},"excerpt":{"rendered":"<p>INTERNAL POST-MORTEM: INCIDENT #8842-BRAVO DATE: Monday, October 16, 2023 DURATION: 72 Hours, 14 Minutes TOTAL ESTIMATED LOSS: $85,422.19 (Infrastructure Egress + Compute Over-provisioning + Lost Revenue) STATUS: SEV-0 (Mitigated, Not Resolved) AUTHOR: Senior SRE (Platform Reliability Team) 1. THE INCIDENT SUMMARY TIMESTAMP: 2023-10-13 23:43:12 UTC The pager went off. 
I was three sips into a &#8230; <a title=\"10 Essential AWS Best Practices for Cloud Optimization\" class=\"read-more\" href=\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization\/\" aria-label=\"Read more  on 10 Essential AWS Best Practices for Cloud Optimization\">Read more<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-4489","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.0 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>10 Essential AWS Best Practices for Cloud Optimization - ITSupportWale<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"10 Essential AWS Best Practices for Cloud Optimization - ITSupportWale\" \/>\n<meta property=\"og:description\" content=\"INTERNAL POST-MORTEM: INCIDENT #8842-BRAVO DATE: Monday, October 16, 2023 DURATION: 72 Hours, 14 Minutes TOTAL ESTIMATED LOSS: $85,422.19 (Infrastructure Egress + Compute Over-provisioning + Lost Revenue) STATUS: SEV-0 (Mitigated, Not Resolved) AUTHOR: Senior SRE (Platform Reliability Team) 1. THE INCIDENT SUMMARY TIMESTAMP: 2023-10-13 23:43:12 UTC The pager went off. I was three sips into a ... 
Read more\" \/>\n<meta property=\"og:url\" content=\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization\/\" \/>\n<meta property=\"og:site_name\" content=\"ITSupportWale\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\" \/>\n<meta property=\"article:published_time\" content=\"2026-02-02T15:45:19+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2026-02-17T10:25:43+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2021\/05\/android-chrome-512x512-1.png\" \/>\n\t<meta property=\"og:image:width\" content=\"512\" \/>\n\t<meta property=\"og:image:height\" content=\"512\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Techie\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Techie\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"10 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization\/\"},\"author\":{\"name\":\"Techie\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\"},\"headline\":\"10 Essential AWS Best Practices for Cloud Optimization\",\"datePublished\":\"2026-02-02T15:45:19+00:00\",\"dateModified\":\"2026-02-17T10:25:43+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization\/\"},\"wordCount\":1581,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization\/\",\"name\":\"10 Essential AWS Best Practices for Cloud Optimization - 
ITSupportWale\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\"},\"datePublished\":\"2026-02-02T15:45:19+00:00\",\"dateModified\":\"2026-02-17T10:25:43+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/itsupportwale.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"10 Essential AWS Best Practices for Cloud Optimization\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"name\":\"ITSupportWale\",\"description\":\"Tips, Tricks, Fixed-Errors, Tutorials &amp; 
Guides\",\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/itsupportwale.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\",\"name\":\"itsupportwale\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png\",\"contentUrl\":\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png\",\"width\":1119,\"height\":144,\"caption\":\"itsupportwale\"},\"image\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\",\"name\":\"Techie\",\"sameAs\":[\"https:\/\/itsupportwale.com\",\"iswblogadmin\"],\"url\":\"https:\/\/itsupportwale.com\/blog\/author\/iswblogadmin\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"10 Essential AWS Best Practices for Cloud Optimization - ITSupportWale","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization\/","og_locale":"en_US","og_type":"article","og_title":"10 Essential AWS Best Practices for Cloud Optimization - ITSupportWale","og_description":"INTERNAL POST-MORTEM: INCIDENT #8842-BRAVO DATE: Monday, October 16, 2023 DURATION: 72 Hours, 14 Minutes TOTAL ESTIMATED LOSS: $85,422.19 (Infrastructure Egress + Compute Over-provisioning + Lost Revenue) STATUS: SEV-0 (Mitigated, Not Resolved) AUTHOR: Senior SRE (Platform Reliability Team) 1. THE INCIDENT SUMMARY TIMESTAMP: 2023-10-13 23:43:12 UTC The pager went off. I was three sips into a ... Read more","og_url":"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization\/","og_site_name":"ITSupportWale","article_publisher":"https:\/\/www.facebook.com\/Itsupportwale-298547177495978","article_published_time":"2026-02-02T15:45:19+00:00","article_modified_time":"2026-02-17T10:25:43+00:00","og_image":[{"width":512,"height":512,"url":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2021\/05\/android-chrome-512x512-1.png","type":"image\/png"}],"author":"Techie","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Techie","Est. 
reading time":"10 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization\/#article","isPartOf":{"@id":"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization\/"},"author":{"name":"Techie","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d"},"headline":"10 Essential AWS Best Practices for Cloud Optimization","datePublished":"2026-02-02T15:45:19+00:00","dateModified":"2026-02-17T10:25:43+00:00","mainEntityOfPage":{"@id":"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization\/"},"wordCount":1581,"commentCount":0,"publisher":{"@id":"https:\/\/itsupportwale.com\/blog\/#organization"},"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization\/","url":"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization\/","name":"10 Essential AWS Best Practices for Cloud Optimization - 
ITSupportWale","isPartOf":{"@id":"https:\/\/itsupportwale.com\/blog\/#website"},"datePublished":"2026-02-02T15:45:19+00:00","dateModified":"2026-02-17T10:25:43+00:00","breadcrumb":{"@id":"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/itsupportwale.com\/blog\/10-essential-aws-best-practices-for-cloud-optimization\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/itsupportwale.com\/blog\/"},{"@type":"ListItem","position":2,"name":"10 Essential AWS Best Practices for Cloud Optimization"}]},{"@type":"WebSite","@id":"https:\/\/itsupportwale.com\/blog\/#website","url":"https:\/\/itsupportwale.com\/blog\/","name":"ITSupportWale","description":"Tips, Tricks, Fixed-Errors, Tutorials &amp; 
Guides","publisher":{"@id":"https:\/\/itsupportwale.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/itsupportwale.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/itsupportwale.com\/blog\/#organization","name":"itsupportwale","url":"https:\/\/itsupportwale.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png","contentUrl":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png","width":1119,"height":144,"caption":"itsupportwale"},"image":{"@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/Itsupportwale-298547177495978"]},{"@type":"Person","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d","name":"Techie","sameAs":["https:\/\/itsupportwale.com","iswblogadmin"],"url":"https:\/\/itsupportwale.com\/blog\/author\/iswblogadmin\/"}]}},"_links":{"self":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts\/4489","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/comments?post=4489"}],"version-history":[{"count":3,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts\/4489\/revisions"}],"predecessor-version":[{"id":4622,"href":"https:\/\/itsuppor
twale.com\/blog\/wp-json\/wp\/v2\/posts\/4489\/revisions\/4622"}],"wp:attachment":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/media?parent=4489"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/categories?post=4489"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/tags?post=4489"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}