{"id":4795,"date":"2026-05-22T22:44:36","date_gmt":"2026-05-22T17:14:36","guid":{"rendered":"https:\/\/itsupportwale.com\/blog\/aws-best-practices-optimize-your-cloud-infrastructure\/"},"modified":"2026-05-22T22:44:36","modified_gmt":"2026-05-22T17:14:36","slug":"aws-best-practices-optimize-your-cloud-infrastructure","status":"publish","type":"post","link":"https:\/\/itsupportwale.com\/blog\/aws-best-practices-optimize-your-cloud-infrastructure\/","title":{"rendered":"AWS Best Practices: Optimize Your Cloud Infrastructure"},"content":{"rendered":"<p>This is not a &#8220;retrospective.&#8221; It is not a &#8220;learning opportunity.&#8221; It is a post-mortem of a preventable disaster that cost this company six figures in lost revenue and cost me ten years of my life expectancy. If I see one more &#8220;move fast and break things&#8221; sticker on a laptop in this office, I am going to lose what little remains of my sanity. <\/p>\n<p>We didn&#8217;t &#8220;break things.&#8221; We incinerated them. We took a decade of industry-standard reliability engineering and threw it into a woodchipper because someone thought Terraform modules were &#8220;too restrictive&#8221; and that &#8220;manual tweaks in the console&#8221; were faster for a Friday afternoon deployment. <\/p>\n<p>The following is a technical breakdown of the Great Infrastructure Meltdown of 2023. Read it. Internalize it. 
Because if we don&#8217;t start following AWS best practices immediately, the next time the pager goes off at 3:00 AM, I\u2019m not logging in; I\u2019m deleting my Slack account and moving to a farm in Vermont.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"H2_The_3_00_AM_Wake-up_Call_A_Timeline_of_Failure\"><\/span>The 3:00 AM Wake-up Call: A Timeline of Failure<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>It started with a single 503 error. Then ten. Then ten thousand. 
My PagerDuty alert didn&#8217;t just beep; it screamed. By 03:05 UTC, the entire US-EAST-1 footprint was a graveyard of timed-out requests and &#8220;Connection Refused&#8221; errors.<\/p>\n<p><strong>Incident Timeline (All times UTC):<\/strong><\/p>\n<ul>\n<li><strong>02:47:05<\/strong>: A junior developer, working on a &#8220;hotfix&#8221; for the legacy billing service, executes a local script using a compromised IAM credential that had full <code>AdministratorAccess<\/code>.<\/li>\n<li><strong>02:50:12<\/strong>: The script, intended to clear a &#8220;temp&#8221; bucket, instead targets the production S3 bucket containing our centralized Terraform state files and the primary application assets.<\/li>\n<li><strong>02:55:00<\/strong>: CloudWatch alarms trigger for the <code>api-gateway-prod<\/code> service. Latency spikes from 45ms to 12,000ms.<\/li>\n<li><strong>03:00:15<\/strong>: My pager goes off. I attempt to log into the AWS Console. I am met with an &#8220;Access Denied&#8221; error because the script also modified the IAM role I use for emergency access.<\/li>\n<li><strong>03:10:45<\/strong>: The Auto-Scaling Group (ASG) for the core microservices begins a &#8220;death spiral.&#8221; It detects unhealthy instances and attempts to replace them, but the launch templates are referencing AMI IDs that no longer exist in the registry.<\/li>\n<li><strong>03:22:30<\/strong>: The RDS Primary instance in <code>us-east-1a<\/code> suffers a storage failure. Because we were running a Single-AZ deployment to &#8220;save on data transfer costs,&#8221; there is no standby to fail over to.<\/li>\n<li><strong>04:00:00<\/strong>: Total blackout. Internal VPN is down. Public-facing API is down. The status page is down because it was hosted on the same infrastructure it was supposed to monitor.<\/li>\n<\/ul>\n<p>The logs from the initial failure were a mess of <code>AccessDenied<\/code> and <code>ResourceNotFoundException<\/code>. 
Here is what I saw when I finally regained read-only access to the CloudTrail logs:<\/p>\n<pre class=\"codehilite\"><code class=\"language-json\">{\n    &quot;eventVersion&quot;: &quot;1.08&quot;,\n    &quot;userIdentity&quot;: {\n        &quot;type&quot;: &quot;AssumedRole&quot;,\n        &quot;principalId&quot;: &quot;AROAEXAMPLE:dev-session-123&quot;,\n        &quot;arn&quot;: &quot;arn:aws:sts::123456789012:assumed-role\/FullAdminDev\/dev-session-123&quot;\n    },\n    &quot;eventTime&quot;: &quot;2023-10-14T02:50:12Z&quot;,\n    &quot;eventSource&quot;: &quot;s3.amazonaws.com&quot;,\n    &quot;eventName&quot;: &quot;DeleteBucket&quot;,\n    &quot;requestParameters&quot;: {\n        &quot;bucketName&quot;: &quot;prod-terraform-state-us-east-1&quot;\n    },\n    &quot;responseElements&quot;: null,\n    &quot;userAgent&quot;: &quot;aws-cli\/2.13.5 Python\/3.11.4 Linux\/5.15.0-76-generic exe\/x86_64.ubuntu.22&quot;,\n    &quot;errorCode&quot;: &quot;AccessDenied&quot;,\n    &quot;errorMessage&quot;: &quot;Access Denied&quot;\n}\n<\/code><\/pre>\n<p>Wait, look at that log. It says <code>AccessDenied<\/code> for the bucket deletion, but the script didn&#8217;t stop there. The <code>DeleteBucket<\/code> call was refused, yet nothing stopped the role from purging the <em>objects<\/em> inside the bucket: it still had <code>s3:DeleteObject<\/code> on <code>*<\/code>.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"H2_The_IAM_Policy_That_Ate_the_Production_Environment\"><\/span>The IAM Policy That Ate the Production Environment<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>We talk about the &#8220;Principle of Least Privilege&#8221; like it\u2019s some optional suggestion, like &#8220;floss every day&#8221; or &#8220;don&#8217;t eat raw cookie dough.&#8221; It\u2019s not. It\u2019s the only thing standing between a typo and a bankruptcy filing. 
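<\/p>\n<p>In practice, least privilege means naming the exact actions and the exact resources a role needs, and nothing more. What follows is a minimal sketch, not anything we actually had deployed: the bucket name <code>dev-temp-scratch<\/code> and the list of actions are assumptions chosen purely for illustration.<\/p>\n<pre class=\"codehilite\"><code class=\"language-json\">{\n    &quot;Version&quot;: &quot;2012-10-17&quot;,\n    &quot;Statement&quot;: [\n        {\n            &quot;Sid&quot;: &quot;ListScratchBucket&quot;,\n            &quot;Effect&quot;: &quot;Allow&quot;,\n            &quot;Action&quot;: &quot;s3:ListBucket&quot;,\n            &quot;Resource&quot;: &quot;arn:aws:s3:::dev-temp-scratch&quot;\n        },\n        {\n            &quot;Sid&quot;: &quot;ReadWriteScratchObjects&quot;,\n            &quot;Effect&quot;: &quot;Allow&quot;,\n            &quot;Action&quot;: [\n                &quot;s3:GetObject&quot;,\n                &quot;s3:PutObject&quot;,\n                &quot;s3:DeleteObject&quot;\n            ],\n            &quot;Resource&quot;: &quot;arn:aws:s3:::dev-temp-scratch\/*&quot;\n        }\n    ]\n}\n<\/code><\/pre>\n<p>A credential holding only a policy shaped like this can list and empty that one hypothetical scratch bucket. Pointed at the state bucket, the same cleanup script should get <code>AccessDenied<\/code> on every call instead of just the first one. 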
<\/p>\n<p>The &#8220;FullAdminDev&#8221; role used in the incident was a relic of the &#8220;we need to move fast&#8221; era. It was a JSON monstrosity that allowed anyone in the engineering org to do anything. We weren&#8217;t following AWS security best practices; we were running a digital Wild West.<\/p>\n<p>Here is the policy that allowed the destruction of our state files:<\/p>\n<pre class=\"codehilite\"><code class=\"language-json\">{\n    &quot;Version&quot;: &quot;2012-10-17&quot;,\n    &quot;Statement&quot;: [\n        {\n            &quot;Effect&quot;: &quot;Allow&quot;,\n            &quot;Action&quot;: [\n                &quot;s3:*&quot;,\n                &quot;ec2:*&quot;,\n                &quot;rds:*&quot;,\n                &quot;iam:*&quot;\n            ],\n            &quot;Resource&quot;: &quot;*&quot;\n        }\n    ]\n}\n<\/code><\/pre>\n<p>This is not a policy. This is a suicide note. By combining wildcard actions with <code>Resource: \"*\"<\/code>, we gave a single compromised key the power to wipe out the entire VPC, delete the RDS snapshots, and, most cruelly, modify the very IAM roles we would need to fix the mess. <\/p>\n<p>When the script ran, it didn&#8217;t just delete the Terraform state. It &#8220;drifted&#8221; the entire environment. When I tried to run a <code>terraform plan<\/code> from my local machine (using a backup state file I had to pull from a physical drive like it was 1998), the output was a horror show.<\/p>\n<pre class=\"codehilite\"><code class=\"language-text\">Terraform v1.5.7\non linux_amd64\nConfiguring remote state backend...\nInitializing modules...\n\nTerraform used the selected providers to generate the following execution plan. 
\nResource actions are indicated with the following symbols:\n  ~ update in-place\n  - destroy\n\nTerraform will perform the following actions:\n\n  # module.vpc.aws_vpc.main will be updated in-place\n  ~ resource &quot;aws_vpc&quot; &quot;main&quot; {\n        id                               = &quot;vpc-0a1b2c3d4e5f6g7h8&quot;\n      ~ enable_dns_support               = true -&gt; false\n      - tags                             = {\n          &quot;Environment&quot; = &quot;Production&quot;\n          &quot;ManagedBy&quot;   = &quot;Terraform&quot;\n        } -&gt; null\n    }\n\n  # module.rds.aws_db_instance.primary will be destroyed\n  # (because aws_db_instance.primary is not in configuration)\n  - resource &quot;aws_db_instance&quot; &quot;primary&quot; {\n      - id = &quot;prod-db-instance&quot; -&gt; null\n    }\n\nPlan: 0 to add, 1 to change, 45 to destroy.\n<\/code><\/pre>\n<p>Forty-five resources to destroy. The state was gone, the tags were gone, and the infrastructure was &#8220;orphaned.&#8221; We had no way to know what was actually running versus what Terraform <em>thought<\/em> was running. This is what happens when you treat IAM as an afterthought.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"H2_Networking_Spaghetti_and_the_VPC_Peering_Nightmare\"><\/span>Networking Spaghetti and the VPC Peering Nightmare<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>If the IAM failure was the spark, the networking configuration was the gasoline. For reasons that defy logic, our VPC was designed with a <code>\/16<\/code> CIDR block that overlapped with our legacy on-premises data center. To &#8220;fix&#8221; this, someone had implemented a series of VPC peering connections and static routes that looked like a bowl of spaghetti thrown against a wall.<\/p>\n<p>During the meltdown, the routing tables were corrupted. The &#8220;clever&#8221; script had iterated through all route tables and removed any route it didn&#8217;t recognize. 
Because our routing tables were modified manually over the last two years and never checked back into Git, the script saw them as &#8220;drift&#8221; and nuked them.<\/p>\n<p>I spent two hours trying to figure out why I couldn&#8217;t SSH into the bastion host. The reason? The route to the Internet Gateway (IGW) was gone.<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">$ aws ec2 describe-route-tables --route-table-ids rtb-0492834092834 --profile prod\n{\n    &quot;RouteTables&quot;: [\n        {\n            &quot;Associations&quot;: [],\n            &quot;PropagatingVgws&quot;: [],\n            &quot;RouteTableId&quot;: &quot;rtb-0492834092834&quot;,\n            &quot;Routes&quot;: [\n                {\n                    &quot;DestinationCidrBlock&quot;: &quot;10.0.0.0\/16&quot;,\n                    &quot;GatewayId&quot;: &quot;local&quot;,\n                    &quot;Origin&quot;: &quot;CreateRouteTable&quot;,\n                    &quot;State&quot;: &quot;active&quot;\n                }\n            ],\n            &quot;Tags&quot;: [],\n            &quot;VpcId&quot;: &quot;vpc-0a1b2c3d4e5f6g7h8&quot;\n        }\n    ]\n}\n<\/code><\/pre>\n<p>Notice anything missing? The <code>0.0.0.0\/0<\/code> route to the IGW is gone. The subnet was effectively blackholed. We had no ingress, no egress, and no hope. We were running in a dark room with no doors. <\/p>\n<p>The IP exhaustion was the final nail. Because we hadn&#8217;t planned our subnets correctly, the &#8220;death spiral&#8221; of the ASG (which I&#8217;ll get to in a moment) consumed every available IP in the private subnets. New instances couldn&#8217;t start because there were no addresses left in the pool. 
We were hit by a cascading failure where the networking layer was actively preventing the compute layer from recovering.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"H2_Why_Your_Auto-Scaling_Group_is_a_Lie\"><\/span>Why Your Auto-Scaling Group is a Lie<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Everyone loves Auto-Scaling Groups until they actually have to scale. Our ASG configuration was a masterclass in how <em>not<\/em> to build resilient systems. <\/p>\n<p>First, the health checks. We were using EC2 health checks instead of ELB health checks. For the uninitiated, an EC2 health check only cares whether the instance passes its EC2 status checks, which is roughly whether the VM is running and reachable. It doesn&#8217;t care if the Java application inside is stuck in a garbage collection loop or if the disk is 100% full. The instances were &#8220;healthy&#8221; according to AWS, but they were returning 500 errors to every single user.<\/p>\n<p>When we finally switched to ELB health checks mid-crisis, the ASG realized every single instance was failing. It did what it was programmed to do: it terminated all of them at once. <\/p>\n<p>This is called &#8220;flapping.&#8221; The ASG kills an instance, starts a new one, the new one fails the health check because the database is down, the ASG kills it again. 
Repeat until your AWS bill is the size of a small nation&#8217;s GDP.<\/p>\n<p>I ran this command to see the carnage:<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">$ aws autoscaling describe-scaling-activities --auto-scaling-group-name prod-api-asg --max-items 5\n{\n    &quot;Activities&quot;: [\n        {\n            &quot;ActivityId&quot;: &quot;82374-2342-234-234234&quot;,\n            &quot;AutoScalingGroupName&quot;: &quot;prod-api-asg&quot;,\n            &quot;Cause&quot;: &quot;At 2023-10-14T03:45:12Z an instance was started in response to a difference between desired and actual capacity.&quot;,\n            &quot;StartTime&quot;: &quot;2023-10-14T03:45:12.123Z&quot;,\n            &quot;EndTime&quot;: &quot;2023-10-14T03:46:01.000Z&quot;,\n            &quot;StatusCode&quot;: &quot;Failed&quot;,\n            &quot;StatusMessage&quot;: &quot;Instance i-0abcd1234efgh5678 was terminated because it failed ELB health checks.&quot;\n        }\n    ]\n}\n<\/code><\/pre>\n<p>The &#8220;Cause&#8221; was always the same. The instances couldn&#8217;t connect to the RDS instance because the RDS instance was in a different circle of hell. But the ASG didn&#8217;t know that. It just kept throwing wood into the fire, hoping the fire would eventually put itself out. <\/p>\n<p>We also had no &#8220;Cooldown&#8221; period defined. As soon as one instance died, another was spun up. We hit our account limits for EC2 instances within twenty minutes. When I finally tried to manually scale up a fleet of &#8220;safe&#8221; instances, I was met with: <code>You have reached your quota of Max Instances<\/code>. <\/p>\n<p>We were locked out of our own account by our own incompetence.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"H2_The_Database_Deadlock_RDS_and_the_Missing_Multi-AZ\"><\/span>The Database Deadlock: RDS and the Missing Multi-AZ<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Now we get to the heart of the data loss. 
Someone (and I have the Jira ticket saved for my legal defense) decided that we didn&#8217;t need Multi-AZ for the production RDS instance. &#8220;It&#8217;s twice the price,&#8221; they said. &#8220;The SLA is good enough,&#8221; they said.<\/p>\n<p>The SLA is a refund, not a time machine. When the underlying hardware in <code>us-east-1a<\/code> failed at 03:22 UTC, the database went offline. In a Multi-AZ setup, AWS would have detected this and flipped the DNS record to the standby in <code>us-east-1b<\/code>, typically within a minute or two. <\/p>\n<p>Instead, we had a &#8220;Single-AZ&#8221; instance that was now a brick. <\/p>\n<p>Because the earlier script had also &#8220;cleaned up&#8221; old snapshots to save on storage costs (again, &#8220;cost-saving&#8221; will be the death of this company), our latest usable snapshot was from 24 hours ago. We didn&#8217;t just lose uptime; we lost a full day of customer transactions. <\/p>\n<p>I had to manually initiate a restore from a snapshot. Do you know how long it takes to restore a 2TB Postgres database from an S3 snapshot during a regional brownout? It takes four hours and twelve minutes. Four hours of sitting in a Zoom room with executives who are asking &#8220;Is it done yet?&#8221; every thirty seconds.<\/p>\n<p>The AWS best practice here isn&#8217;t just &#8220;turn on Multi-AZ.&#8221; It&#8217;s &#8220;implement a multi-region failover strategy with automated point-in-time recovery (PITR).&#8221; We had neither. We had a single point of failure that we had intentionally weakened to save $400 a month.<\/p>\n<p>While the restore was running, I checked the metrics for the RDS instance. The IOPS were flatlined. We were using <code>gp2<\/code> volumes instead of <code>gp3<\/code> or <code>io2<\/code>, and we had exhausted our burst balance. 
The database wasn&#8217;t just down; it was suffocating.<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">$ aws rds describe-db-instances --db-instance-identifier prod-db-primary\n{\n    &quot;DBInstances&quot;: [\n        {\n            &quot;DBInstanceIdentifier&quot;: &quot;prod-db-primary&quot;,\n            &quot;DBInstanceStatus&quot;: &quot;failed&quot;,\n            &quot;Engine&quot;: &quot;postgres&quot;,\n            &quot;EngineVersion&quot;: &quot;15.3&quot;,\n            &quot;MultiAZ&quot;: false,\n            &quot;StorageType&quot;: &quot;gp2&quot;,\n            &quot;AllocatedStorage&quot;: 2000,\n            &quot;PendingModifiedValues&quot;: {}\n        }\n    ]\n}\n<\/code><\/pre>\n<p>The status &#8220;failed&#8221; is a very lonely word to see at 4:00 AM.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"H2_Hard_Lessons_and_the_Path_to_Infrastructure_Nihilism\"><\/span>Hard Lessons and the Path to Infrastructure Nihilism<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>We are currently back online, but we are running on &#8220;hope.&#8221; The infrastructure is a patchwork of manual fixes, temporary security groups, and a Terraform state file that I had to manually reconstruct using <code>terraform import<\/code> for over three hundred resources. <\/p>\n<p>I am tired. I am cynical. And I am finished with &#8220;clever&#8221; solutions. <\/p>\n<p>If we are to survive as an engineering organization, we must stop treating our infrastructure like a playground. We are not &#8220;embarking&#8221; on a journey; we are digging ourselves out of a hole. <\/p>\n<p>Here are my non-negotiable demands:<\/p>\n<ol>\n<li><strong>Total Terraform Rewrite<\/strong>: Every single resource must be defined in Terraform v1.5.7 or higher. No more manual changes. If it\u2019s not in Git, it doesn\u2019t exist. 
We will use <code>prevent_destroy<\/code> on all critical resources, including RDS and S3 state buckets.<\/li>\n<li><strong>IAM Overhaul<\/strong>: We will implement Service Control Policies (SCPs) at the AWS Organizations level to prevent anyone\u2014including me\u2014from deleting core infrastructure. We will move to short-lived credentials using AWS IAM Identity Center. No more long-lived access keys on developer laptops.<\/li>\n<li><strong>Observability as a Requirement<\/strong>: If a service doesn&#8217;t have a structured JSON log output, a health check endpoint that actually checks dependencies, and a CloudWatch dashboard, it doesn&#8217;t get deployed. <\/li>\n<li><strong>Redundancy is Not Optional<\/strong>: Multi-AZ is the bare minimum. We will begin testing cross-region failover for our core API. If the business thinks it\u2019s too expensive, they can calculate the cost of 8 hours of total downtime and get back to me.<\/li>\n<li><strong>Blast Radius Reduction<\/strong>: We will split our single &#8220;Production&#8221; account into multiple accounts using AWS Control Tower. Billing, Logging, Security, and Application workloads will live in separate sandboxes. If a dev key is compromised in the future, it should only be able to burn down a small shed, not the entire skyscraper.<\/li>\n<\/ol>\n<p>The &#8220;Great Infrastructure Meltdown of 2023&#8221; was not an act of God. It was not a &#8220;glitch&#8221; in AWS. It was a failure of engineering discipline. We prioritized speed over stability, and we paid the price in reputation and stress. <\/p>\n<p>I am going to sleep now. When I wake up, I expect to see a pull request for the new IAM permission boundaries. If I see a <code>Resource: \"*\"<\/code> in that PR, I\u2019m retiring to Vermont to milk cows. Cows don&#8217;t have APIs. Cows don&#8217;t have overlapping CIDR blocks. And cows certainly don&#8217;t page you at 3:00 AM because someone deleted the state file.<\/p>\n<p>Fix it. All of it. 
Now.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This is not a &#8220;retrospective.&#8221; It is not a &#8220;learning opportunity.&#8221; It is a post-mortem of a preventable disaster that cost this company six figures in lost revenue and cost me ten years of my life expectancy. If I see one more &#8220;move fast and break things&#8221; sticker on a laptop in this office, I &#8230; <a title=\"AWS Best Practices: Optimize Your Cloud Infrastructure\" class=\"read-more\" href=\"https:\/\/itsupportwale.com\/blog\/aws-best-practices-optimize-your-cloud-infrastructure\/\" aria-label=\"Read more  on AWS Best Practices: Optimize Your Cloud Infrastructure\">Read more<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-4795","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.0 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>AWS Best Practices: Optimize Your Cloud Infrastructure - ITSupportWale<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" 
href=\"https:\/\/itsupportwale.com\/blog\/aws-best-practices-optimize-your-cloud-infrastructure\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"AWS Best Practices: Optimize Your Cloud Infrastructure - ITSupportWale\" \/>\n<meta property=\"og:description\" content=\"This is not a &#8220;retrospective.&#8221; It is not a &#8220;learning opportunity.&#8221; It is a post-mortem of a preventable disaster that cost this company six figures in lost revenue and cost me ten years of my life expectancy. If I see one more &#8220;move fast and break things&#8221; sticker on a laptop in this office, I ... Read more\" \/>\n<meta property=\"og:url\" content=\"https:\/\/itsupportwale.com\/blog\/aws-best-practices-optimize-your-cloud-infrastructure\/\" \/>\n<meta property=\"og:site_name\" content=\"ITSupportWale\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\" \/>\n<meta property=\"article:published_time\" content=\"2026-05-22T17:14:36+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2021\/05\/android-chrome-512x512-1.png\" \/>\n\t<meta property=\"og:image:width\" content=\"512\" \/>\n\t<meta property=\"og:image:height\" content=\"512\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Techie\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Techie\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"12 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/aws-best-practices-optimize-your-cloud-infrastructure\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/aws-best-practices-optimize-your-cloud-infrastructure\/\"},\"author\":{\"name\":\"Techie\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\"},\"headline\":\"AWS Best Practices: Optimize Your Cloud Infrastructure\",\"datePublished\":\"2026-05-22T17:14:36+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/aws-best-practices-optimize-your-cloud-infrastructure\/\"},\"wordCount\":2019,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/aws-best-practices-optimize-your-cloud-infrastructure\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/aws-best-practices-optimize-your-cloud-infrastructure\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/aws-best-practices-optimize-your-cloud-infrastructure\/\",\"name\":\"AWS Best Practices: Optimize Your Cloud Infrastructure - 
ITSupportWale\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\"},\"datePublished\":\"2026-05-22T17:14:36+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/aws-best-practices-optimize-your-cloud-infrastructure\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/aws-best-practices-optimize-your-cloud-infrastructure\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/aws-best-practices-optimize-your-cloud-infrastructure\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/itsupportwale.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"AWS Best Practices: Optimize Your Cloud Infrastructure\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"name\":\"ITSupportWale\",\"description\":\"Tips, Tricks, Fixed-Errors, Tutorials &amp; 
Guides\",\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/itsupportwale.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\",\"name\":\"itsupportwale\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png\",\"contentUrl\":\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png\",\"width\":1119,\"height\":144,\"caption\":\"itsupportwale\"},\"image\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\",\"name\":\"Techie\",\"sameAs\":[\"https:\/\/itsupportwale.com\",\"iswblogadmin\"],\"url\":\"https:\/\/itsupportwale.com\/blog\/author\/iswblogadmin\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"AWS Best Practices: Optimize Your Cloud Infrastructure - ITSupportWale","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/itsupportwale.com\/blog\/aws-best-practices-optimize-your-cloud-infrastructure\/","og_locale":"en_US","og_type":"article","og_title":"AWS Best Practices: Optimize Your Cloud Infrastructure - ITSupportWale","og_description":"This is not a &#8220;retrospective.&#8221; It is not a &#8220;learning opportunity.&#8221; It is a post-mortem of a preventable disaster that cost this company six figures in lost revenue and cost me ten years of my life expectancy. If I see one more &#8220;move fast and break things&#8221; sticker on a laptop in this office, I ... Read more","og_url":"https:\/\/itsupportwale.com\/blog\/aws-best-practices-optimize-your-cloud-infrastructure\/","og_site_name":"ITSupportWale","article_publisher":"https:\/\/www.facebook.com\/Itsupportwale-298547177495978","article_published_time":"2026-05-22T17:14:36+00:00","og_image":[{"width":512,"height":512,"url":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2021\/05\/android-chrome-512x512-1.png","type":"image\/png"}],"author":"Techie","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Techie","Est. 
reading time":"12 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/itsupportwale.com\/blog\/aws-best-practices-optimize-your-cloud-infrastructure\/#article","isPartOf":{"@id":"https:\/\/itsupportwale.com\/blog\/aws-best-practices-optimize-your-cloud-infrastructure\/"},"author":{"name":"Techie","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d"},"headline":"AWS Best Practices: Optimize Your Cloud Infrastructure","datePublished":"2026-05-22T17:14:36+00:00","mainEntityOfPage":{"@id":"https:\/\/itsupportwale.com\/blog\/aws-best-practices-optimize-your-cloud-infrastructure\/"},"wordCount":2019,"commentCount":0,"publisher":{"@id":"https:\/\/itsupportwale.com\/blog\/#organization"},"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/itsupportwale.com\/blog\/aws-best-practices-optimize-your-cloud-infrastructure\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/itsupportwale.com\/blog\/aws-best-practices-optimize-your-cloud-infrastructure\/","url":"https:\/\/itsupportwale.com\/blog\/aws-best-practices-optimize-your-cloud-infrastructure\/","name":"AWS Best Practices: Optimize Your Cloud Infrastructure - ITSupportWale","isPartOf":{"@id":"https:\/\/itsupportwale.com\/blog\/#website"},"datePublished":"2026-05-22T17:14:36+00:00","breadcrumb":{"@id":"https:\/\/itsupportwale.com\/blog\/aws-best-practices-optimize-your-cloud-infrastructure\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/itsupportwale.com\/blog\/aws-best-practices-optimize-your-cloud-infrastructure\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/itsupportwale.com\/blog\/aws-best-practices-optimize-your-cloud-infrastructure\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/itsupportwale.com\/blog\/"},{"@type":"ListItem","position":2,"name":"AWS Best Practices: Optimize Your 
Cloud Infrastructure"}]},{"@type":"WebSite","@id":"https:\/\/itsupportwale.com\/blog\/#website","url":"https:\/\/itsupportwale.com\/blog\/","name":"ITSupportWale","description":"Tips, Tricks, Fixed-Errors, Tutorials &amp; Guides","publisher":{"@id":"https:\/\/itsupportwale.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/itsupportwale.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/itsupportwale.com\/blog\/#organization","name":"itsupportwale","url":"https:\/\/itsupportwale.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png","contentUrl":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png","width":1119,"height":144,"caption":"itsupportwale"},"image":{"@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/Itsupportwale-298547177495978"]},{"@type":"Person","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d","name":"Techie","sameAs":["https:\/\/itsupportwale.com","iswblogadmin"],"url":"https:\/\/itsupportwale.com\/blog\/author\/iswblogadmin\/"}]}},"_links":{"self":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts\/4795","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/itsupportwale.com\
/blog\/wp-json\/wp\/v2\/comments?post=4795"}],"version-history":[{"count":0,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts\/4795\/revisions"}],"wp:attachment":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/media?parent=4795"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/categories?post=4795"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/tags?post=4795"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}