{"id":4766,"date":"2026-04-20T21:48:38","date_gmt":"2026-04-20T16:18:38","guid":{"rendered":"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/"},"modified":"2026-04-20T21:48:38","modified_gmt":"2026-04-20T16:18:38","slug":"10-devops-best-practices-for-faster-software-delivery-3","status":"publish","type":"post","link":"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/","title":{"rendered":"10 DevOps Best Practices for Faster Software Delivery"},"content":{"rendered":"<p><strong>INCIDENT LOG: OCTOBER 14, 2023 \u2013 THE DAY THE YAML SCREAMED<\/strong><\/p>\n<pre class=\"codehilite\"><code class=\"language-text\">[14:02:11] [INFO] CI\/CD Pipeline #8842 initiated by 'j-dev-99'. Branch: 'fix\/cleanup-unused-resources'.\n[14:04:45] [DEBUG] Terraform v1.7.0: Initializing provider plugins...\n[14:05:12] [WARN] Terraform: Plan shows 142 resources to be deleted.\n[14:05:13] [INFO] CI\/CD: Manual approval bypassed. (Flag: --auto-approve-in-prod set to 'true' by 'j-dev-99').\n[14:05:20] [ERROR] terraform-provider-aws: Deleting rds_instance.prod_db_primary...\n[14:05:45] [CRITICAL] RDS: Instance 'prod-db-01' deleted. No final snapshot requested.\n[14:06:01] [ALERT] Prometheus: ALERTS{alertname=&quot;PostgresDown&quot;, severity=&quot;critical&quot;} fired.\n[14:06:15] [SYSTEM] K8s v1.29: Pod 'api-gateway-7f8d9b' entering CrashLoopBackOff. Reason: ConnectionRefused.\n[14:08:30] [KERNEL] [77482.12] Out of memory: Kill process 12442 (node) score 950 or sacrifice child.\n[14:08:31] [KERNEL] [77482.15] oom_reaper: reaped process 12442 (node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB\n[14:10:00] [NAGIOS] CRITICAL: Production API is 100% packet loss.\n[14:12:45] [SLACK] #ops-war-room: Gus, are you awake? 
Everything is gone.\n<\/code><\/pre>\n<hr \/>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_80 counter-hierarchy ez-toc-counter ez-toc-grey ez-toc-container-direction\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Table of Contents<\/p>\n<label for=\"ez-toc-cssicon-toggle-item-69e9fb60edad1\" class=\"ez-toc-cssicon-toggle-label\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #999;color:#999\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewBox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #999;color:#999\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewBox=\"0 0 24 24\" version=\"1.2\" baseProfile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/label><input type=\"checkbox\"  id=\"ez-toc-cssicon-toggle-item-69e9fb60edad1\"  aria-label=\"Toggle\" \/><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/#THE_AUTOPSY_WHY_YOUR_%E2%80%9CAGILE%E2%80%9D_WORKFLOW_IS_A_TRASH_FIRE\" >THE AUTOPSY: WHY YOUR &#8220;AGILE&#8221; WORKFLOW IS A TRASH FIRE<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-2\" 
href=\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/#THE_MANIFESTO_REAL-WORLD_DISCIPLINE_FOR_THE_MODERN_SYSTEMS_ENGINEER\" >THE MANIFESTO: REAL-WORLD DISCIPLINE FOR THE MODERN SYSTEMS ENGINEER<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/#1_Infrastructure_as_Code_is_a_Loaded_Gun_State_Management_and_Locking\" >1. Infrastructure as Code is a Loaded Gun (State Management and Locking)<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/#2_Kubernetes_is_Not_a_Magic_Wand_Resource_Limits_and_the_OOM_Killer\" >2. Kubernetes is Not a Magic Wand (Resource Limits and the OOM Killer)<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/#3_Observability_is_More_Than_Just_Pretty_Dashboards\" >3. Observability is More Than Just Pretty Dashboards<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/#4_The_CICD_Pipeline_The_Great_Security_Hole\" >4. The CI\/CD Pipeline: The Great Security Hole<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/#5_Networking_You_Cant_Abstraction-Layer_Your_Way_Out_of_Physics\" >5. 
Networking: You Can&#8217;t Abstraction-Layer Your Way Out of Physics<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/#6_The_Human_Element_Documentation_and_Toil\" >6. The Human Element: Documentation and Toil<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/#THE_DEEP_DIVE_STATEFULNESS_IN_AN_EPHEMERAL_WORLD\" >THE DEEP DIVE: STATEFULNESS IN AN EPHEMERAL WORLD<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-10\" href=\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/#THE_KERNEL_OF_THE_TRUTH_WHY_YOUR_STACK_IS_SLOW\" >THE KERNEL OF THE TRUTH: WHY YOUR STACK IS SLOW<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-11\" href=\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/#GET_OFF_MY_LAWN_A_FINAL_WARNING\" >GET OFF MY LAWN: A FINAL WARNING<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-12\" href=\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/#Related_Articles\" >Related Articles<\/a><\/li><\/ul><\/nav><\/div>\n<h3><span class=\"ez-toc-section\" id=\"THE_AUTOPSY_WHY_YOUR_%E2%80%9CAGILE%E2%80%9D_WORKFLOW_IS_A_TRASH_FIRE\"><\/span>THE AUTOPSY: WHY YOUR &#8220;AGILE&#8221; WORKFLOW IS A TRASH FIRE<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>I\u2019ve been staring at green phosphors and liquid crystal displays since before some of you were born. 
I\u2019ve seen the transition from physical rack-and-stack to the ephemeral nightmare we call &#8220;The Cloud.&#8221; And let me tell you, the Great 2023 Meltdown wasn&#8217;t an accident. It was a mathematical certainty.<\/p>\n<p>The log above is what happens when you give a toddler a flamethrower. We had a junior developer\u2014bless his heart, he\u2019s &#8220;certified&#8221; in three different clouds but couldn&#8217;t tell you what an inode is\u2014who decided to &#8220;clean up&#8221; the Terraform state. Because we\u2019ve replaced actual systems engineering with &#8220;copy-pasting YAML from StackOverflow,&#8221; he didn&#8217;t realize that the <code>prevent_destroy<\/code> lifecycle argument had been commented out &#8220;temporarily&#8221; six months ago during a migration.<\/p>\n<p>When the RDS instance vanished, the application layer didn&#8217;t just fail; it panicked. Every single Node.js pod in our Kubernetes v1.29 cluster started a retry loop with no exponential backoff. Within ninety seconds, we hit TCP socket exhaustion. The kernel, trying to keep up with the thousands of sockets stuck in <code>SYN_SENT<\/code>, started eating memory like a starving hog. Then the OOM Killer stepped in, and instead of a graceful failure, we had a cluster-wide execution squad.<\/p>\n<p>This happened because you lot think &#8220;DevOps&#8221; is a job title or a set of tools. It isn&#8217;t. It\u2019s a discipline of paranoia. You want to talk about &#8220;devops best&#8221; practices? Fine. 
Sit down, shut up, and learn how to actually run a system that doesn&#8217;t fall over when someone sneezes on the CI\/CD pipeline.<\/p>\n<hr \/>\n<h3><span class=\"ez-toc-section\" id=\"THE_MANIFESTO_REAL-WORLD_DISCIPLINE_FOR_THE_MODERN_SYSTEMS_ENGINEER\"><\/span>THE MANIFESTO: REAL-WORLD DISCIPLINE FOR THE MODERN SYSTEMS ENGINEER<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<h2><span class=\"ez-toc-section\" id=\"1_Infrastructure_as_Code_is_a_Loaded_Gun_State_Management_and_Locking\"><\/span>1. Infrastructure as Code is a Loaded Gun (State Management and Locking)<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Everyone loves Terraform until the state file gets corrupted or someone runs a plan that deletes the VPC. In the 2023 meltdown, the primary failure was the bypass of the state lock and the lack of a &#8220;human-in-the-loop&#8221; for destructive changes.<\/p>\n<p>If you are using Terraform v1.7.0, you have no excuse for not using <code>removed<\/code> blocks, which take a resource out of Terraform&#8217;s management without destroying the real thing, instead of just deleting resource definitions. Furthermore, your state must be stored in a backend that supports strong consistency and locking (like S3 with DynamoDB). But more importantly, you need to treat your production environment like a nuclear reactor. You don&#8217;t just &#8220;auto-approve&#8221; changes to the core data layer.<\/p>\n<pre class=\"codehilite\"><code class=\"language-hcl\"># This is what should have been in the RDS module\nresource &quot;aws_db_instance&quot; &quot;prod_db&quot; {\n  identifier           = &quot;prod-db-01&quot;\n  # ... other config ...\n\n  lifecycle {\n    prevent_destroy = true # YOU DO NOT REMOVE THIS WITHOUT A SIGNED WAIVER\n  }\n}\n<\/code><\/pre>\n<p>The &#8220;devops best&#8221; approach here is to implement a &#8220;Two-Key&#8221; system. No destructive action on a production environment should be possible via a single CI\/CD token. 
You need a separate, highly restricted pipeline for state-changing operations that requires a manual cryptographic sign-off from a senior engineer who actually understands the dependency graph.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"2_Kubernetes_is_Not_a_Magic_Wand_Resource_Limits_and_the_OOM_Killer\"><\/span>2. Kubernetes is Not a Magic Wand (Resource Limits and the OOM Killer)<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The meltdown was exacerbated by the fact that our K8s v1.29 cluster was configured by someone who thinks &#8220;Limits&#8221; and &#8220;Requests&#8221; are suggestions. When the DB went down, the pods started consuming CPU cycles trying to re-establish TLS handshakes. Because <code>resources.limits.memory<\/code> was set too close to <code>resources.requests.memory<\/code>, the Linux kernel&#8217;s Out-Of-Memory (OOM) killer started sniping processes.<\/p>\n<p>You need to understand <code>oom_score_adj<\/code>. When the kernel is low on memory, it looks for processes to kill to save the system. Kubernetes tries to manage this by setting <code>oom_score_adj<\/code> from the pod&#8217;s QoS class (Guaranteed pods get -997, BestEffort pods get 1000, Burstable pods land in between), but if you haven&#8217;t tuned your <code>sysctl<\/code> parameters, the kernel will win.<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\"># Check the OOM score of a running container process\n# (pgrep -o picks the oldest match so the path resolves to a single PID)\ncat \/proc\/$(pgrep -o node)\/oom_score\n# If this is high, your process is the first to die.\n<\/code><\/pre>\n<p>In a &#8220;devops best&#8221; scenario, you don&#8217;t just set limits; you profile your application under failure conditions. What happens to your memory footprint when the backend is unreachable? If it spikes, your &#8220;self-healing&#8221; cluster will just become a &#8220;self-destructing&#8221; cluster as it enters a death spiral of killing and restarting pods, putting even more load on the API server and the Kubelet.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"3_Observability_is_More_Than_Just_Pretty_Dashboards\"><\/span>3. 
Observability is More Than Just Pretty Dashboards<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>During the outage, the &#8220;DevOps Team&#8221; was staring at a Grafana dashboard that was showing 100% CPU usage. No kidding. We knew it was broken. What we didn&#8217;t know was <em>why<\/em> the network stack was dropping packets before they even reached the application.<\/p>\n<p>We had to drop into the shell and look at the actual kernel metrics. We found that the SYN queue had hit its <code>net.ipv4.tcp_max_syn_backlog<\/code> ceiling. The &#8220;shiny&#8221; monitoring tools didn&#8217;t catch this because they were only looking at the application layer.<\/p>\n<pre class=\"codehilite\"><code class=\"language-promql\"># A real query to catch listen-queue drops before they kill you\nrate(node_netstat_TcpExt_ListenDrops[5m]) &gt; 0\n<\/code><\/pre>\n<p>If you aren&#8217;t monitoring <code>TIME_WAIT<\/code> sockets and <code>ListenDrops<\/code>, you aren&#8217;t doing &#8220;devops best&#8221; monitoring; you&#8217;re just playing with crayons. You need to be looking at the <code>conntrack<\/code> table size. If your microservices are creating thousands of short-lived connections, you will hit the <code>nf_conntrack_max<\/code> limit, and your &#8220;highly available&#8221; system will start dropping packets like a lead balloon.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"4_The_CICD_Pipeline_The_Great_Security_Hole\"><\/span>4. The CI\/CD Pipeline: The Great Security Hole<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The junior dev bypassed the manual approval because the CI\/CD configuration (a 500-line YAML file that no one fully understands) had a conditional logic error. It checked if the branch name started with <code>fix\/<\/code> and, if so, skipped the approval step to &#8220;increase velocity.&#8221;<\/p>\n<p>Velocity is how fast you hit the wall.<\/p>\n<p>A &#8220;devops best&#8221; pipeline is built on the principle of least privilege. 
The CI\/CD runner should not have <code>AdministratorAccess<\/code> to your AWS account. It should have a scoped IAM role that can only modify specific resource types. And for the love of Ken Thompson, use OpenID Connect (OIDC) instead of long-lived access keys stored in GitHub secrets.<\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\"># Example of hardened GitHub Actions workflow permissions\npermissions:\n  id-token: write\n  contents: read\n# Use OIDC to get temporary credentials, don't store static keys!\n<\/code><\/pre>\n<p>If your pipeline can delete a database, and that pipeline can be triggered by a single git push to a branch with a specific name, you don&#8217;t have a workflow; you have a vulnerability.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"5_Networking_You_Cant_Abstraction-Layer_Your_Way_Out_of_Physics\"><\/span>5. Networking: You Can&#8217;t Abstraction-Layer Your Way Out of Physics<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The 2023 meltdown showed a complete lack of understanding of the OSI model. When the pods started failing, the Ingress Controller (Nginx) began returning 504s. But because the internal service mesh was trying to be &#8220;smart,&#8221; it kept retrying the connection, which led to an amplification attack against our own internal infrastructure.<\/p>\n<p>We saw <code>net.core.somaxconn<\/code> limits being hit on the worker nodes. The default value is 128 on older kernels and 4096 since Linux 5.4, which is laughable for a high-traffic production environment.<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\"># Tune the kernel for high concurrency\nsysctl -w net.core.somaxconn=65535\nsysctl -w net.ipv4.ip_local_port_range=&quot;1024 65535&quot;\nsysctl -w net.ipv4.tcp_tw_reuse=1\n<\/code><\/pre>\n<p>&#8220;Devops best&#8221; practices require you to understand that underneath your &#8220;serverless&#8221; functions and &#8220;containerized&#8221; apps, there is a Linux kernel trying to manage buffers and file descriptors. 
If you don&#8217;t raise the per-process open-file limit (<code>ulimit -n<\/code>) along with <code>fs.file-max<\/code>, your &#8220;infinitely scalable&#8221; app will die at 1,024 open files.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"6_The_Human_Element_Documentation_and_Toil\"><\/span>6. The Human Element: Documentation and Toil<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>When the database died, no one knew where the latest manual backup was. Why? Because the &#8220;automated&#8221; backup script had been failing for three weeks, and the alerts were being routed to a Slack channel that everyone had muted because of &#8220;noise.&#8221;<\/p>\n<p>This is &#8220;toil&#8221;\u2014the kind of mind-numbing, manual work that kills an engineer&#8217;s soul. We had a &#8220;Runbook,&#8221; but it was a Confluence page that hadn&#8217;t been updated since 2019.<\/p>\n<p>A &#8220;devops best&#8221; approach to documentation is &#8220;Documentation as Code.&#8221; If your recovery procedure isn&#8217;t a script that is tested weekly in a staging environment, it doesn&#8217;t exist. You don&#8217;t have a backup unless you have successfully performed a restore in the last seven days. Period.<\/p>\n<hr \/>\n<h3><span class=\"ez-toc-section\" id=\"THE_DEEP_DIVE_STATEFULNESS_IN_AN_EPHEMERAL_WORLD\"><\/span>THE DEEP DIVE: STATEFULNESS IN AN EPHEMERAL WORLD<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Let&#8217;s talk about the &#8220;State&#8221; problem. The industry has spent the last decade trying to pretend that state doesn&#8217;t exist. &#8220;Make everything stateless!&#8221; they cry. But at the bottom of every stack, there is a disk. And that disk has bits on it that you cannot afford to lose.<\/p>\n<p>In the 2023 meltdown, we lost the primary RDS instance. The &#8220;shiny&#8221; solution would have been a Multi-AZ failover. But guess what? The Terraform change deleted the <em>entire<\/em> DB cluster, including the replicas. 
Because the code defined the cluster, and the code said &#8220;delete,&#8221; the cloud provider dutifully obeyed.<\/p>\n<p>This is where the &#8220;devops best&#8221; practice of <strong>Data Gravity<\/strong> and <strong>State Isolation<\/strong> comes in. Your data layer should never be in the same lifecycle as your application layer. You don&#8217;t put your RDS instance in the same Terraform module as your EKS cluster. You isolate the data. You create it once, you protect it with every lock available, and you treat it as a permanent fixture.<\/p>\n<p>If you had used Terraform 1.7.0&#8217;s <code>import<\/code> blocks correctly, or used <code>moved<\/code> blocks to rename resources without destruction, we wouldn&#8217;t have been in that mess. But no, everyone wanted to &#8220;move fast.&#8221;<\/p>\n<pre class=\"codehilite\"><code class=\"language-hcl\"># Terraform 1.7.0 refactoring - use this instead of deleting!\nmoved {\n  from = aws_db_instance.old_identifier\n  to   = aws_db_instance.new_identifier\n}\n<\/code><\/pre>\n<p>And let\u2019s talk about the disk I\/O. When we finally got a new instance up and started the restore from a snapshot, the performance was abysmal. Why? Because of EBS burst balances and I\/O credits. The junior devs were confused\u2014&#8221;But we have 1TB of storage!&#8221; Yeah, but you&#8217;re on a <code>gp2<\/code> volume that you just initialized, and you&#8217;re hitting the &#8220;first-touch&#8221; penalty where every block has to be pulled from S3.<\/p>\n<p>If you knew your &#8220;devops best&#8221; practices, you&#8217;d know to use <code>gp3<\/code> with provisioned throughput or to &#8220;warm&#8221; your EBS volumes before throwing production traffic at them. 
But that requires reading the documentation, and who has time for that when there are new JavaScript frameworks to learn?<\/p>\n<hr \/>\n<h3><span class=\"ez-toc-section\" id=\"THE_KERNEL_OF_THE_TRUTH_WHY_YOUR_STACK_IS_SLOW\"><\/span>THE KERNEL OF THE TRUTH: WHY YOUR STACK IS SLOW<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>During the recovery, we noticed that even after the DB was back, the API response times were triple what they should be. The &#8220;DevOps&#8221; team suggested adding more nodes to the K8s cluster. More bloat. More cruft.<\/p>\n<p>I logged into a worker node and ran <code>vmstat 1<\/code>. The <code>cs<\/code> (context switches) and <code>in<\/code> (interrupts) columns were off the charts. We were suffering from &#8220;noisy neighbor&#8221; syndrome at the CPU cache level because we had too many small pods crammed onto too few large nodes.<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\"># vmstat output showing high context switching\nprocs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----\n r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st\n 8  0      0 245678  12345 678901    0    0     4    45 9000 15000 45 25 30  0  0\n<\/code><\/pre>\n<p>The &#8220;devops best&#8221; fix wasn&#8217;t &#8220;more nodes.&#8221; It was setting CPU affinity and using the <code>static<\/code> CPU Manager policy in K8s to give our critical pods (Guaranteed QoS, integer CPU requests) exclusive access to physical cores. We also tuned the <code>transparent_hugepage<\/code> setting, which was causing a 10% latency overhead on our Postgres-heavy workloads.<\/p>\n<p>You see, &#8220;DevOps&#8221; isn&#8217;t about the YAML. It&#8217;s about knowing that <code>transparent_hugepage=always<\/code> is a trap for databases. 
It&#8217;s about knowing that <code>net.ipv4.tcp_slow_start_after_idle<\/code> should be set to 0 if you want to maintain throughput on long-lived connections that sit idle between bursts.<\/p>\n<hr \/>\n<h3><span class=\"ez-toc-section\" id=\"GET_OFF_MY_LAWN_A_FINAL_WARNING\"><\/span>GET OFF MY LAWN: A FINAL WARNING<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>We spent sixteen hours fixing a mess that took sixteen seconds to create. We lost data, we lost money, and more importantly, we lost the trust of our users.<\/p>\n<p>The &#8220;Move Fast and Break Things&#8221; crowd has had their fun. They&#8217;ve built a spaghetti-code nightmare of interconnected microservices that no single human can comprehend. They&#8217;ve replaced &#8220;understanding the system&#8221; with &#8220;restarting the pod.&#8221;<\/p>\n<p>If you want to practice &#8220;devops best&#8221; disciplines, start by respecting the machine. Stop treating the infrastructure as a disposable toy. Learn how the Linux kernel handles memory. Learn how TCP works. Learn why a single point of failure in a CI\/CD pipeline is worse than a single point of failure in a server rack.<\/p>\n<p>Go back to basics. Use Terraform 1.7.0, but use it with the caution of a bomb squad. Use Kubernetes v1.29, but tune your sysctls like you&#8217;re racing a Formula 1 car. And for the love of all that is holy, stop using <code>--auto-approve<\/code> in production.<\/p>\n<p>Now, if you&#8217;ll excuse me, I have some Perl scripts to maintain and a cloud bill to yell at. Don&#8217;t call me unless the kernel panics. 
And even then, check your <code>dmesg<\/code> first.<\/p>\n<p><strong>Gus<\/strong><br \/>\n<em>Senior Systems Engineer (Retired, but they keep dragging me back)<\/em><br \/>\n<em>October 2023<\/em><\/p>\n<h2><span class=\"ez-toc-section\" id=\"Related_Articles\"><\/span>Related Articles<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Explore more insights and best practices:<\/p>\n<ul>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery\/\">10 Devops Best Practices For Faster Software Delivery<\/a><\/li>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/openvpn-pfsense-2-4-setup-in-simple-steps\/\">Openvpn Pfsense 2 4 Setup In Simple Steps<\/a><\/li>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/how-to-upgrade-to-python-3-10-on-ubuntu-18-04-and-20-04-lts\/\">How To Upgrade To Python 3 10 On Ubuntu 18 04 And 20 04 Lts<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>INCIDENT LOG: OCTOBER 14, 2023 \u2013 THE DAY THE YAML SCREAMED [14:02:11] [INFO] CI\/CD Pipeline #8842 initiated by &#8216;j-dev-99&#8217;. Branch: &#8216;fix\/cleanup-unused-resources&#8217;. [14:04:45] [DEBUG] Terraform v1.7.0: Initializing provider plugins&#8230; [14:05:12] [WARN] Terraform: Plan shows 142 resources to be deleted. [14:05:13] [INFO] CI\/CD: Manual approval bypassed. (Flag: &#8211;auto-approve-in-prod set to &#8216;true&#8217; by &#8216;j-dev-99&#8217;). 
[14:05:20] [ERROR] terraform-provider-aws: Deleting &#8230; <a title=\"10 DevOps Best Practices for Faster Software Delivery\" class=\"read-more\" href=\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/\" aria-label=\"Read more  on 10 DevOps Best Practices for Faster Software Delivery\">Read more<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-4766","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.0 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>10 DevOps Best Practices for Faster Software Delivery - ITSupportWale<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"10 DevOps Best Practices for Faster Software Delivery - ITSupportWale\" \/>\n<meta property=\"og:description\" content=\"INCIDENT LOG: OCTOBER 14, 2023 \u2013 THE DAY THE YAML SCREAMED [14:02:11] [INFO] CI\/CD Pipeline #8842 initiated by &#039;j-dev-99&#039;. Branch: &#039;fix\/cleanup-unused-resources&#039;. [14:04:45] [DEBUG] Terraform v1.7.0: Initializing provider plugins... [14:05:12] [WARN] Terraform: Plan shows 142 resources to be deleted. [14:05:13] [INFO] CI\/CD: Manual approval bypassed. (Flag: --auto-approve-in-prod set to &#039;true&#039; by &#039;j-dev-99&#039;). [14:05:20] [ERROR] terraform-provider-aws: Deleting ... 
Read more\" \/>\n<meta property=\"og:url\" content=\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/\" \/>\n<meta property=\"og:site_name\" content=\"ITSupportWale\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\" \/>\n<meta property=\"article:published_time\" content=\"2026-04-20T16:18:38+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2021\/05\/android-chrome-512x512-1.png\" \/>\n\t<meta property=\"og:image:width\" content=\"512\" \/>\n\t<meta property=\"og:image:height\" content=\"512\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Techie\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Techie\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"11 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/\"},\"author\":{\"name\":\"Techie\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\"},\"headline\":\"10 DevOps Best Practices for Faster Software Delivery\",\"datePublished\":\"2026-04-20T16:18:38+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/\"},\"wordCount\":1957,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/\",\"name\":\"10 DevOps Best Practices for Faster Software Delivery - 
ITSupportWale\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\"},\"datePublished\":\"2026-04-20T16:18:38+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/itsupportwale.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"10 DevOps Best Practices for Faster Software Delivery\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"name\":\"ITSupportWale\",\"description\":\"Tips, Tricks, Fixed-Errors, Tutorials &amp; 
Guides\",\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/itsupportwale.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\",\"name\":\"itsupportwale\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png\",\"contentUrl\":\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png\",\"width\":1119,\"height\":144,\"caption\":\"itsupportwale\"},\"image\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\",\"name\":\"Techie\",\"sameAs\":[\"https:\/\/itsupportwale.com\",\"iswblogadmin\"],\"url\":\"https:\/\/itsupportwale.com\/blog\/author\/iswblogadmin\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"10 DevOps Best Practices for Faster Software Delivery - ITSupportWale","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/","og_locale":"en_US","og_type":"article","og_title":"10 DevOps Best Practices for Faster Software Delivery - ITSupportWale","og_description":"INCIDENT LOG: OCTOBER 14, 2023 \u2013 THE DAY THE YAML SCREAMED [14:02:11] [INFO] CI\/CD Pipeline #8842 initiated by 'j-dev-99'. Branch: 'fix\/cleanup-unused-resources'. [14:04:45] [DEBUG] Terraform v1.7.0: Initializing provider plugins... [14:05:12] [WARN] Terraform: Plan shows 142 resources to be deleted. [14:05:13] [INFO] CI\/CD: Manual approval bypassed. (Flag: --auto-approve-in-prod set to 'true' by 'j-dev-99'). [14:05:20] [ERROR] terraform-provider-aws: Deleting ... Read more","og_url":"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/","og_site_name":"ITSupportWale","article_publisher":"https:\/\/www.facebook.com\/Itsupportwale-298547177495978","article_published_time":"2026-04-20T16:18:38+00:00","og_image":[{"width":512,"height":512,"url":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2021\/05\/android-chrome-512x512-1.png","type":"image\/png"}],"author":"Techie","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Techie","Est. 
reading time":"11 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/#article","isPartOf":{"@id":"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/"},"author":{"name":"Techie","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d"},"headline":"10 DevOps Best Practices for Faster Software Delivery","datePublished":"2026-04-20T16:18:38+00:00","mainEntityOfPage":{"@id":"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/"},"wordCount":1957,"commentCount":0,"publisher":{"@id":"https:\/\/itsupportwale.com\/blog\/#organization"},"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/","url":"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/","name":"10 DevOps Best Practices for Faster Software Delivery - ITSupportWale","isPartOf":{"@id":"https:\/\/itsupportwale.com\/blog\/#website"},"datePublished":"2026-04-20T16:18:38+00:00","breadcrumb":{"@id":"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery-3\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/itsupportwale.com\/blog\/"},{"@type":"ListItem","position":2,"name":"10 DevOps Best 
Practices for Faster Software Delivery"}]},{"@type":"WebSite","@id":"https:\/\/itsupportwale.com\/blog\/#website","url":"https:\/\/itsupportwale.com\/blog\/","name":"ITSupportWale","description":"Tips, Tricks, Fixed-Errors, Tutorials &amp; Guides","publisher":{"@id":"https:\/\/itsupportwale.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/itsupportwale.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/itsupportwale.com\/blog\/#organization","name":"itsupportwale","url":"https:\/\/itsupportwale.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png","contentUrl":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png","width":1119,"height":144,"caption":"itsupportwale"},"image":{"@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/Itsupportwale-298547177495978"]},{"@type":"Person","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d","name":"Techie","sameAs":["https:\/\/itsupportwale.com","iswblogadmin"],"url":"https:\/\/itsupportwale.com\/blog\/author\/iswblogadmin\/"}]}},"_links":{"self":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts\/4766","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/
itsupportwale.com\/blog\/wp-json\/wp\/v2\/comments?post=4766"}],"version-history":[{"count":0,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts\/4766\/revisions"}],"wp:attachment":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/media?parent=4766"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/categories?post=4766"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/tags?post=4766"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}