{"id":4483,"date":"2026-01-27T21:08:25","date_gmt":"2026-01-27T15:38:25","guid":{"rendered":"https:\/\/www.itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery\/"},"modified":"2026-01-27T21:08:25","modified_gmt":"2026-01-27T15:38:25","slug":"10-devops-best-practices-for-faster-software-delivery","status":"publish","type":"post","link":"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery\/","title":{"rendered":"10 DevOps Best Practices for Faster Software Delivery"},"content":{"rendered":"<pre class=\"codehilite\"><code>$ kubectl get pods -n prod\nNAME                                     READY   STATUS             RESTARTS         AGE\napi-gateway-v2-7f8d9b4c-xhq2z            0\/1     CrashLoopBackOff   42 (3m ago)      14h\norder-processor-66d5f4e3-99abc           0\/1     OOMKilled          12 (1m ago)      14h\npayment-service-55c2a1b0-zxy98           1\/1     Running            0                14h\nmarketing-tracker-88f123a4-bbbbb         1\/1     Running            0                14h\n\n$ kubectl logs -f api-gateway-v2-7f8d9b4c-xhq2z --previous\n{&quot;level&quot;:&quot;fatal&quot;,&quot;ts&quot;:1715432100.123,&quot;caller&quot;:&quot;main.go:45&quot;,&quot;msg&quot;:&quot;failed to connect to redis&quot;,&quot;error&quot;:&quot;dial tcp 10.96.0.15:6379: i\/o timeout&quot;}\n{&quot;level&quot;:&quot;info&quot;,&quot;ts&quot;:1715432105.456,&quot;msg&quot;:&quot;Attempting reconnection... (Attempt 43)&quot;}\npanic: runtime error: invalid memory address or nil pointer dereference\n[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x8a2f31]\n\n$ terraform plan\n\u2577\n\u2502 Error: Error acquiring the state lock\n\u2502\n\u2502 Error message: conditional check failed\n\u2502 Lock Info:\n\u2502   ID:        6e2f1a3b-8c9d-4e5f-a1b2-c3d4e5f6a7b8\n\u2502   Path:      prod-infrastructure\/terraform.tfstate\n\u2502   Operation: OperationTypePlan\n\u2502   Who:       jenkins-worker-04@ip-10-0-45-12\n\u2502   Version:   1.5.7\n\u2502   Created:   2024-05-11 02:14:05.123456 +0000 UTC\n\u2502   Info:\n\u2575\n<\/code><\/pre>\n<p>The sun is coming up. Or maybe it\u2019s going down. I can\u2019t tell. The blue light from my triple-monitor setup has burned a permanent rectangular ghost into my retinas. My hands smell like stale coffee and the cheap mechanical keyboard grease that accumulates after 72 hours of frantic typing.<\/p>\n<p>Three days ago, some &#8220;Growth Hacker&#8221; in a slim-fit suit decided we needed a &#8220;Flash Loyalty Reward Event&#8221; to coincide with a celebrity tweet. They didn&#8217;t tell Engineering. They didn&#8217;t tell SRE. They just pushed the &#8220;Go&#8221; button on a campaign that hit forty million users simultaneously.<\/p>\n<p>And now, here I am. Staring at a terminal that\u2019s screaming at me because our &#8220;elastic&#8221; infrastructure decided to snap like a dry twig. If this is what you call <strong>DevOps best practices<\/strong>, I\u2019m moving to a farm. I\u2019m done. I\u2019m writing this because if I don\u2019t, the next person who inherits this cluster will probably jump off the roof.<\/p>\n<h2>The YAML Hell We Built for Ourselves<\/h2>\n<p>We\u2019re running Kubernetes v1.29. It\u2019s supposed to be the pinnacle of container orchestration. Instead, it\u2019s a 1.5-million-line YAML suicide note. We\u2019ve abstracted the infrastructure so far away from the hardware that nobody knows where the packets actually go anymore.<\/p>\n<p>When the marketing spike hit, the <code>api-gateway<\/code> started failing its liveness probes. Why? Because the &#8220;DevOps Architects&#8221; decided that every pod needed fifteen sidecars for &#8220;observability,&#8221; &#8220;security,&#8221; and &#8220;service mesh magic.&#8221; By the time a request actually hits the application code, it\u2019s traveled through three proxies, been encrypted and decrypted four times, and had its headers bloated by 4KB of tracing metadata.<\/p>\n<p>The <code>kubelet<\/code> on worker node <code>ip-10-0-112-4<\/code> decided it had had enough. It went <code>NotReady<\/code>. Why? Because the <code>conntrack<\/code> table overflowed. We\u2019re pushing so many tiny, useless UDP packets for &#8220;telemetry&#8221; that the kernel literally forgot how to talk to the network.<\/p>\n<p>Look at this absolute garbage. This is the &#8220;standard&#8221; deployment manifest for the service that died first.<\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\"># This is the manifest that killed the cluster.\n# &quot;Optimized&quot; by a consultant who left six months ago.\napiVersion: apps\/v1\nkind: Deployment\nmetadata:\n  name: order-processor\nspec:\n  replicas: 50 # Marketing said we'd need &quot;scale&quot;\n  template:\n    spec:\n      containers:\n      - name: app\n        image: internal-repo\/order-processor:latest # Because versioning is for cowards\n        resources:\n          limits:\n            cpu: &quot;200m&quot; # Not enough to actually process an order\n            memory: &quot;256Mi&quot;\n          requests:\n            cpu: &quot;100m&quot;\n            memory: &quot;128Mi&quot;\n        livenessProbe:\n          httpGet:\n            path: \/healthz\n            port: 8080\n          initialDelaySeconds: 3\n          periodSeconds: 5 # Let's hammer the app while it's struggling to boot\n      - name: telemetry-sidecar\n        image: telemetry-vendor\/agent:v4.2.1\n        resources:\n          limits:\n            cpu: &quot;500m&quot; # The sidecar gets more CPU than the app. 
Brilliant.\n            memory: &quot;512Mi&quot;\n<\/code><\/pre>\n<p>I spent four hours yesterday just trying to figure out why the <code>order-processor<\/code> was OOMKilled. It turns out the &#8220;telemetry-sidecar&#8221; has a memory leak that triggers whenever the network latency exceeds 50ms. And since the network was saturated by the marketing spike, the sidecar ate all the node&#8217;s memory, the <code>kubelet<\/code> panicked, and the OOM Killer started executing random pods like a firing squad.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Terraform_v157_State_Drift_is_a_Lifestyle\"><\/span>Terraform v1.5.7: State 
Drift is a Lifestyle<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>We use Terraform v1.5.7 to manage our AWS environment. &#8220;Infrastructure as Code,&#8221; they said. &#8220;It\u2019ll be idempotent,&#8221; they said.<\/p>\n<p>Lies. All of it.<\/p>\n<p>When the outage started, I tried to scale the RDS instance. But I couldn&#8217;t. Why? Because someone (probably a &#8220;Full Stack&#8221; developer who thinks they know CloudFormation) manually changed the security groups in the AWS Console at 4:00 AM. Now the live infrastructure has drifted so far from the Terraform state that it\u2019s in a different zip code.<\/p>\n<p>Every time I run <code>terraform plan<\/code>, it wants to destroy the production database and recreate it because of a &#8220;forced replacement&#8221; on a tag. A TAG.<\/p>\n<p>Here is the HCL from our VPC module that currently manages the production database. It\u2019s a masterclass in how to build a cage you can\u2019t escape from.<\/p>\n<pre class=\"codehilite\"><code class=\"language-hcl\"># The &quot;Flexible&quot; VPC Module\nresource &quot;aws_db_instance&quot; &quot;main&quot; {\n  allocated_storage    = 100\n  engine               = &quot;postgres&quot;\n  engine_version       = &quot;15.3&quot;\n  instance_class       = &quot;db.m5.large&quot;\n  name                 = &quot;prod_db&quot;\n  username             = var.db_user\n  password             = var.db_pass # Stored in plain text in the tfvars file. Kill me.\n\n  # This block is the reason I haven't slept.\n  # Someone hardcoded the security group IDs instead of using data lookups.\n  replicate_source_db = null\n  vpc_security_group_ids = [\n    &quot;sg-0a1b2c3d4e5f6g7h8&quot;,\n    &quot;sg-9i0j1k2l3m4n5o6p7&quot; # This SG was deleted manually 3 months ago.\n  ]\n\n  lifecycle {\n    prevent_destroy = false # Why was this set to false in PROD?!\n  }\n}\n<\/code><\/pre>\n<p>I had to manually edit the JSON state file using <code>vim<\/code> while my hands were shaking from too much caffeine. 
Do you know how terrifying it is to manually delete a dependency line in a 14MB Terraform state file while the CEO is screaming in a Slack &#8220;War Room&#8221; channel? Everyone talks about <strong>devops best<\/strong> practices until the database starts dropping packets and the state lock is held by a Jenkins job that crashed two hours ago.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Why_Your_CICD_Pipeline_is_a_Rube_Goldberg_Machine\"><\/span>Why Your CI\/CD Pipeline is a Rube Goldberg Machine<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Our deployment pipeline is a nightmare of Jenkinsfiles, GitHub Actions, and &#8220;custom&#8221; Bash scripts that have more &#8220;if&#8221; statements than a choose-your-own-adventure novel. <\/p>\n<p>We have this &#8220;automated&#8221; rollback feature. It\u2019s supposed to detect a failure and revert the last commit. During the peak of the outage, the &#8220;automated&#8221; rollback triggered. But because the <code>api-gateway<\/code> was in a <code>CrashLoopBackOff<\/code>, the health check failed, which triggered another rollback. <\/p>\n<p>The system entered a recursive loop of rolling back to versions of the code that didn&#8217;t even exist in the container registry anymore. I had to kill the Jenkins master just to make it stop.<\/p>\n<p>Here is the &#8220;cleanup&#8221; script that runs after every failed deployment. It\u2019s a crime against humanity.<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">#!\/bin\/bash\n# &quot;Temporary&quot; fix for orphaned volumes. Added: Oct 2022.\n# Still here. 
Still breaking things.\n\necho &quot;Starting cleanup of orphaned resources...&quot;\n\n# Find all PVCs that are 'Pending' and delete them.\n# What could go wrong?\nkubectl get pvc --all-namespaces | grep Pending | awk '{print $2}' | xargs kubectl delete pvc\n\n# Force delete pods that are stuck in Terminating\n# Because we don't understand how finalizers work.\nfor pod in $(kubectl get pods --all-namespaces | grep Terminating | awk '{print $2}'); do\n    kubectl delete pod $pod --grace-period=0 --force\ndone\n\n# Check if the database is still alive. \n# If not, just restart the whole node group. \n# (Note: This actually happened during the outage).\nif ! curl -s --connect-timeout 2 http:\/\/db-internal:5432; then\n    echo &quot;DB unreachable. Nuking the worker nodes.&quot;\n    aws autoscaling terminate-instance-in-auto-scaling-group --instance-id $(curl -s http:\/\/169.254.169.254\/latest\/meta-data\/instance-id)\nfi\n<\/code><\/pre>\n<p>Look at that last part. If the database doesn&#8217;t respond to a <code>curl<\/code> on its port (which it won&#8217;t, because it&#8217;s Postgres, not an HTTP server), the script <em>terminates the instance it&#8217;s running on<\/em>. This script was written by a &#8220;Senior DevOps Engineer&#8221; who now works at a crypto startup. I spent two hours wondering why my SSH session kept dropping. It was because the system was literally committing suicide every five minutes.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Prometheus_v245_and_the_Lie_of_%E2%80%9CObservability%E2%80%9D\"><\/span>Prometheus v2.45 and the Lie of &#8220;Observability&#8221;<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>We have dashboards. Oh, we have so many dashboards. We have Grafana boards that look like the flight deck of the Space Shuttle. We\u2019re running Prometheus v2.45 with a sidecar for long-term storage. <\/p>\n<p>But here\u2019s the thing: when the system actually fails, the dashboards are the first thing to go. 
<\/p>\n<p>Prometheus couldn&#8217;t scrape the targets because the service discovery was failing. The service discovery was failing because the Kubernetes API server was overloaded. The API server was overloaded because every single pod was trying to report its own death simultaneously.<\/p>\n<p>I was flying blind. I had to use <code>tcpdump<\/code> on a worker node like it was 1998.<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\"># Trying to find where the 504s are coming from\n# while the world burns around me.\n# -l line-buffers stdout so grep sees matches as packets arrive.\nsudo tcpdump -l -i eth0 -A 'tcp port 80 and host 10.0.0.1' | grep &quot;HTTP\/1.1 504&quot;\n<\/code><\/pre>\n<p>The &#8220;Observability&#8221; stack we spent $200k on last year told me everything was &#8220;Green&#8221; for the first twenty minutes of the outage because the metrics were cached. By the time the alerts fired, the database was already a smoking crater.<\/p>\n<p>We followed the <strong>DevOps best-practices<\/strong> guide to the letter, and yet, here we are, staring at a 503 and a bunch of empty graphs. &#8220;Observability&#8221; is just a fancy word for &#8220;looking at the wreckage after the plane has already crashed.&#8221; It doesn&#8217;t prevent the crash. It just gives you a high-resolution video of the impact.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_Marketing-Driven_Death_Spiral\"><\/span>The Marketing-Driven Death Spiral<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The root cause wasn&#8217;t technical. It never is. The root cause was a Jira ticket titled &#8220;LTY-999: Implement Flash Rewards.&#8221; It was marked as &#8220;Low Effort&#8221; by a Product Manager who hasn&#8217;t seen a line of code since the Obama administration.<\/p>\n<p>They wanted a &#8220;real-time&#8221; leaderboard. To do this, the developers decided to bypass the cache and query the primary database directly every time a user refreshed their profile page.<\/p>\n<p>&#8220;It&#8217;s fine,&#8221; they said. 
&#8220;We&#8217;ll just use a read replica.&#8221;<\/p>\n<p>But they didn&#8217;t use a read replica. They used the primary. And they didn&#8217;t use an index. They did a full table scan on a table with 500 million rows. <\/p>\n<p>When the celebrity tweeted the link, 100,000 people clicked it. That\u2019s 100,000 full table scans per second. The RDS instance&#8217;s CPU went from 5% to 100% in three seconds. The IOPS hit the burst limit, and the EBS volume throttled. <\/p>\n<p>And because our &#8220;microservices&#8221; are all tightly coupled via synchronous REST calls (another &#8220;architectural&#8221; decision), the entire stack backed up. The <code>order-processor<\/code> waited for the <code>user-service<\/code>, which waited for the <code>leaderboard-service<\/code>, which was stuck waiting for the database. <\/p>\n<p>Every single thread in the entire cluster was blocked. And that\u2019s when the liveness probes started failing. Kubernetes, in its infinite &#8220;wisdom,&#8221; decided the pods were dead and started killing them. But the new pods couldn&#8217;t start because the database was still locked. <\/p>\n<p>It was a circular firing squad of &#8220;cloud-native&#8221; technology.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_%E2%80%9CDevOps%E2%80%9D_Best_Practices_Myth\"><\/span>The &#8220;DevOps&#8221; Best Practices Myth<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>I\u2019m tired of hearing about &#8220;DevOps.&#8221; In this company, &#8220;DevOps&#8221; just means &#8220;Operations people doing twice the work for the same pay while Developers get to play with new JS frameworks.&#8221;<\/p>\n<p>We\u2019re told to &#8220;Shift Left.&#8221; That just means I have to teach a 22-year-old how to write a Dockerfile so they don&#8217;t accidentally include their entire <code>Downloads<\/code> folder in the image. 
We\u2019re told to &#8220;Automate Everything.&#8221; That just means I have to maintain 5,000 lines of Python scripts that glue together tools that were never meant to talk to each other.<\/p>\n<p>Here\u2019s a script I wrote at 4:30 this morning to try to recover the corrupted Redis cache. It\u2019s ugly. It\u2019s dangerous. It\u2019s the opposite of &#8220;clean code.&#8221;<\/p>\n<pre class=\"codehilite\"><code class=\"language-python\">import redis\nimport time\n\n# This is a hack. If you are reading this, I am sorry.\n# The marketing-tracker service flooded Redis with 10GB of \n# session data that never expires. This script tries to \n# delete keys matching the pattern 'sess:*' without \n# killing the entire cluster.\n\nr = redis.Redis(host='redis-prod', port=6379, db=0)\n\ndef emergency_cleanup():\n    # None means &quot;not started yet&quot;; SCAN is finished once Redis hands back cursor 0.\n    cursor = None\n    while cursor != 0:\n        # Scan instead of KEYS to avoid blocking the event loop\n        # Although at this point, the loop is already dead.\n        cursor, keys = r.scan(cursor=cursor or 0, match='sess:*', count=1000)\n        if keys:\n            r.delete(*keys)\n            print(f&quot;Deleted {len(keys)} keys...&quot;)\n        # Sleep to give Redis a chance to breathe. 
\n        # As if a piece of software can breathe.\n        time.sleep(0.01)\n\nif __name__ == &quot;__main__&quot;:\n    print(&quot;Starting desperate recovery attempt...&quot;)\n    try:\n        emergency_cleanup()\n    except Exception as e:\n        print(f&quot;Even the recovery script failed: {e}&quot;)\n<\/code><\/pre>\n<p>Your <strong>DevOps best-practices<\/strong> strategy is just a fancy way of saying &#8220;make the SREs fix it.&#8221; We build these incredibly complex systems so we can feel smart, but we forget that someone has to stay awake for three days when the &#8220;smart&#8221; system decides to eat itself.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_Aftermath\"><\/span>The Aftermath<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The outage is &#8220;over.&#8221; I\u2019ve patched the security groups. I\u2019ve manually scaled the RDS instance to a <code>db.r5.4xlarge<\/code>, which is costing us $15 an hour. I\u2019ve killed the recursive Jenkins jobs.<\/p>\n<p>The Marketing team is celebrating. They got &#8220;record engagement.&#8221; They\u2019re probably having champagne right now. They have no idea that the only reason the site is back up is that I spent six hours manually editing a Terraform state file and writing a Python script to delete millions of Redis keys.<\/p>\n<p>They\u2019ll do it again next week. They\u2019ll find some new &#8220;engagement&#8221; metric to chase, and they\u2019ll push another &#8220;Low Effort&#8221; ticket that bypasses the staging environment because &#8220;it\u2019s just a minor change.&#8221;<\/p>\n<p>And I\u2019ll be here. Drinking cold coffee. Staring at <code>kubectl get pods -w<\/code>. Waiting for the pager to scream.<\/p>\n<p>If you\u2019re reading this leaked wiki page, do yourself a favor. If your company starts talking about &#8220;digital transformation&#8221; or &#8220;scaling their DevOps culture,&#8221; run. Go find a job where you manage a single Linux server in a closet. 
At least then, when it breaks, you know why. <\/p>\n<p>I\u2019m going home. I\u2019m turning off my phone. If the cluster dies again, let it. I\u2019ve given 72 hours of my life to a &#8220;Flash Loyalty Reward Event.&#8221; I have nothing left to give. <\/p>\n<p>The YAML is still there. The state drift is still there. The &#8220;temporary&#8221; bash scripts are still there. <\/p>\n<p>Everything is fine. Until it isn&#8217;t.<\/p>\n<hr \/>\n<p><strong>Post-Mortem Action Items (that will never be done):<\/strong><br \/>\n1. Fix the Terraform state hardcoding (Assigned to: Nobody).<br \/>\n2. Implement actual rate limiting on the API (Status: &#8220;In Backlog&#8221; for 2 years).<br \/>\n3. Stop letting Marketing push to Prod (Status: Rejected by VP of Growth).<br \/>\n4. Buy the SRE team a bottle of Scotch (Status: Not in budget).<\/p>\n<p>Now get out of my office. I need to sleep for a decade.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>text $ kubectl get pods -n prod NAME READY STATUS RESTARTS AGE api-gateway-v2-7f8d9b4c-xhq2z 0\/1 CrashLoopBackOff 42 (3m ago) 14h order-processor-66d5f4e3-99abc 0\/1 OOMKilled 12 (1m ago) 14h payment-service-55c2a1b0-zxy98 1\/1 Running 0 14h marketing-tracker-88f123a4-bbbbb 1\/1 
Running 0 14h $ kubectl logs -f api-gateway-v2-7f8d9b4c-xhq2z &#8211;previous {&#8220;level&#8221;:&#8221;fatal&#8221;,&#8221;ts&#8221;:1715432100.123,&#8221;caller&#8221;:&#8221;main.go:45&#8243;,&#8221;msg&#8221;:&#8221;failed to connect to redis&#8221;,&#8221;error&#8221;:&#8221;dial tcp 10.96.0.15:6379: i\/o timeout&#8221;} {&#8220;level&#8221;:&#8221;info&#8221;,&#8221;ts&#8221;:1715432105.456,&#8221;msg&#8221;:&#8221;Attempting reconnection&#8230; (Attempt 43)&#8221;} &#8230; <a title=\"10 DevOps Best Practices for Faster Software Delivery\" class=\"read-more\" href=\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery\/\" aria-label=\"Read more  on 10 DevOps Best Practices for Faster Software Delivery\">Read more<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-4483","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.0 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>10 DevOps Best Practices for Faster Software Delivery - ITSupportWale<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"10 DevOps Best Practices for Faster Software Delivery - ITSupportWale\" \/>\n<meta property=\"og:description\" content=\"text $ kubectl get pods -n prod NAME READY STATUS RESTARTS AGE api-gateway-v2-7f8d9b4c-xhq2z 0\/1 CrashLoopBackOff 42 (3m ago) 14h order-processor-66d5f4e3-99abc 0\/1 OOMKilled 12 (1m 
ago) 14h payment-service-55c2a1b0-zxy98 1\/1 Running 0 14h marketing-tracker-88f123a4-bbbbb 1\/1 Running 0 14h $ kubectl logs -f api-gateway-v2-7f8d9b4c-xhq2z &#8211;previous {&#8220;level&#8221;:&#8221;fatal&#8221;,&#8221;ts&#8221;:1715432100.123,&#8221;caller&#8221;:&#8221;main.go:45&#8243;,&#8221;msg&#8221;:&#8221;failed to connect to redis&#8221;,&#8221;error&#8221;:&#8221;dial tcp 10.96.0.15:6379: i\/o timeout&#8221;} {&#8220;level&#8221;:&#8221;info&#8221;,&#8221;ts&#8221;:1715432105.456,&#8221;msg&#8221;:&#8221;Attempting reconnection&#8230; (Attempt 43)&#8221;} ... Read more\" \/>\n<meta property=\"og:url\" content=\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery\/\" \/>\n<meta property=\"og:site_name\" content=\"ITSupportWale\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\" \/>\n<meta property=\"article:published_time\" content=\"2026-01-27T15:38:25+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2021\/05\/android-chrome-512x512-1.png\" \/>\n\t<meta property=\"og:image:width\" content=\"512\" \/>\n\t<meta property=\"og:image:height\" content=\"512\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Techie\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Techie\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"13 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery\/\"},\"author\":{\"name\":\"Techie\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\"},\"headline\":\"10 DevOps Best Practices for Faster Software Delivery\",\"datePublished\":\"2026-01-27T15:38:25+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery\/\"},\"wordCount\":1614,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery\/\",\"name\":\"10 DevOps Best Practices for Faster Software Delivery - 
ITSupportWale\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\"},\"datePublished\":\"2026-01-27T15:38:25+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-for-faster-software-delivery\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/itsupportwale.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"10 DevOps Best Practices for Faster Software Delivery\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"name\":\"ITSupportWale\",\"description\":\"Tips, Tricks, Fixed-Errors, Tutorials &amp; 
Guides\",\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/itsupportwale.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\",\"name\":\"itsupportwale\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png\",\"contentUrl\":\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png\",\"width\":1119,\"height\":144,\"caption\":\"itsupportwale\"},\"image\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\",\"name\":\"Techie\",\"sameAs\":[\"https:\/\/itsupportwale.com\",\"iswblogadmin\"],\"url\":\"https:\/\/itsupportwale.com\/blog\/author\/iswblogadmin\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->"}
Software Delivery"}]},{"@type":"WebSite","@id":"https:\/\/itsupportwale.com\/blog\/#website","url":"https:\/\/itsupportwale.com\/blog\/","name":"ITSupportWale","description":"Tips, Tricks, Fixed-Errors, Tutorials &amp; Guides","publisher":{"@id":"https:\/\/itsupportwale.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/itsupportwale.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/itsupportwale.com\/blog\/#organization","name":"itsupportwale","url":"https:\/\/itsupportwale.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png","contentUrl":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png","width":1119,"height":144,"caption":"itsupportwale"},"image":{"@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/Itsupportwale-298547177495978"]},{"@type":"Person","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d","name":"Techie","sameAs":["https:\/\/itsupportwale.com","iswblogadmin"],"url":"https:\/\/itsupportwale.com\/blog\/author\/iswblogadmin\/"}]}},"_links":{"self":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts\/4483","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/itsupportwale.com\/bl
og\/wp-json\/wp\/v2\/comments?post=4483"}],"version-history":[{"count":0,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts\/4483\/revisions"}],"wp:attachment":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/media?parent=4483"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/categories?post=4483"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/tags?post=4483"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}