{"id":4799,"date":"2026-05-26T23:29:01","date_gmt":"2026-05-26T17:59:01","guid":{"rendered":"https:\/\/itsupportwale.com\/blog\/devops-best-practices-guide\/"},"modified":"2026-05-26T23:29:01","modified_gmt":"2026-05-26T17:59:01","slug":"devops-best-practices-guide","status":"publish","type":"post","link":"https:\/\/itsupportwale.com\/blog\/devops-best-practices-guide\/","title":{"rendered":"DevOps Best Practices &#8211; Guide"},"content":{"rendered":"<p>The hum of the Dell PowerEdge R750s isn&#8217;t a lullaby; it\u2019s a funeral dirge. It is 3:14 AM. I am currently sitting on a milk crate in Data Center Floor 4, Row 12, because the &#8220;Rockstar&#8221; Lead Architect decided that &#8220;physical presence during a crisis fosters team synergy.&#8221; The synergy is currently at zero, much like our uptime for the last forty-eight hours. My coffee is cold, my eyes feel like they\u2019ve been scrubbed with steel wool, and I am staring at a terminal window that is screaming in ANSI color codes.<\/p>\n<p>This wasn&#8217;t supposed to happen. We were told that moving to Kubernetes v1.29.2 would solve our scaling issues. We were told that Helm v3.14.0 would make deployments &#8220;easy.&#8221; We were told that &#8220;devops best&#8221; practices were being followed. They weren&#8217;t. 
What we have instead is a smoking crater where our retail banking API used to be.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"H2_The_PagerDuty_Alert_That_Ended_My_Marriage\"><\/span>H2: The PagerDuty Alert That Ended My Marriage<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>It started with a single JSON payload hitting my phone at 2:00 AM on a Saturday. I was at my anniversary dinner. My wife looked at me, saw the blue light reflecting off my glasses, and knew. She didn&#8217;t even wait for me to speak. 
She just called the Uber.<\/p>\n<pre class=\"codehilite\"><code class=\"language-json\">{\n  &quot;alert_id&quot;: &quot;ALRT-9921-X-FAIL&quot;,\n  &quot;status&quot;: &quot;triggered&quot;,\n  &quot;service&quot;: &quot;legacy-payment-gateway-v2-final-REALLY-FINAL&quot;,\n  &quot;severity&quot;: &quot;CRITICAL&quot;,\n  &quot;summary&quot;: &quot;High Error Rate: 98.4% 5xx Responses in us-east-1&quot;,\n  &quot;details&quot;: {\n    &quot;threshold&quot;: &quot;5%&quot;,\n    &quot;current_value&quot;: &quot;98.4%&quot;,\n    &quot;impact&quot;: &quot;All transaction processing is halted. The CEO is already calling the CTO.&quot;\n  }\n}\n<\/code><\/pre>\n<p>I logged in from the back of the Uber, tethered to a shaky 5G connection. The first thing I did was <code>kubectl get pods -n production<\/code>. The output was a wall of red.<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">NAME                                          READY   STATUS             RESTARTS      AGE\npayment-api-6f5d8c7d4b-2w9zq                 0\/1     CrashLoopBackOff   42 (3m ago)   2h\npayment-api-6f5d8c7d4b-5k8lp                 0\/1     Error              12 (5m ago)   2h\npayment-api-6f5d8c7d4b-m9p2r                 0\/1     ImagePullBackOff   0             2h\npayment-api-6f5d8c7d4b-z4x1t                 0\/1     CrashLoopBackOff   38 (1m ago)   2h\ningress-nginx-controller-7f8b9c4d5e-vns2w    1\/1     Running            0             14d\n<\/code><\/pre>\n<p>I checked the events. 
<code>kubectl get events -n production --sort-by='.lastTimestamp'<\/code>.<\/p>\n<pre class=\"codehilite\"><code class=\"language-text\">2m14s       Warning   BackOff           pod\/payment-api-6f5d8c7d4b-2w9zq   Back-off restarting failed container\n2m15s       Normal    Pulling           pod\/payment-api-6f5d8c7d4b-m9p2r   Pulling image &quot;our-priv-reg.io\/fintech\/payment-api:latest&quot;\n2m16s       Warning   Failed            pod\/payment-api-6f5d8c7d4b-m9p2r   Failed to pull image &quot;our-priv-reg.io\/fintech\/payment-api:latest&quot;: rpc error: code = NotFound desc = failed to pull and unpack image &quot;our-priv-reg.io\/fintech\/payment-api:latest&quot;: no match for platform in manifest\n<\/code><\/pre>\n<p>The &#8220;Rockstar&#8221; dev, let\u2019s call him Kyle, had pushed a &#8220;quick fix&#8221; to the Helm chart. He didn&#8217;t use a versioned tag. He used <code>:latest<\/code>. And because he\u2019s a &#8220;visionary,&#8221; he decided to change the base image to an Alpine-based Go build that didn&#8217;t include the legacy C libraries our middleware requires. He bypassed the &#8220;devops best&#8221; practice of using immutable tags because &#8220;tags are for people who don&#8217;t trust their CI\/CD pipeline.&#8221;<\/p>\n<p>I trust my CI\/CD pipeline. I just don&#8217;t trust the people who write the YAML that defines it.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"H2_The_CICD_Pipeline_is_a_Pipe_Dream\"><\/span>H2: The CI\/CD Pipeline is a Pipe Dream<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Kyle\u2019s &#8220;quick fix&#8221; bypassed the staging environment because he had manually edited the Jenkinsfile to include a <code>[skip-stage]<\/code> flag he\u2019d invented. He thought he was being efficient. 
He thought he was &#8220;moving fast.&#8221; In reality, he was just bypassing the only safety net we had left.<\/p>\n<p>Our Jenkins instance, running on a bloated VM that hasn&#8217;t been patched since 2022, accepted his push. The pipeline looked like this in the <code>Jenkinsfile<\/code>:<\/p>\n<pre class=\"codehilite\"><code class=\"language-groovy\">stage('Deploy to Prod') {\n    when {\n        expression { return params.SKIP_STAGING == true }\n    }\n    steps {\n        sh &quot;helm upgrade --install payment-api .\/charts\/payment-api --namespace production --set image.tag=latest&quot;\n    }\n}\n<\/code><\/pre>\n<p>The &#8220;devops best&#8221; way to handle this is to use a GitOps operator like ArgoCD or Flux, where the state of the cluster is defined in a repository and reconciled automatically. But no, we had to use &#8220;Kyle-Ops.&#8221; Kyle-Ops involves running <code>helm upgrade<\/code> from a local machine or a rogue Jenkins runner with cluster-admin privileges.<\/p>\n<p>I looked at the <code>values.yaml<\/code> Kyle had modified. He had changed the <code>resources<\/code> block because he thought the JVM needed more &#8220;breathing room.&#8221;<\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\">resources:\n  limits:\n    cpu: &quot;200m&quot;\n    memory: &quot;512Mi&quot;\n  requests:\n    cpu: &quot;100m&quot;\n    memory: &quot;256Mi&quot;\n<\/code><\/pre>\n<p>For a legacy Java application that handles 5,000 transactions per second. In a financial institution. He set the memory limit to 512Mi. The JVM heap alone was configured for 2Gi in the environment variables. The result? The OOMKiller was having a field day. Every time a pod started, it tried to allocate memory, hit the cgroup limit defined by Kubernetes v1.29.2, and was promptly executed by the kernel.<\/p>\n<p>I tried to roll back. 
<code>helm rollback payment-api 142 -n production<\/code>.<\/p>\n<pre class=\"codehilite\"><code class=\"language-text\">Error: rollback failed: query: failed to find object to update: customresourcedefinitions.apiextensions.k8s.io &quot;paymentgateways.fintech.io&quot; not found\n<\/code><\/pre>\n<p>Kyle hadn&#8217;t just changed the image. He had deleted a Custom Resource Definition (CRD) that he thought was &#8220;redundant.&#8221; Now, the Helm release was in a <code>FAILED<\/code> state, and the Kubernetes API server didn&#8217;t know how to handle the orphaned resources. This is what happens when you ignore the &#8220;devops best&#8221; principle of schema validation and dry-runs. You end up with a cluster that is half-dead and refusing to be resurrected.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"H2_Infrastructure_as_Chaos_IaC\"><\/span>H2: Infrastructure as Chaos (IaC)<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>While I was fighting the Helm release, our cloud infrastructure started to dissolve. Apparently, another &#8220;rockstar&#8221; on the team, Sarah, decided that 3:00 AM was the perfect time to run a Terraform apply to &#8220;clean up some unused security groups.&#8221;<\/p>\n<p>We use Terraform v1.7.4. We are supposed to use a remote S3 backend with DynamoDB for state locking. Sarah, however, couldn&#8217;t get the lock because Kyle\u2019s failed Jenkins job had crashed while holding it. Instead of investigating why the lock was held, she used <code>-lock=false<\/code>.<\/p>\n<p>I saw the Slack notification from the Terraform Cloud integration. 
It was a massacre.<\/p>\n<pre class=\"codehilite\"><code class=\"language-text\">Terraform will perform the following actions:\n\n  # module.vpc.aws_route_table.public will be destroyed\n  - resource &quot;aws_route_table&quot; &quot;public&quot; {\n      - id = &quot;rtb-0a1b2c3d4e5f6g7h8&quot;\n      - vpc_id = &quot;vpc-05d8f9e0a1b2c3d4e&quot;\n      # (all other attributes omitted)\n    }\n\n  # module.eks.aws_eks_node_group.primary will be updated in-place\n  ~ resource &quot;aws_eks_node_group&quot; &quot;primary&quot; {\n      ~ desired_size = 20 -&gt; 2\n    }\n\nPlan: 0 to add, 1 to change, 14 to destroy.\n<\/code><\/pre>\n<p>She didn&#8217;t read the plan. She just typed <code>yes<\/code>.<\/p>\n<p>Suddenly, my SSH session to the jump box died. The public route table was gone. The EKS nodes were being terminated because she\u2019d accidentally changed the <code>desired_size<\/code> in the <code>terraform.tfvars<\/code> file while &#8220;cleaning up.&#8221;<\/p>\n<p>The &#8220;devops best&#8221; approach to Infrastructure as Code is not just &#8220;writing code.&#8221; It\u2019s about peer reviews, automated plan analysis, and never, ever, under any circumstances, bypassing state locks. If the lock is there, it\u2019s there for a reason. It\u2019s the universe\u2019s way of telling you that someone else is currently breaking the world and you should wait your turn.<\/p>\n<p>I had to use the AWS Console\u2014the ultimate shame for an SRE\u2014to manually recreate the route table and reattach it to the subnets just to get back into the environment. While I was clicking through the laggy UI, I could hear the phantom screams of a thousand YAML files being parsed and rejected.<\/p>\n<p>I finally got back in and checked the Terraform state. It was corrupted. Sarah\u2019s manual override had created a split-brain scenario where the state file thought the resources existed, but the AWS API knew they didn&#8217;t, or vice versa. 
I spent four hours running <code>terraform import<\/code> for thirty-two different resources.<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">terraform import module.vpc.aws_subnet.public_a subnet-0a1b2c3d4e5f6g7h8\nterraform import module.vpc.aws_subnet.public_b subnet-1b2c3d4e5f6g7h8a9\n# ... repeat until my fingers bleed\n<\/code><\/pre>\n<p>This is the reality of &#8220;NoOps.&#8221; It\u2019s not that there are no operations; it\u2019s that the operations are performed by people who don&#8217;t understand the underlying systems, leading to a state of permanent emergency.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"H2_Observability_is_Not_a_Dashboard\"><\/span>H2: Observability is Not a Dashboard<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>By 8:00 AM, the network was back, and the pods were no longer OOMKilled because I\u2019d manually patched the deployment to have sane resource limits. But the 5xx errors remained.<\/p>\n<p>I looked at our Grafana dashboard. It was beautiful. There were 500 panels with neon-colored lines showing CPU usage, memory pressure, and &#8220;Pod Restarts per Second.&#8221; Everything was green. Why? Because the &#8220;Rockstar&#8221; team had configured the dashboards to show <em>averages<\/em> over a 30-minute window. The 98% failure rate was being smoothed out by the 2% of successful health checks from the previous hour.<\/p>\n<p>&#8220;Look at the dashboard!&#8221; Kyle shouted over Zoom. &#8220;The metrics say we\u2019re fine!&#8221;<\/p>\n<p>&#8220;The metrics are lying to you, Kyle,&#8221; I whispered, my voice raspy from lack of sleep. &#8220;The metrics are a comfort blanket for people who are afraid of the logs.&#8221;<\/p>\n<p>I bypassed the dashboard and went straight to the source. 
I ran a <code>grep<\/code> on the Nginx ingress logs.<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">kubectl logs -n production ingress-nginx-controller-7f8b9c4d5e-vns2w | grep &quot;upstream timed out&quot; | head -n 20\n<\/code><\/pre>\n<p>The output was a stream of upstream timeouts.<\/p>\n<pre class=\"codehilite\"><code class=\"language-text\">2024\/05\/20 12:14:32 [error] 142#142: *104923 upstream timed out (110: Connection timed out) while connecting to upstream, client: 10.0.1.42, server: api.fintech.io, request: &quot;POST \/v1\/transactions HTTP\/1.1&quot;, upstream: &quot;http:\/\/10.0.2.15:8080\/v1\/transactions&quot;\n<\/code><\/pre>\n<p>The application wasn&#8217;t crashing anymore, but it wasn&#8217;t responding either. I checked the database connection pool. We use PostgreSQL 15.6 hosted on RDS. I logged into the DB and ran:<\/p>\n<pre class=\"codehilite\"><code class=\"language-sql\">SELECT count(*), state FROM pg_stat_activity GROUP BY state;\n<\/code><\/pre>\n<pre class=\"codehilite\"><code class=\"language-text\"> count | state\n-------+--------\n   498 | active\n     2 | idle\n<\/code><\/pre>\n<p>The connection pool was exhausted. Why? Because when Kyle changed the base image, he also &#8220;optimized&#8221; the connection string in the <code>ConfigMap<\/code>. He had removed the <code>timeout<\/code> and <code>tcpKeepAlive<\/code> parameters because he thought the &#8220;cloud handles that automatically.&#8221;<\/p>\n<p>The &#8220;devops best&#8221; practice for observability isn&#8217;t just about pretty graphs. It\u2019s about distributed tracing and deep instrumentation. If we had OpenTelemetry properly implemented, we would have seen the trace dying at the database driver layer. 
Instead, we were blind, staring at a Grafana panel that told us everything was &#8220;vibrant&#8221; and &#8220;seamless&#8221; when it was actually on fire.<\/p>\n<p>I had to manually kill the hanging sessions in Postgres to allow the application to reconnect.<\/p>\n<pre class=\"codehilite\"><code class=\"language-sql\">SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND now() - query_start &gt; interval '5 minutes';\n<\/code><\/pre>\n<p>As soon as I ran that, the 5xx errors dropped to 40%. Progress. But the root cause was still lurking in the YAML.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"H2_The_Culture_of_Blame_and_why_its_justified\"><\/span>H2: The Culture of Blame (and why it\u2019s justified)<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>At 2:00 PM, the Agile Coach scheduled a &#8220;Synchronous Alignment Sync.&#8221; I wanted to throw my laptop into a woodchipper.<\/p>\n<p>&#8220;We need to focus on a blameless post-mortem,&#8221; she said, her voice chirpy and devoid of the trauma of seeing a production database choke to death. &#8220;We need to understand the <em>process<\/em> failure, not the <em>individual<\/em> failure.&#8221;<\/p>\n<p>I\u2019ll tell you the process failure: we hired people who think &#8220;devops best&#8221; means &#8220;I can do whatever I want as long as I use a tool written in Go.&#8221;<\/p>\n<p>The cost of this outage is currently sitting at approximately $54,000 per minute. We\u2019ve been down, or partially down, for nearly 3,000 minutes. That\u2019s $162 million. You can buy a lot of &#8220;psychological safety&#8221; for $162 million.<\/p>\n<p>The &#8220;Move Fast and Break Things&#8221; mentality works when you\u2019re building a photo-sharing app for cats. It does not work when you are moving billions of dollars for people who need that money to pay their mortgages. 
In a legacy financial institution, &#8220;breaking things&#8221; is called &#8220;a regulatory nightmare.&#8221;<\/p>\n<p>The culture of DevOps is supposed to be about shared responsibility. But in reality, it often becomes a way for developers to throw half-baked YAML over a virtual wall and expect the SREs to catch it while it\u2019s on fire. They get the glory of the &#8220;feature launch,&#8221; and I get the 3:00 AM PagerDuty alert.<\/p>\n<p>I looked at the <code>deployment.yaml<\/code> again. I found another &#8220;gem&#8221; from Kyle.<\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\">readinessProbe:\n  httpGet:\n    path: \/health\n    port: 8080\n  initialDelaySeconds: 0\n  periodSeconds: 1\n  failureThreshold: 1\n<\/code><\/pre>\n<p>He set the <code>initialDelaySeconds<\/code> to 0 and the <code>failureThreshold<\/code> to 1. This meant that as soon as the container started, Kubernetes would hit the health check, and keep hitting it every second. If the Java app, which takes 45 seconds to start because of its 15-year-old Spring Framework, didn&#8217;t respond within one second, a single failed check was enough for Kubernetes to mark the pod as not ready and stop sending it traffic. Had it been a liveness probe, it would have been worse: the kubelet would have killed the container and restarted it.<\/p>\n<p>This compounded the <code>CrashLoopBackOff<\/code>. The restarts themselves came from the memory limits, since a failing readiness probe never restarts a container. But with a one-second period and a failure threshold of 1, any pod that did come up was pulled out of the Service endpoints the moment a single health check was slow. So it wasn&#8217;t just the memory limits. It was a fundamental misunderstanding of how Kubernetes manages application lifecycles. Kyle wanted &#8220;instant scaling,&#8221; so he removed the delays. He broke the very mechanism that ensures traffic only goes to healthy pods.<\/p>\n<p>This isn&#8217;t a &#8220;process failure.&#8221; This is a basic lack of technical competence masked by buzzwords.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"H2_Hard-Won_Wisdom_for_the_Next_Victim\"><\/span>H2: Hard-Won Wisdom for the Next Victim<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>It is now 3:45 AM. 
The system is stable, mostly because I\u2019ve locked everyone else\u2019s access to the production cluster. I have reverted the Terraform state, fixed the Helm charts, and manually adjusted the RDS connection limits. The YAML has stopped screaming. For now.<\/p>\n<p>If you want to actually implement &#8220;devops best&#8221; practices, here is the warning I\u2019m writing in these terminal logs:<\/p>\n<ol>\n<li><strong>Immutable Everything<\/strong>: If I see <code>:latest<\/code> in a production manifest again, I will personally revoke your git access. Use SHAs. Use semantic versioning. Know exactly what code is running on your servers.<\/li>\n<li><strong>Test Your YAML<\/strong>: YAML is not &#8220;just configuration.&#8221; It is the code that defines your infrastructure. Use <code>kube-linter<\/code>, use <code>checkov<\/code>, use <code>datree<\/code>. If your YAML doesn&#8217;t pass a linting and security scan, it shouldn&#8217;t get within ten miles of a <code>kubectl apply<\/code> command.<\/li>\n<li><strong>State is Sacred<\/strong>: Terraform state is the source of truth for your physical reality. Treat it with more respect than your own bank account. Never bypass locks. Never run manual applies from your laptop. Use a centralized, governed execution environment.<\/li>\n<li><strong>Observability Requires Empathy<\/strong>: Don&#8217;t build dashboards for yourself; build them for the person who has to debug your mess at 3:00 AM. Monitor the &#8220;Four Golden Signals&#8221;: Latency, Traffic, Errors, and Saturation. If your dashboard doesn&#8217;t show me why the database is crying, it\u2019s just digital wallpaper.<\/li>\n<li><strong>Stop Chasing Buzzwords<\/strong>: Kubernetes won&#8217;t save you. Helm won&#8217;t save you. Terraform won&#8217;t save you. They are just tools. 
If you don&#8217;t understand how the Linux kernel handles memory or how a TCP handshake works, you are just a script kiddie with a very expensive cloud bill.<\/li>\n<li><strong>StatefulSets are Not Your Friend<\/strong>: Unless you absolutely have to, don&#8217;t run databases in Kubernetes. I spent three hours of this outage trying to recover a PVC that got detached during the node scale-down. Kubernetes is great for ephemeral workloads. It is a nightmare for persistent state when things go sideways.<\/li>\n<\/ol>\n<p>The &#8220;devops best&#8221; way forward is boring. It\u2019s slow. It involves a lot of documentation and even more testing. It involves saying &#8220;no&#8221; to &#8220;rockstar&#8221; developers who want to use the latest alpha feature of a service mesh they heard about on a podcast.<\/p>\n<p>I\u2019m going to finish this coffee. It\u2019s cold and tastes like battery acid, but it\u2019s the only thing keeping me upright. In four hours, the &#8220;Agile Coach&#8221; will want a summary of the incident for the &#8220;Stakeholder Sync.&#8221; I\u2019ll give them a summary. I\u2019ll tell them that the YAML screamed, and nobody was listening.<\/p>\n<p>I\u2019ll tell them that &#8220;devops best&#8221; isn&#8217;t a goal you reach; it\u2019s a discipline you maintain. And right now, this institution is undisciplined, over-engineered, and one &#8220;quick fix&#8221; away from total collapse.<\/p>\n<p>But hey, at least our Grafana dashboards look &#8220;vibrant.&#8221; 
Let&#8217;s just say they look like a neon sign for a bar that\u2019s already gone bankrupt.<\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\"># Final check before I head home\nkubectl get pods -n production\n<\/code><\/pre>\n<pre class=\"codehilite\"><code class=\"language-text\">NAME                                          READY   STATUS    RESTARTS   AGE\npayment-api-7d8f9e0a1b-abc12                 1\/1     Running   0          4h\npayment-api-7d8f9e0a1b-def34                 1\/1     Running   0          4h\npayment-api-7d8f9e0a1b-ghi56                 1\/1     Running   0          4h\n<\/code><\/pre>\n<p>All green. For now. I\u2019m going home to sleep before the next &#8220;rockstar&#8221; wakes up and decides to &#8220;optimize&#8221; the ingress controller. God help us all.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Related_Articles\"><\/span>Related Articles<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Explore more insights and best practices:<\/p>\n<ul>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/javascript-best-practices-write-cleaner-efficient-code\/\">Javascript Best Practices Write Cleaner Efficient Code<\/a><\/li>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/netplan-static-ip-configure-static-ip-address-on-ubuntu-18-04\/\">Netplan Static Ip Configure Static Ip Address On Ubuntu 18 04<\/a><\/li>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/master-docker-compose-a-guide-to-multi-container-apps\/\">Master Docker Compose A Guide To Multi Container Apps<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>The hum of the Dell PowerEdge R750s isn&#8217;t a lullaby; it\u2019s a funeral dirge. It is 3:14 AM. 
I am currently sitting on a milk crate in Data Center Floor 4, Row 12, because the &#8220;Rockstar&#8221; Lead Architect decided that &#8220;physical presence during a crisis fosters team synergy.&#8221; The synergy is currently at zero, much &#8230; <a title=\"DevOps Best Practices &#8211; Guide\" class=\"read-more\" href=\"https:\/\/itsupportwale.com\/blog\/devops-best-practices-guide\/\" aria-label=\"Read more  on DevOps Best Practices &#8211; Guide\">Read more<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-4799","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.0 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>DevOps Best Practices - Guide - ITSupportWale<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/itsupportwale.com\/blog\/devops-best-practices-guide\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"DevOps Best Practices - Guide - ITSupportWale\" \/>\n<meta property=\"og:description\" content=\"The hum of the Dell PowerEdge R750s isn&#8217;t a lullaby; it\u2019s a funeral dirge. It is 3:14 AM. I am currently sitting on a milk crate in Data Center Floor 4, Row 12, because the &#8220;Rockstar&#8221; Lead Architect decided that &#8220;physical presence during a crisis fosters team synergy.&#8221; The synergy is currently at zero, much ... 
Read more\" \/>\n<meta property=\"og:url\" content=\"https:\/\/itsupportwale.com\/blog\/devops-best-practices-guide\/\" \/>\n<meta property=\"og:site_name\" content=\"ITSupportWale\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\" \/>\n<meta property=\"article:published_time\" content=\"2026-05-26T17:59:01+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2021\/05\/android-chrome-512x512-1.png\" \/>\n\t<meta property=\"og:image:width\" content=\"512\" \/>\n\t<meta property=\"og:image:height\" content=\"512\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Techie\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Techie\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"14 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/devops-best-practices-guide\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/devops-best-practices-guide\/\"},\"author\":{\"name\":\"Techie\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\"},\"headline\":\"DevOps Best Practices &#8211; 
Guide\",\"datePublished\":\"2026-05-26T17:59:01+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/devops-best-practices-guide\/\"},\"wordCount\":2279,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/devops-best-practices-guide\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/devops-best-practices-guide\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/devops-best-practices-guide\/\",\"name\":\"DevOps Best Practices - Guide - ITSupportWale\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\"},\"datePublished\":\"2026-05-26T17:59:01+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/devops-best-practices-guide\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/devops-best-practices-guide\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/devops-best-practices-guide\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/itsupportwale.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"DevOps Best Practices &#8211; Guide\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"name\":\"ITSupportWale\",\"description\":\"Tips, Tricks, Fixed-Errors, Tutorials &amp; 
Guides\",\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/itsupportwale.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\",\"name\":\"itsupportwale\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png\",\"contentUrl\":\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png\",\"width\":1119,\"height\":144,\"caption\":\"itsupportwale\"},\"image\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\",\"name\":\"Techie\",\"sameAs\":[\"https:\/\/itsupportwale.com\",\"iswblogadmin\"],\"url\":\"https:\/\/itsupportwale.com\/blog\/author\/iswblogadmin\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"DevOps Best Practices - Guide - ITSupportWale","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/itsupportwale.com\/blog\/devops-best-practices-guide\/","og_locale":"en_US","og_type":"article","og_title":"DevOps Best Practices - Guide - ITSupportWale","og_description":"The hum of the Dell PowerEdge R750s isn&#8217;t a lullaby; it\u2019s a funeral dirge. 
It is 3:14 AM. I am currently sitting on a milk crate in Data Center Floor 4, Row 12, because the &#8220;Rockstar&#8221; Lead Architect decided that &#8220;physical presence during a crisis fosters team synergy.&#8221; The synergy is currently at zero, much ... Read more","og_url":"https:\/\/itsupportwale.com\/blog\/devops-best-practices-guide\/","og_site_name":"ITSupportWale","article_publisher":"https:\/\/www.facebook.com\/Itsupportwale-298547177495978","article_published_time":"2026-05-26T17:59:01+00:00","og_image":[{"width":512,"height":512,"url":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2021\/05\/android-chrome-512x512-1.png","type":"image\/png"}],"author":"Techie","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Techie","Est. reading time":"14 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/itsupportwale.com\/blog\/devops-best-practices-guide\/#article","isPartOf":{"@id":"https:\/\/itsupportwale.com\/blog\/devops-best-practices-guide\/"},"author":{"name":"Techie","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d"},"headline":"DevOps Best Practices &#8211; Guide","datePublished":"2026-05-26T17:59:01+00:00","mainEntityOfPage":{"@id":"https:\/\/itsupportwale.com\/blog\/devops-best-practices-guide\/"},"wordCount":2279,"commentCount":0,"publisher":{"@id":"https:\/\/itsupportwale.com\/blog\/#organization"},"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/itsupportwale.com\/blog\/devops-best-practices-guide\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/itsupportwale.com\/blog\/devops-best-practices-guide\/","url":"https:\/\/itsupportwale.com\/blog\/devops-best-practices-guide\/","name":"DevOps Best Practices - Guide - 
ITSupportWale","isPartOf":{"@id":"https:\/\/itsupportwale.com\/blog\/#website"},"datePublished":"2026-05-26T17:59:01+00:00","breadcrumb":{"@id":"https:\/\/itsupportwale.com\/blog\/devops-best-practices-guide\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/itsupportwale.com\/blog\/devops-best-practices-guide\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/itsupportwale.com\/blog\/devops-best-practices-guide\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/itsupportwale.com\/blog\/"},{"@type":"ListItem","position":2,"name":"DevOps Best Practices &#8211; Guide"}]},{"@type":"WebSite","@id":"https:\/\/itsupportwale.com\/blog\/#website","url":"https:\/\/itsupportwale.com\/blog\/","name":"ITSupportWale","description":"Tips, Tricks, Fixed-Errors, Tutorials &amp; Guides","publisher":{"@id":"https:\/\/itsupportwale.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/itsupportwale.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/itsupportwale.com\/blog\/#organization","name":"itsupportwale","url":"https:\/\/itsupportwale.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png","contentUrl":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png","width":1119,"height":144,"caption":"itsupportwale"},"image":{"@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/Itsupportwale-298547177495978"]},{"@type":"Person","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/pe
rson\/8c5a2b3d36396e0a8fd91ec8242fd46d","name":"Techie","sameAs":["https:\/\/itsupportwale.com","iswblogadmin"],"url":"https:\/\/itsupportwale.com\/blog\/author\/iswblogadmin\/"}]}},"_links":{"self":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts\/4799","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/comments?post=4799"}],"version-history":[{"count":0,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts\/4799\/revisions"}],"wp:attachment":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/media?parent=4799"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/categories?post=4799"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/tags?post=4799"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}