{"id":4784,"date":"2026-05-09T21:29:44","date_gmt":"2026-05-09T15:59:44","guid":{"rendered":"https:\/\/itsupportwale.com\/blog\/artificial-intelligence-best-practices-a-complete-guide-3\/"},"modified":"2026-05-09T21:29:44","modified_gmt":"2026-05-09T15:59:44","slug":"artificial-intelligence-best-practices-a-complete-guide-3","status":"publish","type":"post","link":"https:\/\/itsupportwale.com\/blog\/artificial-intelligence-best-practices-a-complete-guide-3\/","title":{"rendered":"Artificial Intelligence Best Practices: A Complete Guide"},"content":{"rendered":"<p><strong>INTERNAL INCIDENT REPORT: RCA-2023-11-14-GEN-AI-COLLAPSE<\/strong><br \/>\n<strong>TO:<\/strong> Engineering Department, CTO, Product Management (Read it and weep)<br \/>\n<strong>FROM:<\/strong> Senior SRE (Incident Lead)<br \/>\n<strong>STATUS:<\/strong> CRITICAL \/ POST-MORTEM<br \/>\n<strong>SUBJECT:<\/strong> Mandatory &#8220;Artificial Intelligence&#8221; Implementation Standards following the 48-hour Cluster Death Spiral.<\/p>\n<p>I have spent the last 48 hours staring at Grafana dashboards that looked like a heart monitor flatlining. I haven&#8217;t showered, I\u2019ve consumed four liters of cold espresso, and I am currently holding my sanity together with the sheer force of my hatred for how this department handles &#8220;innovation.&#8221;<\/p>\n<p>The &#8220;GenAI-Assistant-v2&#8221; deployment didn&#8217;t just fail; it committed a murder-suicide that took out our entire production environment, including the legacy billing system and the customer-facing API. This happened because someone decided that &#8220;artificial intelligence&#8221; was a magic wand that didn&#8217;t need to follow the laws of thermodynamics or basic systems engineering.<\/p>\n<p>Here is the autopsy. 
If you ever deploy a model again without following these rules, I will personally revoke your SSH access and move your desk to the basement.<\/p>\n<hr \/>\n
<h3><span class=\"ez-toc-section\" id=\"1_THE_INCIDENT_TIMELINE_THE_ANATOMY_OF_A_CASCADING_FAILURE\"><\/span>1. THE INCIDENT TIMELINE: THE ANATOMY OF A CASCADING FAILURE<span class=\"ez-toc-section-end\"><\/span><\/h3>\n
<p>The failure began at 02:14 UTC when the &#8220;GenAI-Assistant-v2&#8221; service was pushed to the <code>p4d.24xlarge<\/code> cluster. The following logs represent the final moments of our stability.<\/p>\n
<p><strong>02:14:05 UTC &#8211; Initial Deployment<\/strong><\/p>\n
<pre class=\"codehilite\"><code class=\"language-bash\">$ kubectl get pods -n ai-services\nNAME                                     READY   STATUS    RESTARTS   AGE\ngenai-assistant-v2-7f8d9b6c5-x2z4l       1\/1     Running   0          45s\ngenai-assistant-v2-7f8d9b6c5-m9p1q       1\/1     Running   0          42s\n<\/code><\/pre>\n
<p><strong>02:16:12 UTC &#8211; The first sign of the VRAM leak. The Python 3.11.4 runtime begins fighting with the CUDA 12.2 driver.<\/strong><\/p>\n
<pre class=\"codehilite\"><code class=\"language-text\">[2023-11-14 02:16:12] ERROR:torch.cuda:OutOfMemoryError: CUDA out of memory. 
\nTried to allocate 12.50 GiB (GPU 0; 40.00 GiB total capacity; 32.15 GiB already allocated; \n5.12 GiB free; 34.00 GiB reserved in total by PyTorch) \nIf reserved memory is &gt;&gt; allocated memory try setting max_split_size_mb to avoid fragmentation.\n[2023-11-14 02:16:14] CRITICAL:uvicorn.error:Node 04 - Heartbeat failed. Process killed by OOM Killer.\n<\/code><\/pre>\n<p><strong>02:18:45 UTC &#8211; The &#8220;Smart Retry&#8221; logic kicks in. Because the &#8220;artificial intelligence&#8221; service was configured with an infinite retry loop on 5xx errors, it created a self-inflicted DDoS.<\/strong><\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">$ tail -f \/var\/log\/nginx\/access.log | grep &quot;503&quot;\n10.0.45.12 - - [14\/Nov\/2023:02:18:45 +0000] &quot;POST \/v1\/chat\/completions HTTP\/1.1&quot; 503 197 &quot;-&quot; &quot;python-requests\/2.31.0&quot;\n10.0.45.12 - - [14\/Nov\/2023:02:18:45 +0000] &quot;POST \/v1\/chat\/completions HTTP\/1.1&quot; 503 197 &quot;-&quot; &quot;python-requests\/2.31.0&quot;\n10.0.45.13 - - [14\/Nov\/2023:02:18:46 +0000] &quot;POST \/v1\/chat\/completions HTTP\/1.1&quot; 503 197 &quot;-&quot; &quot;python-requests\/2.31.0&quot;\n# ... 4,000 more lines per second ...\n<\/code><\/pre>\n<p><strong>02:22:10 UTC &#8211; The Vector Database (Pinecone-local-proxy) hits 100% CPU because the LLM is sending malformed, un-truncated embedding requests.<\/strong><\/p>\n<pre class=\"codehilite\"><code class=\"language-bash\">$ top -bn1 | grep &quot;vector-db&quot;\nPID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND\n8922 root      20   0   45.2g  38.1g   1.2g R  398.4  78.2   14:22.11 vector-db-engine\n<\/code><\/pre>\n<p>By 02:30 UTC, the entire <code>p4d.24xlarge<\/code> fleet was unresponsive. The control plane was so overwhelmed by the &#8220;artificial intelligence&#8221; service&#8217;s death rattles that we couldn&#8217;t even run <code>kubectl delete<\/code>. 
We had to manually power-cycle the instances via the AWS console like it was 2005.<\/p>\n<hr \/>\n
<h3><span class=\"ez-toc-section\" id=\"2_WHAT_WENT_WRONG_THE_%E2%80%9CINNOVATION%E2%80%9D_DELUSION\"><\/span>2. WHAT WENT WRONG: THE &#8220;INNOVATION&#8221; DELUSION<span class=\"ez-toc-section-end\"><\/span><\/h3>\n
<p>The root cause was a combination of hubris and technical illiteracy. The &#8220;artificial intelligence&#8221; team decided to use PyTorch 2.1.0 with a custom-compiled kernel that hadn&#8217;t been tested against our specific NVIDIA driver version. <\/p>\n
<ol>\n
<li><strong>VRAM Fragmentation<\/strong>: You treated GPU memory like a standard heap. It isn&#8217;t. The service loaded a 70B-parameter model in FP16 (roughly 140GB of weights, sharded across the node&#8217;s eight 40GB A100s) without quantization, leaving each GPU precious little headroom for activations and the KV cache. The moment the KV cache expanded during a long-context request, the per-GPU memory fragmented, and the service crashed.<\/li>\n
<li><strong>The &#8220;Smart&#8221; Retry Storm<\/strong>: Some genius implemented a &#8220;Retry-on-Failure&#8221; policy in the middleware using an &#8220;artificial intelligence&#8221; heuristic to &#8220;predict&#8221; when the service would be back up. It predicted wrong. It slammed the load balancer with 15,000 requests per second while the pods were still in a <code>CrashLoopBackOff<\/code> state.<\/li>\n
<li><strong>Dependency Hell<\/strong>: The service was running Python 3.11.4, but the base image was pulled from a &#8220;community&#8221; repo that included a conflicting version of <code>libcusparse.so.12<\/code>. This caused a silent memory leak in the background that didn&#8217;t show up in our standard Prometheus metrics until the node hit a hard lock.<\/li>\n
<li><strong>Unbounded Context Windows<\/strong>: There was no limit on the input token length. A user (or a bot) sent a 50,000-word prompt, and the &#8220;artificial intelligence&#8221; tried to process it. 
This spiked the memory usage on the A100s, leading to the OOM kill that started the domino effect.<\/p>\n<\/ol>\n<hr \/>\n
<h3><span class=\"ez-toc-section\" id=\"3_REMEDIATION_THE_MANDATORY_%E2%80%9CARTIFICIAL_INTELLIGENCE%E2%80%9D_BEST_PRACTICES\"><\/span>3. REMEDIATION: THE MANDATORY &#8220;ARTIFICIAL INTELLIGENCE&#8221; BEST PRACTICES<span class=\"ez-toc-section-end\"><\/span><\/h3>\n
<p>From this moment forward, these are not suggestions. They are requirements. If your PR does not meet these standards, it will be closed without comment.<\/p>\n
<h2><span class=\"ez-toc-section\" id=\"H2_DETERMINISTIC_RESOURCE_ALLOCATION_AND_GPU_ISOLATION\"><\/span>DETERMINISTIC RESOURCE ALLOCATION AND GPU ISOLATION<span class=\"ez-toc-section-end\"><\/span><\/h2>\n
<p>Stop treating GPUs like a shared pool of magic dust. Every service utilizing &#8220;artificial intelligence&#8221; must have hard resource limits defined in the manifest. <\/p>\n
<p>You will use <code>nvidia-smi<\/code> to profile your model&#8217;s peak memory usage under maximum context load. If your model requires 32GB of VRAM, you will limit the container to 34GB. No more, no less. We are moving to a strict one-pod-per-GPU architecture on our <code>p4d.24xlarge<\/code> instances. <\/p>\n
<p>Furthermore, you must implement <code>torch.cuda.empty_cache()<\/code> calls at logical boundaries in your inference loop. I don&#8217;t care if it adds 5ms of latency. I care that the node stays alive. If I see another &#8220;CUDA out of memory&#8221; error because you were too lazy to manage the garbage collector, you\u2019re off the project.<\/p>\n
<h2><span class=\"ez-toc-section\" id=\"H2_CIRCUIT_BREAKING_AND_THE_%E2%80%9CFAIL-FAST%E2%80%9D_PROTOCOL\"><\/span>CIRCUIT BREAKING AND THE &#8220;FAIL-FAST&#8221; PROTOCOL<span class=\"ez-toc-section-end\"><\/span><\/h2>\n
<p>The &#8220;Smart Retry&#8221; logic is dead. It is buried in a shallow grave. From now on, we use standard exponential backoff with jitter. 
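<\/p>\n<p>For the avoidance of doubt, this is the retry behavior I mean. A minimal sketch; the helper names and the five-attempt cap are illustrative, not an existing module of ours:<\/p>\n
<pre class=\"codehilite\"><code class=\"language-python\">import random\nimport time\n\ndef backoff_delay(attempt, base=0.5, cap=30.0):\n    # Upper bound doubles each attempt (0.5s, 1s, 2s, ...), capped at 30s.\n    # Full jitter draws uniformly from [0, bound] so clients desynchronize.\n    return random.uniform(0, min(cap, base * (2 ** attempt)))\n\ndef call_with_retries(fn, max_attempts=5):\n    for attempt in range(max_attempts):\n        try:\n            return fn()\n        except Exception:\n            if attempt == max_attempts - 1:\n                raise  # Give up and surface the error. No infinite loops. Ever.\n            time.sleep(backoff_delay(attempt))\n<\/code><\/pre>\n
<p>Five attempts, then the error goes back to the caller. 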
<\/p>\n
<p>If an &#8220;artificial intelligence&#8221; inference request takes longer than 5000ms, the circuit breaker must trip. We will return a 504 Gateway Timeout to the user rather than allowing the request to sit in a queue, holding onto VRAM and blocking other threads. <\/p>\n
<p>You will implement a <code>Hystrix<\/code>-style circuit breaker or an equivalent in our service mesh. If the error rate for the LLM service exceeds 5% over a 60-second window, the service must automatically shut down its ingress and allow the pods to stabilize. We do not &#8220;hope&#8221; the service recovers; we force it to recover.<\/p>\n
<h2><span class=\"ez-toc-section\" id=\"H2_MANDATORY_MODEL_QUANTIZATION_AND_VERSIONING\"><\/span>MANDATORY MODEL QUANTIZATION AND VERSIONING<span class=\"ez-toc-section-end\"><\/span><\/h2>\n
<p>Running raw FP16 models in production is a luxury we can no longer afford. Every model must be quantized to INT8 or 4-bit (using AWQ or GPTQ) unless you can provide a mathematical proof that the loss in precision will destroy the business logic.<\/p>\n
<p>Versioning is now strictly enforced.<br \/>\n&#8211; <strong>Python<\/strong>: 3.11.4 (No exceptions).<br \/>\n&#8211; <strong>PyTorch<\/strong>: 2.1.0.<br \/>\n&#8211; <strong>CUDA<\/strong>: 12.2.<br \/>\n&#8211; <strong>Transformers<\/strong>: 4.34.0.<\/p>\n
<p>If you want to upgrade a library, you must submit a 10-page performance regression report. We are not your playground for testing the latest beta releases from Hugging Face. We are a production environment.<\/p>\n
<h2><span class=\"ez-toc-section\" id=\"H2_VECTOR_DATABASE_INTEGRITY_AND_RATE_LIMITING\"><\/span>VECTOR DATABASE INTEGRITY AND RATE LIMITING<span class=\"ez-toc-section-end\"><\/span><\/h2>\n
<p>The vector database is not a dumping ground. The incident showed that we were sending 1536-dimension embeddings to the index without any validation. <\/p>\n
<p>Every request to the vector DB must be pre-validated. 
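<\/p>\n<p>Concretely, the gate looks something like this. A sketch using NumPy; the helper name is illustrative, and the 1536-dimension check matches our index:<\/p>\n
<pre class=\"codehilite\"><code class=\"language-python\">import numpy as np\n\nEXPECTED_DIM = 1536\n\ndef is_valid_embedding(vector):\n    # Reject anything that is not a finite, correctly-sized float vector.\n    arr = np.asarray(vector, dtype=np.float64)\n    if arr.ndim != 1 or arr.shape[0] != EXPECTED_DIM:\n        return False\n    return bool(np.isfinite(arr).all())  # catches NaN and +\/-Inf\n<\/code><\/pre>\n
<p>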
If the embedding vector contains <code>NaN<\/code> or <code>Inf<\/code> values\u2014which happened during the crash\u2014the request must be dropped immediately. <\/p>\n
<p>We are also implementing a hard rate limit on the embedding service. You will not be allowed to burst more than 200 requests per second per API key. If your &#8220;artificial intelligence&#8221; feature needs more than that, your architecture is inefficient, and you need to go back to the drawing board.<\/p>\n
<h2><span class=\"ez-toc-section\" id=\"H2_EXHAUSTIVE_TELEMETRY_AND_GPU-LEVEL_OBSERVABILITY\"><\/span>EXHAUSTIVE TELEMETRY AND GPU-LEVEL OBSERVABILITY<span class=\"ez-toc-section-end\"><\/span><\/h2>\n
<p>Our current monitoring is useless for &#8220;artificial intelligence&#8221; workloads. &#8220;CPU Usage&#8221; means nothing when the bottleneck is the PCIe bus bandwidth or the NVLink interconnect.<\/p>\n
<p>New dashboards are being rolled out. You are required to export the following metrics from your inference containers:<br \/>\n&#8211; <code>gpu_utilization_percentage<\/code><br \/>\n&#8211; <code>gpu_memory_used_bytes<\/code><br \/>\n&#8211; <code>gpu_temperature_celsius<\/code><br \/>\n&#8211; <code>token_generation_latency_ms<\/code><br \/>\n&#8211; <code>kv_cache_utilization_ratio<\/code><\/p>\n
<p>If your service does not export these metrics to Prometheus, it will be killed by a cron job every 10 minutes. I am not joking.<\/p>\n
<h2><span class=\"ez-toc-section\" id=\"H2_INPUT_SANITIZATION_AND_TOKEN_BUDGETING\"><\/span>INPUT SANITIZATION AND TOKEN BUDGETING<span class=\"ez-toc-section-end\"><\/span><\/h2>\n
<p>You wouldn&#8217;t accept a 10GB SQL injection attack, so why are you accepting unbounded text prompts? <\/p>\n
<p>Every &#8220;artificial intelligence&#8221; entry point must have a strict token budget. Use <code>tiktoken<\/code> or the relevant library to count tokens <em>before<\/em> they hit the inference engine. 
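<\/p>\n<p>The edge check is a few lines. A sketch; the whitespace split here is a crude stand-in for a real tokenizer such as <code>tiktoken<\/code>&#8217;s <code>get_encoding(&#8216;cl100k_base&#8217;).encode()<\/code>, and the function names are illustrative:<\/p>\n
<pre class=\"codehilite\"><code class=\"language-python\">def count_tokens(text):\n    # Stand-in for a real tokenizer. A whitespace split undercounts,\n    # so treat it as a lower bound only; use tiktoken in production.\n    return len(text.split())\n\ndef enforce_token_budget(prompt, budget=4096):\n    # Reject at the edge, before the prompt ever touches a GPU.\n    if count_tokens(prompt) &gt; budget:\n        raise ValueError('prompt exceeds token budget')\n    return prompt\n<\/code><\/pre>\n
<p>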
If the count exceeds the budget (e.g., 4096 tokens), the request is rejected at the edge. <\/p>\n
<p>Stop assuming the LLM will &#8220;handle it.&#8221; The LLM is a math equation, not a person. It will try to solve whatever garbage you give it until the hardware catches fire. You are the gatekeeper. Act like it.<\/p>\n<hr \/>\n
<h3><span class=\"ez-toc-section\" id=\"4_THE_%E2%80%9CNEVER_AGAIN%E2%80%9D_APPENDIX_SYSTEM_REQUIREMENTS\"><\/span>4. THE &#8220;NEVER AGAIN&#8221; APPENDIX: SYSTEM REQUIREMENTS<span class=\"ez-toc-section-end\"><\/span><\/h3>\n
<p>To ensure we never repeat the 48-hour &#8220;War Room&#8221; from hell, the following configuration constraints are now hard-coded into the CI\/CD pipeline.<\/p>\n
<h4><span class=\"ez-toc-section\" id=\"A_HARDWARE_CONSTRAINTS\"><\/span>A. HARDWARE CONSTRAINTS<span class=\"ez-toc-section-end\"><\/span><\/h4>\n
<ul>\n
<li><strong>Instance Type<\/strong>: <code>p4d.24xlarge<\/code> only for production inference.<\/li>\n
<li><strong>VRAM Limit<\/strong>: 85% of total capacity. The remaining 15% is reserved for system overhead and KV cache expansion.<\/li>\n
<li><strong>Storage<\/strong>: All model weights must be pre-loaded onto NVMe instance stores. No loading from S3 at runtime; pulling weights from S3 during the incident added a 10-minute cold-start delay that exacerbated the outage.<\/li>\n
<\/ul>\n
<h4><span class=\"ez-toc-section\" id=\"B_SOFTWARE_VERSION_LOCK\"><\/span>B. 
SOFTWARE VERSION LOCK<span class=\"ez-toc-section-end\"><\/span><\/h4>\n<table>\n<thead>\n<tr>\n<th style=\"text-align: left;\">Component<\/th>\n<th style=\"text-align: left;\">Required Version<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"text-align: left;\">Python<\/td>\n<td style=\"text-align: left;\">3.11.4<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left;\">PyTorch<\/td>\n<td style=\"text-align: left;\">2.1.0<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left;\">CUDA Driver<\/td>\n<td style=\"text-align: left;\">535.104.05<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left;\">CUDA Toolkit<\/td>\n<td style=\"text-align: left;\">12.2<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left;\">NCCL<\/td>\n<td style=\"text-align: left;\">2.18.3<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left;\">Triton<\/td>\n<td style=\"text-align: left;\">2.1.0<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h4><span class=\"ez-toc-section\" id=\"C_MONITORING_THRESHOLDS_ALERTS\"><\/span>C. MONITORING THRESHOLDS (ALERTS)<span class=\"ez-toc-section-end\"><\/span><\/h4>\n<ul>\n<li><strong>Warning<\/strong>: GPU VRAM &gt; 75% for 3 consecutive minutes.<\/li>\n<li><strong>Critical<\/strong>: GPU VRAM &gt; 90% for 30 seconds (Triggers automatic pod restart).<\/li>\n<li><strong>Critical<\/strong>: Inference Latency (P99) &gt; 10,000ms.<\/li>\n<li><strong>Critical<\/strong>: Error Rate (5xx) &gt; 2% of total traffic.<\/li>\n<\/ul>\n<h4><span class=\"ez-toc-section\" id=\"D_THE_%E2%80%9CHUMAN%E2%80%9D_REQUIREMENT\"><\/span>D. THE &#8220;HUMAN&#8221; REQUIREMENT<span class=\"ez-toc-section-end\"><\/span><\/h4>\n<p>Before any new &#8220;artificial intelligence&#8221; feature is enabled for more than 1% of traffic, the lead developer must sit in a room with the SRE team and explain, in detail, how the service handles a total loss of the GPU cluster. 
If the answer is &#8220;it shouldn&#8217;t happen,&#8221; the feature is denied.<\/p>\n<hr \/>\n<h3><span class=\"ez-toc-section\" id=\"FINAL_THOUGHTS\"><\/span>FINAL THOUGHTS<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>I am going home now. I am going to sleep for 14 hours. When I come back, I expect to see the &#8220;GenAI-Assistant-v2&#8221; repository scrubbed of its current &#8220;smart&#8221; logic and replaced with the deterministic, boring, and stable code I have outlined above.<\/p>\n<p>&#8220;Artificial intelligence&#8221; is just another service. It is not an excuse for sloppy engineering. It is not a reason to ignore 40 years of distributed systems best practices. It is a resource-heavy, unstable, and temperamental piece of software that needs to be caged, monitored, and treated with extreme suspicion.<\/p>\n<p>If you want to play with toys, go to a sandbox. If you want to run code in my production environment, follow the manual.<\/p>\n<p><strong>Signed,<\/strong><\/p>\n<p><em>The SRE who had to fix your mess.<\/em><\/p>\n<h2><span class=\"ez-toc-section\" id=\"Related_Articles\"><\/span>Related Articles<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Explore more insights and best practices:<\/p>\n<ul>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/latest-artificial-intelligence-news-top-trends-and-updates\/\">Latest Artificial Intelligence News Top Trends And Updates<\/a><\/li>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/what-is-aws-the-ultimate-guide-to-amazon-web-services\/\">What Is Aws The Ultimate Guide To Amazon Web Services<\/a><\/li>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/mastering-react-development-best-practices-for-2024\/\">Mastering React Development Best Practices For 2024<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>INTERNAL INCIDENT REPORT: RCA-2023-11-14-GEN-AI-COLLAPSE TO: Engineering Department, CTO, Product Management (Read it and weep) FROM: Senior SRE (Incident Lead) STATUS: 
CRITICAL \/ POST-MORTEM SUBJECT: Mandatory &#8220;Artificial Intelligence&#8221; Implementation Standards following the 48-hour Cluster Death Spiral. I have spent the last 48 hours staring at Grafana dashboards that looked like a heart monitor flatlining. I haven&#8217;t &#8230; <a title=\"Artificial Intelligence Best Practices: A Complete Guide\" class=\"read-more\" href=\"https:\/\/itsupportwale.com\/blog\/artificial-intelligence-best-practices-a-complete-guide-3\/\" aria-label=\"Read more  on Artificial Intelligence Best Practices: A Complete Guide\">Read more<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-4784","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.0 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Artificial Intelligence Best Practices: A Complete Guide - ITSupportWale<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/itsupportwale.com\/blog\/artificial-intelligence-best-practices-a-complete-guide-3\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Artificial Intelligence Best Practices: A Complete Guide - ITSupportWale\" \/>\n<meta property=\"og:description\" content=\"INTERNAL INCIDENT REPORT: RCA-2023-11-14-GEN-AI-COLLAPSE TO: Engineering Department, CTO, Product Management (Read it and weep) FROM: Senior SRE (Incident Lead) STATUS: CRITICAL \/ POST-MORTEM SUBJECT: Mandatory &#8220;Artificial Intelligence&#8221; Implementation Standards following the 48-hour Cluster Death Spiral. 
I have spent the last 48 hours staring at Grafana dashboards that looked like a heart monitor flatlining. I haven&#8217;t ... Read more\" \/>\n<meta property=\"og:url\" content=\"https:\/\/itsupportwale.com\/blog\/artificial-intelligence-best-practices-a-complete-guide-3\/\" \/>\n<meta property=\"og:site_name\" content=\"ITSupportWale\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\" \/>\n<meta property=\"article:published_time\" content=\"2026-05-09T15:59:44+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2021\/05\/android-chrome-512x512-1.png\" \/>\n\t<meta property=\"og:image:width\" content=\"512\" \/>\n\t<meta property=\"og:image:height\" content=\"512\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Techie\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Techie\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"9 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/artificial-intelligence-best-practices-a-complete-guide-3\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/artificial-intelligence-best-practices-a-complete-guide-3\/\"},\"author\":{\"name\":\"Techie\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\"},\"headline\":\"Artificial Intelligence Best Practices: A Complete Guide\",\"datePublished\":\"2026-05-09T15:59:44+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/artificial-intelligence-best-practices-a-complete-guide-3\/\"},\"wordCount\":1571,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/artificial-intelligence-best-practices-a-complete-guide-3\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/artificial-intelligence-best-practices-a-complete-guide-3\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/artificial-intelligence-best-practices-a-complete-guide-3\/\",\"name\":\"Artificial Intelligence Best Practices: A Complete Guide - 
ITSupportWale\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\"},\"datePublished\":\"2026-05-09T15:59:44+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/artificial-intelligence-best-practices-a-complete-guide-3\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/artificial-intelligence-best-practices-a-complete-guide-3\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/artificial-intelligence-best-practices-a-complete-guide-3\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/itsupportwale.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Artificial Intelligence Best Practices: A Complete Guide\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"name\":\"ITSupportWale\",\"description\":\"Tips, Tricks, Fixed-Errors, Tutorials &amp; 
Guides\",\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/itsupportwale.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\",\"name\":\"itsupportwale\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png\",\"contentUrl\":\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png\",\"width\":1119,\"height\":144,\"caption\":\"itsupportwale\"},\"image\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\",\"name\":\"Techie\",\"sameAs\":[\"https:\/\/itsupportwale.com\",\"iswblogadmin\"],\"url\":\"https:\/\/itsupportwale.com\/blog\/author\/iswblogadmin\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->"}