Artificial Intelligence Best Practices: A Complete Guide

INTERNAL INCIDENT REPORT: RCA-2023-11-14-GEN-AI-COLLAPSE
TO: Engineering Department, CTO, Product Management (Read it and weep)
FROM: Senior SRE (Incident Lead)
STATUS: CRITICAL / POST-MORTEM
SUBJECT: Mandatory “Artificial Intelligence” Implementation Standards following the 48-hour Cluster Death Spiral.

I have spent the last 48 hours staring at Grafana dashboards that looked like a heart monitor flatlining. I haven’t showered, I’ve consumed four liters of cold espresso, and I am currently holding my sanity together with the sheer force of my hatred for how this department handles “innovation.”

The “GenAI-Assistant-v2” deployment didn’t just fail; it committed a murder-suicide that took out our entire production environment, including the legacy billing system and the customer-facing API. This happened because someone decided that “artificial intelligence” was a magic wand that didn’t need to follow the laws of thermodynamics or basic systems engineering.

Here is the autopsy. If you ever deploy a model again without following these rules, I will personally revoke your SSH access and move your desk to the basement.


1. THE INCIDENT TIMELINE: THE ANATOMY OF A CASCADING FAILURE

The failure began at 02:14 UTC when the “GenAI-Assistant-v2” service was pushed to the p4d.24xlarge cluster. The following logs represent the final moments of our stability.

02:14:05 UTC – Initial Deployment

$ kubectl get pods -n ai-services
NAME                                     READY   STATUS    RESTARTS   AGE
genai-assistant-v2-7f8d9b6c5-x2z4l       1/1     Running   0          45s
genai-assistant-v2-7f8d9b6c5-m9p1q       1/1     Running   0          42s

02:16:12 UTC – The first sign of the VRAM leak. The Python 3.11.4 runtime begins fighting with the CUDA 12.2 driver.

[2023-11-14 02:16:12] ERROR:torch.cuda:OutOfMemoryError: CUDA out of memory. 
Tried to allocate 12.50 GiB (GPU 0; 40.00 GiB total capacity; 32.15 GiB already allocated; 
5.12 GiB free; 34.00 GiB reserved in total by PyTorch) 
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
[2023-11-14 02:16:14] CRITICAL:uvicorn.error:Node 04 - Heartbeat failed. Process killed by OOM Killer.

02:18:45 UTC – The “Smart Retry” logic kicks in. Because the “artificial intelligence” service was configured with an infinite retry loop on 5xx errors, it created a self-inflicted DDoS.

$ tail -f /var/log/nginx/access.log | grep "503"
10.0.45.12 - - [14/Nov/2023:02:18:45 +0000] "POST /v1/chat/completions HTTP/1.1" 503 197 "-" "python-requests/2.31.0"
10.0.45.12 - - [14/Nov/2023:02:18:45 +0000] "POST /v1/chat/completions HTTP/1.1" 503 197 "-" "python-requests/2.31.0"
10.0.45.13 - - [14/Nov/2023:02:18:46 +0000] "POST /v1/chat/completions HTTP/1.1" 503 197 "-" "python-requests/2.31.0"
# ... 4,000 more lines per second ...

02:22:10 UTC – The Vector Database (Pinecone-local-proxy) hits 100% CPU because the LLM is sending malformed, un-truncated embedding requests.

$ top -bn1 | grep -E "PID|vector-db"
PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
8922 root      20   0   45.2g  38.1g   1.2g R  398.4  78.2   14:22.11 vector-db-engine

By 02:30 UTC, the entire p4d.24xlarge fleet was unresponsive. The control plane was so overwhelmed by the “artificial intelligence” service’s death rattles that we couldn’t even run kubectl delete. We had to manually power-cycle the instances via the AWS console like it was 2005.


2. WHAT WENT WRONG: THE “INNOVATION” DELUSION

The root cause was a combination of hubris and technical illiteracy. The “artificial intelligence” team decided to use PyTorch 2.1.0 with a custom-compiled kernel that hadn’t been tested against our specific NVIDIA driver version.

  1. VRAM Fragmentation: You treated GPU memory like a standard heap. It isn't. The service attempted to serve a 70B-parameter model in FP16 without proper quantization (see the arithmetic after this list). On an A100 with 40GB of VRAM, you have zero margin for error. The moment the KV cache expanded during a long-context request, the memory fragmented, and the service crashed.
  2. The “Smart” Retry Storm: Some genius implemented a “Retry-on-Failure” policy in the middleware using an “artificial intelligence” heuristic to “predict” when the service would be back up. It predicted wrong. It slammed the load balancer with 15,000 requests per second while the pods were still in a CrashLoopBackOff state.
  3. Dependency Hell: The service was running Python 3.11.4, but the base image was pulled from a “community” repo that included a conflicting version of libcusparse.so.12. This caused a silent memory leak in the background that didn’t show up in our standard Prometheus metrics until the node hit a hard lock.
  4. Unbounded Context Windows: There was no limit on the input token length. A user (or a bot) sent a 50,000-word prompt, and the “artificial intelligence” tried to process it. This spiked the memory usage on the A100s, leading to the OOM kill that started the domino effect.
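
Item 1 deserves actual numbers, since apparently nobody ran them before deploying. A back-of-the-envelope sketch (FP16 at 2 bytes per parameter; the layer/head shapes are illustrative for a 70B-class model with grouped-query attention, not pulled from our config):

# Back-of-the-envelope VRAM math. Assumptions: FP16 weights at 2 bytes per
# parameter; illustrative 70B-class shapes (80 layers, 8 KV heads, head dim
# 128, grouped-query attention); 8x A100-40GB per p4d.24xlarge.
params = 70e9
weights_gib = params * 2 / 2**30           # ~130 GiB of weights alone
print(f"Weights: {weights_gib:.0f} GiB total, "
      f"{weights_gib / 8:.1f} GiB per GPU when sharded 8 ways")

# KV cache per token = 2 (K and V) * n_layers * n_kv_heads * head_dim * 2 bytes
n_layers, n_kv_heads, head_dim = 80, 8, 128
kv_per_token = 2 * n_layers * n_kv_heads * head_dim * 2
ctx = 50_000                               # roughly the prompt that killed us
print(f"KV cache for one {ctx}-token request: "
      f"{kv_per_token * ctx / 2**30:.1f} GiB")

That is roughly 16 GiB of weights per GPU before a single request arrives, and one unbounded prompt adds another ~15 GiB of KV cache. The OOM at 02:16 was not bad luck; it was arithmetic.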

3. REMEDIATION: THE MANDATORY “ARTIFICIAL INTELLIGENCE” BEST PRACTICES

From this moment forward, these are not suggestions. They are requirements. If your PR does not meet these standards, it will be closed without comment.

3.1. DETERMINISTIC RESOURCE ALLOCATION AND GPU ISOLATION

Stop treating GPUs like a shared pool of magic dust. Every service utilizing “artificial intelligence” must have hard resource limits defined in the manifest.

You will use nvidia-smi to profile your model’s peak memory usage under maximum context load. If your model requires 32GB of VRAM, you will limit the container to 34GB. No more, no less. We are moving to a strict one-pod-per-GPU architecture on our p4d.24xlarge instances.

Furthermore, you must implement torch.cuda.empty_cache() calls at logical boundaries in your inference loop. I don’t care if it adds 5ms of latency. I care that the node stays alive. If I see another “CUDA out of memory” error because you were too lazy to manage the garbage collector, you’re off the project.
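
For those who need it spelled out, a minimal sketch of both requirements, assuming one pod per GPU (torch.cuda.set_per_process_memory_fraction is standard PyTorch API; the 0.85 matches the appendix limit):

import torch

# Hard-cap this process at 85% of the GPU's VRAM (the appendix limit).
# Allocations beyond the cap raise OutOfMemoryError instead of silently
# eating the headroom reserved for system overhead and KV cache growth.
torch.cuda.set_per_process_memory_fraction(0.85, device=0)

def run_inference(model, batch):
    with torch.inference_mode():
        output = model(batch)
    # Release cached allocator blocks at a logical boundary so fragmentation
    # doesn't accumulate across requests. Costs a few ms; keeps the node alive.
    torch.cuda.empty_cache()
    return output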

3.2. CIRCUIT BREAKING AND THE “FAIL-FAST” PROTOCOL

The “Smart Retry” logic is dead. It is buried in a shallow grave. From now on, we use standard exponential backoff with jitter.
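
For the avoidance of doubt, this is the only retry shape that will pass review: bounded attempts, capped exponential backoff, full jitter. A sketch (the error type and the knobs are placeholders, not policy):

import random
import time

class TransientError(Exception):
    """Stand-in for whatever your HTTP client raises on a 5xx."""

def retry_with_backoff(call, max_attempts=5, base=0.5, cap=30.0):
    # Capped exponential backoff with full jitter. No infinite loops. Ever.
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # give up loudly instead of DDoSing ourselves
            # Full jitter: sleep a random amount up to the capped exponential.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))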

If an “artificial intelligence” inference request takes longer than 5000ms, the circuit breaker must trip. We will return a 504 Gateway Timeout to the user rather than allowing the request to sit in a queue, holding onto VRAM and blocking other threads.

You will implement the Hystrix pattern or an equivalent in our service mesh. If the error rate for the LLM service exceeds 5% over a 60-second window, the service must automatically shut down its ingress and allow the pods to stabilize. We do not “hope” the service recovers; we force it to recover.
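
If your service mesh cannot do this, the logic fits in thirty lines. A sketch of a sliding-window breaker using the 5%/60-second numbers above (the cooldown and minimum sample count are illustrative):

import time
from collections import deque

class CircuitBreaker:
    """Trips open when the error rate exceeds 5% over a 60-second window."""

    def __init__(self, window_s=60, max_error_rate=0.05, cooldown_s=30):
        self.window_s = window_s
        self.max_error_rate = max_error_rate
        self.cooldown_s = cooldown_s
        self.events = deque()          # (timestamp, was_error)
        self.opened_at = None

    def _prune(self):
        cutoff = time.monotonic() - self.window_s
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def allow(self):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                return False           # still open: shed load, let pods stabilize
            self.opened_at = None      # half-open: let traffic probe again
        return True

    def record(self, was_error):
        self.events.append((time.monotonic(), was_error))
        self._prune()
        errors = sum(1 for _, e in self.events if e)
        # Require a minimum sample so one blip doesn't trip the breaker.
        if len(self.events) >= 20 and errors / len(self.events) > self.max_error_rate:
            self.opened_at = time.monotonic()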

3.3. MANDATORY MODEL QUANTIZATION AND VERSIONING

Running raw FP16 models in production is a luxury we can no longer afford. Every model must be quantized to INT8 or 4-bit (using AWQ or GPTQ) unless you can provide a mathematical proof that the loss in precision will destroy the business logic.
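
For reference, loading a pre-quantized 4-bit GPTQ checkpoint through transformers looks roughly like this (the model ID is a placeholder, and this assumes auto-gptq is installed, which the transformers GPTQ integration requires):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model ID -- substitute the team's approved 4-bit checkpoint.
model_id = "our-org/assistant-70b-gptq-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# transformers picks up the GPTQ quantization config baked into the
# checkpoint; no raw FP16 weights ever touch VRAM.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")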

Versioning is now strictly enforced:

  • Python: 3.11.4 (no exceptions).
  • PyTorch: 2.1.0.
  • CUDA: 12.2.
  • Transformers: 4.34.0.

If you want to upgrade a library, you must submit a 10-page performance regression report. We are not your playground for testing the latest beta releases from Hugging Face. We are a production environment.
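
CI will enforce the pins, but your service should refuse to boot on a drifted image anyway. A sketch of a startup guard matching the lock table in the appendix:

import sys

import torch

# Fail fast at boot if the runtime drifts from the version lock.
assert sys.version_info[:3] == (3, 11, 4), f"Python {sys.version} is not 3.11.4"
assert torch.__version__.startswith("2.1.0"), f"PyTorch {torch.__version__} != 2.1.0"
assert torch.version.cuda == "12.2", f"CUDA {torch.version.cuda} != 12.2"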

3.4. VECTOR DATABASE INTEGRITY AND RATE LIMITING

The vector database is not a dumping ground. The incident showed that we were sending 1536-dimension embeddings to the index without any validation.

Every request to the vector DB must be pre-validated. If the embedding vector contains NaN or Inf values—which happened during the crash—the request must be dropped immediately.
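
The validation is a dozen lines, which is a dozen more than we had during the incident. A sketch, assuming numpy and our 1536-dimension index:

import numpy as np

EXPECTED_DIM = 1536  # must match the index; anything else is garbage

def validate_embedding(vec) -> np.ndarray:
    arr = np.asarray(vec, dtype=np.float32)
    if arr.shape != (EXPECTED_DIM,):
        raise ValueError(f"expected shape ({EXPECTED_DIM},), got {arr.shape}")
    if not np.all(np.isfinite(arr)):
        # NaN/Inf vectors are what melted the vector DB. Drop them at the door.
        raise ValueError("embedding contains NaN or Inf")
    return arr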

We are also implementing a hard rate limit on the embedding service. You will not be allowed to burst more than 200 requests per second per API key. If your “artificial intelligence” feature needs more than that, your architecture is inefficient, and you need to go back to the drawing board.
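
The 200 req/s ceiling is a plain token bucket. A sketch of the policy (per-key bookkeeping and distributed enforcement belong to the gateway, not this snippet):

import time

class TokenBucket:
    """200 requests per second per API key, with a small burst allowance."""

    def __init__(self, rate=200.0, burst=200.0):
        self.rate = rate
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller gets a 429, not a queue slot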

3.5. EXHAUSTIVE TELEMETRY AND GPU-LEVEL OBSERVABILITY

Our current monitoring is useless for “artificial intelligence” workloads. “CPU Usage” means nothing when the bottleneck is the PCIe bus bandwidth or the NVLink interconnect.

New dashboards are being rolled out. You are required to export the following metrics from your inference containers:
  • gpu_utilization_percentage
  • gpu_memory_used_bytes
  • gpu_temperature_celsius
  • token_generation_latency_ms
  • kv_cache_utilization_ratio

If your service does not export these metrics to Prometheus, it will be killed by a cron job every 10 minutes. I am not joking.
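
A sketch of the exporter side using pynvml and prometheus_client; the port is arbitrary, and the two inference-loop metrics (token latency, KV-cache ratio) have to be set from your serving code, so only the GPU gauges are shown:

import time

import pynvml
from prometheus_client import Gauge, start_http_server

gpu_util = Gauge("gpu_utilization_percentage", "GPU utilization (%)")
gpu_mem = Gauge("gpu_memory_used_bytes", "GPU memory used (bytes)")
gpu_temp = Gauge("gpu_temperature_celsius", "GPU temperature (C)")

def export_gpu_metrics(poll_s=5):
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # one pod per GPU, remember
    start_http_server(9400)                        # scrape port is an assumption
    while True:
        gpu_util.set(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
        gpu_mem.set(pynvml.nvmlDeviceGetMemoryInfo(handle).used)
        gpu_temp.set(
            pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        )
        time.sleep(poll_s)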

3.6. INPUT SANITIZATION AND TOKEN BUDGETING

You wouldn’t accept a 10GB SQL injection attack, so why are you accepting unbounded text prompts?

Every “artificial intelligence” entry point must have a strict token budget. Use tiktoken or the relevant library to count tokens before they hit the inference engine. If the count exceeds the budget (e.g., 4096 tokens), the request is rejected at the edge.
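
At the edge, the check is trivial. A sketch with tiktoken (cl100k_base is an assumption; use the encoding that matches your model):

import tiktoken

MAX_TOKENS = 4096
enc = tiktoken.get_encoding("cl100k_base")  # pick your model's actual encoding

def enforce_token_budget(prompt: str) -> str:
    n = len(enc.encode(prompt))
    if n > MAX_TOKENS:
        # Reject at the edge. The inference engine never sees it.
        raise ValueError(f"prompt is {n} tokens; budget is {MAX_TOKENS}")
    return prompt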

Stop assuming the LLM will “handle it.” The LLM is a math equation, not a person. It will try to solve whatever garbage you give it until the hardware catches fire. You are the gatekeeper. Act like it.


4. THE “NEVER AGAIN” APPENDIX: SYSTEM REQUIREMENTS

To ensure we never repeat the 48-hour “War Room” from hell, the following configuration constraints are now hard-coded into the CI/CD pipeline.

A. HARDWARE CONSTRAINTS

  • Instance Type: p4d.24xlarge only for production inference.
  • VRAM Limit: 85% of total capacity. The remaining 15% is reserved for system overhead and KV cache expansion.
  • Storage: All model weights must be pre-loaded onto NVMe instance stores. No loading from S3 at runtime; pulling weights from S3 caused a 10-minute cold start that exacerbated the outage.

B. SOFTWARE VERSION LOCK

Component        Required Version
---------        ----------------
Python           3.11.4
PyTorch          2.1.0
CUDA Driver      535.104.05
CUDA Toolkit     12.2
NCCL             2.18.3
Triton           2.1.0

C. MONITORING THRESHOLDS (ALERTS)

  • Warning: GPU VRAM > 75% for 3 consecutive minutes.
  • Critical: GPU VRAM > 90% for 30 seconds (Triggers automatic pod restart).
  • Critical: Inference Latency (P99) > 10,000ms.
  • Critical: Error Rate (5xx) > 2% of total traffic.

D. THE “HUMAN” REQUIREMENT

Before any new “artificial intelligence” feature is enabled for more than 1% of traffic, the lead developer must sit in a room with the SRE team and explain, in detail, how the service handles a total loss of the GPU cluster. If the answer is “it shouldn’t happen,” the feature is denied.


FINAL THOUGHTS

I am going home now. I am going to sleep for 14 hours. When I come back, I expect to see the “GenAI-Assistant-v2” repository scrubbed of its current “smart” logic and replaced with the deterministic, boring, and stable code I have outlined above.

“Artificial intelligence” is just another service. It is not an excuse for sloppy engineering. It is not a reason to ignore 40 years of distributed systems best practices. It is a resource-heavy, unstable, and temperamental piece of software that needs to be caged, monitored, and treated with extreme suspicion.

If you want to play with toys, go to a sandbox. If you want to run code in my production environment, follow the manual.

Signed,

The SRE who had to fix your mess.
