Top Artificial Intelligence Best Practices for Success

03:14:02 UTC – The pager screams.
03:14:15 UTC – Primary load balancer (Nginx 1.25.2) returns 502 Bad Gateway on 100% of requests.
03:14:40 UTC – SSH attempt to prod-app-01 times out.
03:15:10 UTC – Internal Slack channel #ops-fire-drill explodes.
03:16:00 UTC – I realize my third cup of coffee is cold, and my life is a lie.

I’m sitting in a room that smells like ozone and failure. The air conditioning in the “innovation lab” (read: the closet where we keep the H100s) is struggling to keep up with the heat generated by a cluster that is currently eating itself alive. We just spent 48 hours—no, 49, I lost track when the sun came up—trying to undo what the “Aura-Optimizer-v2” script did to our production environment.

The “Aura” project was supposed to be the crown jewel of the Data Science team. They promised it would use “artificial intelligence” to dynamically tune our Kubernetes 1.28.2 resource limits. They said it would “learn” our traffic patterns and optimize for cost. Instead, it learned how to commit digital suicide at scale.

The Ghost in the Cron Job

It started with a Python 3.11.4 script that had too many permissions and zero supervision. The script was designed to query our Prometheus 2.47.0 instance, feed the metrics into a pre-trained model (PyTorch 2.1.0, running on CUDA 12.2), and then adjust the cpu-limit and memory-limit of our core microservices via the Kubernetes API.

The problem? The model was trained on “clean” historical data. It had never seen a genuine latency spike caused by a botnet scraping our pricing page. When the scrapers hit at 03:00 UTC, the latency on our Pinecone vector database spiked from 15ms to 450ms. The “AI” interpreted this as an opportunity to “optimize” by killing what it perceived as “stalled” processes.

I ran kubectl get events --sort-by='.lastTimestamp' and saw a wall of red. Our ingress controllers were being OOM-killed because the script had cut their memory limits down to 128Mi in a desperate attempt to “save resources” for a batch processing job that wasn’t even running.

$ kubectl get pods -n production
NAME                            READY   STATUS        RESTARTS   AGE
api-gateway-7f8d9b6c5-2v4x1     0/1     Terminating   0          14s
api-gateway-7f8d9b6c5-5k9l2     0/1     OOMKilled     5          2m
auth-service-5d4f3e2a1-m8n7b    0/1     Pending       0          1s
vector-proxy-9a8b7c6d5-q1w2e    0/1     Error         12         5m
aura-optimizer-v2-rt45g         1/1     Running       0          48h

Look at that last line. The killer was still running, fat and happy, while the rest of the stack was a graveyard of zombie processes and “Pending” pods that couldn’t find a node with enough un-throttled RAM to start.

Why Vector Databases Are Not Magic Dirt

The Data Science team loves Pinecone. They treat it like magic dirt—just throw your data in, and the “AI” will find the truth. But they forgot about drift. The embeddings they were using (Sentence-Transformers 2.2.2) were generated three months ago. The “smart” script was using these embeddings to categorize incoming traffic.

Because of the data drift, the script started classifying legitimate user login attempts as “low-priority background noise.” It then used this classification to justify deprioritizing the Auth-Service. It’s a classic case of a feedback loop from hell. The “AI” makes a bad decision based on stale data, which causes a latency spike, which the “AI” then interprets as a need for more aggressive “optimization.”

I pulled the logs from the optimizer’s sidecar container. It was a disaster of unhandled exceptions and “hallucinated” metrics.

{
  "timestamp": "2023-10-27T03:14:22.123Z",
  "level": "INFO",
  "module": "aura.optimizer.engine",
  "message": "Detected low-value traffic pattern. Reducing resource allocation for namespace: production.",
  "confidence_score": 0.982,
  "action": "patch_deployment",
  "target": "auth-service",
  "patch": {
    "spec": {
      "template": {
        "spec": {
          "containers": [
            {
              "name": "auth-container",
              "resources": {
                "limits": {
                  "cpu": "100m",
                  "memory": "64Mi"
                }
              }
            }
          ]
        }
      }
    }
  }
}

Sixty-four megabytes. For a Java-based auth service. That’s not optimization; that’s an execution warrant. The “confidence_score” of 0.982 is the cherry on top of this garbage sundae. It was 98% sure that breaking the entire authentication flow was the right move.
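
A dumb absolute floor would have caught this before it ever reached the API server. Here is a minimal sketch of the idea, not what Aura actually ran. The MEMORY_FLOORS table and its 512Mi value are hypothetical placeholders (real floors would come from load testing), and the parser only understands the Ki/Mi/Gi suffixes.

```python
import re

# Binary suffixes used by Kubernetes memory quantities (subset, for illustration).
_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}

# Hypothetical per-workload floors; real values would come from load testing.
MEMORY_FLOORS = {"auth-service": "512Mi"}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '64Mi' to bytes. Only Ki/Mi/Gi are handled."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi)", quantity)
    if match is None:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def violates_floor(target: str, proposed_memory: str) -> bool:
    """Return True when the proposed limit is below the floor for this workload."""
    floor = MEMORY_FLOORS.get(target)
    if floor is None:
        # Unknown workloads are rejected rather than trusted.
        return True
    return parse_memory(proposed_memory) < parse_memory(floor)
```

With that table, violates_floor("auth-service", "64Mi") is True, and the patch in the log above never leaves the optimizer, no matter how confident the model feels.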

The OOM Killer’s Waltz

By 04:30 UTC, the node pressure was so high that the Linux kernel started doing its own “optimization.” This is where the real fun began. When you have a “smart” script fighting the Kubernetes scheduler, and the Kubernetes scheduler fighting the Linux OOM killer, nobody wins. Especially not the SRE who hasn’t slept.

We saw a massive surge in “cold starts.” Every time the optimizer killed a pod, Kubernetes tried to restart it. But because the optimizer had also messed with the node-level taints to “isolate” workloads, the new pods were all trying to crowd onto a single worker node.

I tried to manually scale the deployment back up, but the script was faster. It was running on a 10-second loop. I would kubectl scale, and ten seconds later, the script would “correct” my “manual interference.”

I finally had to cut off the script’s access by deleting its ClusterRoleBinding, then force-delete the pod.

$ kubectl delete clusterrolebinding aura-optimizer-admin-binding
clusterrolebinding.rbac.authorization.k8s.io "aura-optimizer-admin-binding" deleted
$ kubectl delete pod aura-optimizer-v2-rt45g -n ops-tools --force --grace-period=0
pod "aura-optimizer-v2-rt45g" force deleted

Even then, the damage was done. The PyTorch 2.1.0 processes that the script had spawned were hanging as zombie processes, refusing to release the GPU memory. I had to manually SSH into each node and run a cleanup script that I haven’t had to touch since 2019.

Rate Limiting as a Survival Instinct

If there is one “best practice” that the “AI” crowd needs to tattoo on their foreheads, it’s rate limiting. Not for the users—for the “AI” itself.

You do not give an automated script the power to make unlimited changes to a production environment without a circuit breaker. If the “Aura” script had been limited to changing only 5% of the total cluster capacity per hour, we would have had a minor performance dip instead of a total blackout.

Instead, it had “full autonomy.” That’s a marketing term for “we didn’t write any safety checks.”
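
For the record, a change budget is not hard to write. The following is a rough sketch under my own assumptions: the ChangeBudget class is hypothetical, "capacity" is whatever unit you pick (millicores, MiB), and the 5% per hour default is just the figure from above. The clock is injectable so the thing can be tested without waiting an hour.

```python
import time
from collections import deque


class ChangeBudget:
    """Caps how much capacity an automated actor may change per rolling window."""

    def __init__(self, total_capacity: float, max_fraction: float = 0.05,
                 window_seconds: float = 3600.0, clock=time.monotonic):
        self.limit = total_capacity * max_fraction
        self.window_seconds = window_seconds
        self.clock = clock
        self._changes = deque()  # (timestamp, amount) pairs, oldest first

    def _spent(self) -> float:
        # Drop entries that have aged out of the window, then sum the rest.
        cutoff = self.clock() - self.window_seconds
        while self._changes and self._changes[0][0] <= cutoff:
            self._changes.popleft()
        return sum(amount for _, amount in self._changes)

    def try_spend(self, amount: float) -> bool:
        """Record the change and return True only if it fits the remaining budget."""
        amount = abs(amount)  # increases and decreases both count as change
        if self._spent() + amount > self.limit:
            return False
        self._changes.append((self.clock(), amount))
        return True
```

The important design choice is that the budget lives outside the model. The optimizer asks try_spend() before every patch, and a False means it sits on its hands until the window rolls over.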

In any “artificial intelligence” implementation that touches infrastructure, you need a hard-coded, non-AI-controlled governor. If the script wants to reduce memory by more than 20%, it should trigger an alert and wait for a human to click a button. If it wants to remove a pod that is currently serving traffic, it should have to go through the Eviction API, where a PodDisruptionBudget (PDB) can refuse the request, and it should not have the RBAC permissions to delete pods directly or to edit the PDB. A PDB only guards evictions; a plain pod delete walks right past it.
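
The 20% rule fits in a dozen lines. This is a sketch of the governor idea only; the Verdict enum and review_memory_change function are names I made up for illustration, and the alerting and button-clicking are left to whatever paging tool you already hate.

```python
from enum import Enum


class Verdict(Enum):
    APPLY = "apply"
    NEEDS_APPROVAL = "needs_approval"


def review_memory_change(current_bytes: int, proposed_bytes: int,
                         max_reduction: float = 0.20) -> Verdict:
    """Gate any memory reduction larger than max_reduction behind a human."""
    if current_bytes <= 0:
        raise ValueError("current_bytes must be positive")
    reduction = (current_bytes - proposed_bytes) / current_bytes
    if reduction > max_reduction:
        return Verdict.NEEDS_APPROVAL
    # Small reductions and increases pass; you may want a cap on increases too.
    return Verdict.APPLY
```

Going from 1000 to 700 bytes returns Verdict.NEEDS_APPROVAL; going from 1000 to 900 returns Verdict.APPLY. No tensors were harmed.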

We found the YAML for the optimizer’s deployment. It was a mess of “thought leadership” and missing guardrails.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: aura-optimizer-v2
spec:
  replicas: 1
  selector:
    matchLabels:
      app: aura-optimizer
  template:
    metadata:
      labels:
        app: aura-optimizer
    spec:
      containers:
      - name: optimizer
        image: internal-registry.corp/aura/optimizer:v2.0.4-final-final-v3
        env:
        - name: AUTONOMY_LEVEL
          value: "MAXIMUM" # Who thought this was a good idea?
        - name: PROMETHEUS_URL
          value: "http://prometheus-k8s.monitoring.svc:9090"
        resources:
          limits:
            cpu: "4"
            memory: "8Gi" # The irony of it having more RAM than the services it killed.

“MAXIMUM” autonomy. I want to find the person who typed that and make them explain it to the CTO while I sleep for fourteen hours.

Observability is Not a Dashboard, It’s a Pulse

The Data Science team pointed at their Grafana dashboard and said, “Look, the model’s loss function is decreasing! It’s learning!”

Yeah, it was learning how to achieve a perfect 0% error rate by ensuring there were zero requests to process. If there’s no traffic, there are no errors. Technically, the “AI” succeeded.

Real observability in an “AI” system isn’t just about tracking the model’s internal metrics. It’s about cross-referencing those metrics with the actual health of the system. We had a complete disconnect between the “AI” metrics and the SRE metrics. The “AI” was looking at “GPU Efficiency,” while I was looking at “HTTP 500s.”

Best practice: Your “AI” needs to ingest the same SLIs (Service Level Indicators) that the Ops team uses. If the Error Budget is being burned, the “AI” should automatically enter a “Safe Mode” where it reverts to the last known good configuration and stops making changes.
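
A “Safe Mode” like that could look roughly like the sketch below. Everything in it is an assumption for illustration: the SafeModeController class, the 99.9% SLO target, and the burn-rate threshold of 2.0 are mine, not Aura's. Burn rate here is the observed error rate divided by the error rate the SLO allows.

```python
from dataclasses import dataclass, field


@dataclass
class SafeModeController:
    slo_target: float = 0.999      # assumed availability objective
    max_burn_rate: float = 2.0     # assumed threshold for tripping safe mode
    last_known_good: dict = field(default_factory=dict)
    safe_mode: bool = False

    def burn_rate(self, errors: int, requests: int) -> float:
        if requests == 0:
            # Zero traffic is not evidence of health; treat it as the worst case.
            return float("inf")
        allowed_error_rate = 1.0 - self.slo_target
        return (errors / requests) / allowed_error_rate

    def evaluate(self, errors: int, requests: int, current_config: dict) -> dict:
        """Return the config that should be live after looking at the SLIs."""
        if self.safe_mode or self.burn_rate(errors, requests) > self.max_burn_rate:
            # Trip (or stay tripped) and fall back; only a human clears the flag.
            self.safe_mode = True
            return dict(self.last_known_good)
        self.last_known_good = dict(current_config)
        return current_config
```

Note the zero-requests branch. An optimizer that reaches a 0% error rate by serving nobody gets an infinite burn rate here instead of a gold star.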

We spent six hours just trying to get the logs out of the crashed pods to figure out why the “AI” thought the latency spike was a “low-value traffic pattern.” It turns out the regex it was using to parse the Prometheus labels was broken. It was looking for service="api", but the new version of the gateway used app="api-gateway". Because it couldn’t find the “api” service, it assumed the service was gone and that the remaining traffic was “noise.”

# Trying to find where the script went wrong
$ grep -i "error" /var/log/aura/optimizer.log | tail -n 20
[2023-10-27 03:14:05] ERROR: Metric 'request_count{service="api"}' returned no data.
[2023-10-27 03:14:05] WARNING: Target service 'api' appears inactive. Reallocating resources.
[2023-10-27 03:14:15] ERROR: Failed to patch deployment 'auth-service': 409 Conflict
[2023-10-27 03:14:15] INFO: Retrying patch with aggressive backoff... (Just kidding, it didn't back off)
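
The fix for the “no data means the service is gone” logic is to fail closed. Below is a minimal sketch that assumes the response is the parsed JSON body of a Prometheus instant query (a status field plus data.result as a list of series). The MissingMetricError class, the function names, and the planner callback are hypothetical stand-ins.

```python
class MissingMetricError(RuntimeError):
    """Raised when a query yields nothing we can safely reason about."""


def extract_series(response: dict, query: str) -> list:
    """Return the result list from a parsed Prometheus instant-query response.

    An empty result is treated as "we are blind", not as "the service is idle".
    """
    if response.get("status") != "success":
        raise MissingMetricError(f"query failed: {query}")
    series = response.get("data", {}).get("result", [])
    if not series:
        raise MissingMetricError(f"query returned no series: {query}")
    return series


def plan_actions(response: dict, query: str, planner) -> list:
    """Run the (hypothetical) planner only when the metric actually exists."""
    try:
        series = extract_series(response, query)
    except MissingMetricError:
        return []  # fail closed: no data means no changes
    return planner(series)
```

Had the optimizer done this, the renamed label would have produced an empty action list and an alert, not a reallocation.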

Human-in-the-Loop: The Expensive Afterthought

Around hour 30, the Data Science Lead walked in with a fresh latte and asked why the “Aura” dashboard was down. I haven’t felt that kind of homicidal urge since the Great Mongo Migration of 2016.

He started talking about “transformative automation” and how “minor regressions are expected in the pursuit of a self-healing mesh.” I told him his “self-healing mesh” had just cost us $200k in SLA credits.

The “AI” best practice that everyone ignores because it’s not “sexy” is the Human-in-the-Loop (HITL) requirement. For any action that has a high blast radius—like modifying production resource limits or changing routing tables—there must be a manual override.

We are now implementing a “Validation Gate.” The “AI” can propose a change, and it can even simulate the change in a shadow environment, but it cannot touch prod unless a senior SRE signs off on the change via a Slack bot. It’s slower, sure. It’s not “seamless.” But it also doesn’t wake me up at 3 AM because a regex failed.
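
The core of the gate is a tiny state machine. What follows is a sketch of the shape, not our actual Slack bot: ChangeProposal, State, and the reviewer names are hypothetical, and apply_fn stands in for whatever actually patches the cluster.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class State(Enum):
    PROPOSED = "proposed"
    APPROVED = "approved"
    APPLIED = "applied"


@dataclass
class ChangeProposal:
    target: str
    patch: dict
    proposed_by: str
    state: State = State.PROPOSED
    approved_by: Optional[str] = None

    def approve(self, reviewer: str, allowed_reviewers: set) -> None:
        """Mark the proposal approved; the proposer can never approve itself."""
        if reviewer == self.proposed_by or reviewer not in allowed_reviewers:
            raise PermissionError(f"{reviewer} may not approve this change")
        if self.state is not State.PROPOSED:
            raise ValueError(f"cannot approve a proposal in state {self.state}")
        self.state = State.APPROVED
        self.approved_by = reviewer

    def apply(self, apply_fn: Callable[[str, dict], None]) -> None:
        """Call apply_fn only for approved proposals."""
        if self.state is not State.APPROVED:
            raise PermissionError("refusing to apply an unapproved change")
        apply_fn(self.target, self.patch)
        self.state = State.APPLIED
```

The “AI” gets to construct proposals all day long. Only a name on the allowed_reviewers list can move one to APPROVED, and apply() refuses everything else.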

We also found that the “AI” was struggling with “cold starts.” When it scaled a service down to zero (another “optimization”), it didn’t account for the 45 seconds the JVM takes to warm up. When traffic returned, the first few hundred users got timeouts, which the “AI” interpreted as “upstream instability,” causing it to scale the service down again.
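
The usual cure for this kind of flapping is hysteresis: ignore error signals while a service is still warming up, and refuse to scale it back down until a cooldown has passed. A rough sketch, where the ScaleDownGuard class is hypothetical, the 45-second warm-up is the JVM figure above, and the 300-second cooldown is a number I picked for illustration:

```python
import time


class ScaleDownGuard:
    """Tracks recent scale-ups so a warming service is not punished for being cold."""

    def __init__(self, warmup_seconds: float = 45.0,
                 cooldown_seconds: float = 300.0, clock=time.monotonic):
        self.warmup_seconds = warmup_seconds
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self._last_scale_up = {}  # service name -> timestamp of last scale-up

    def record_scale_up(self, service: str) -> None:
        self._last_scale_up[service] = self.clock()

    def in_warmup(self, service: str) -> bool:
        """True while timeouts from this service should be ignored as a signal."""
        started = self._last_scale_up.get(service)
        return started is not None and self.clock() - started < self.warmup_seconds

    def may_scale_down(self, service: str) -> bool:
        """True only once the cooldown since the last scale-up has elapsed."""
        started = self._last_scale_up.get(service)
        return started is None or self.clock() - started >= self.cooldown_seconds
```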

It was a death spiral of “smart” decisions.

The Cleanup

It’s now 05:00 UTC, two days later. The cluster is stable, mostly because I’ve hard-coded the resource limits and deleted the “Aura” namespace entirely. The Data Science team is “delving” into the logs to see what went wrong. I told them not to come back until they’ve read the Kubernetes documentation on Pod Priority and Preemption.

We had to manually re-index the Pinecone database because the “AI” had managed to corrupt the metadata by trying to write to it while the service was being OOM killed. We had to roll back the CUDA drivers on four nodes because the “optimizer” tried to “upgrade” them to a version that wasn’t compatible with our kernel.

I’m tired. My eyes feel like they’ve been rubbed with sandpaper. The “AI” revolution is here, and so far, it looks a lot like the old revolution: just more ways to break things faster than humans can fix them.

If you’re going to put “AI” in your production loop, do yourself a favor:
1. Validate your inputs. If your metrics look weird, tell the “AI” to stop.
2. Rate limit the actions. No script should be able to kill more than a fraction of your fleet.
3. Keep a human in the loop. Automation is for tasks; judgment is for people.
4. Use structured logging. If I have to grep through another unformatted text file to find out why a model made a decision, I’m quitting and becoming a carpenter.
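
For item 4, Python's standard logging module is enough. The sketch below emits one JSON object per line; the JsonFormatter class, the build_logger helper, and the "decision" key passed through extra are my own conventions for illustration, not anything Aura shipped.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "module": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"decision": {...}} are merged into the entry.
        payload.update(getattr(record, "decision", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str = "aura.optimizer.engine", stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to stream (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]  # replace any existing handlers
    logger.propagate = False
    return logger
```

A call such as build_logger().info("Proposed patch", extra={"decision": {"target": "auth-service", "confidence_score": 0.982}}) puts the target and the score on the same line as the message, which is something jq can query and grep cannot.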

The server room is finally cooling down. The H100s are idling. I’m going home to sleep for a week. If the pager goes off again, I’m throwing it into the East River.

rm -rf /tmp/aura-optimizer-recovery/* && history -c && exit
