INCIDENT #4092-B: THE TUESDAY TENSOR COLLAPSE
Status: Resolved (After 72 hours of manual intervention)
Severity: Critical (P0)
Duration: 72:14:08
Impact: Total failure of the recommendation engine, 45% drop in checkout conversion, 100% CPU saturation across the inference cluster, and three burnt-out SREs.
Timeline of Failure
- T-02:00 (Tuesday, 02:14 AM): Automated CI/CD pipeline triggers for the reco-engine-v4 deployment. The “Data Science” team pushes a new model artifact. They call it “The Oracle.” I call it a ticking time bomb.
- T-00:00 (Tuesday, 04:14 AM): Deployment hits production. Canary tests pass because the canary only checks for HTTP 200 OK. It doesn’t check whether the response body is a JSON-formatted scream into the void.
- T+00:15: Prometheus alerts fire. p99 latency on the inference service jumps from 45ms to 12,000ms.
- T+00:45: The first node dies. SIGKILL. The OOM killer is awake and it’s hungry.
- T+01:30: I am paged. I look at the logs. I see nothing but stack traces and broken dreams.
- T+04:00: We attempt a rollback. The rollback fails because the new model migrated the schema of the feature store in a non-backward-compatible way. We are stuck in the future, and the future is broken.
- T+12:00: We realize the container image pulled a nightly build of a core library.
- T+72:00: System stabilized after manual database surgery and a complete container registry purge.
1. The Initial Alert: Why the Prometheus Hooks Failed
Our monitoring is built for microservices, not for the black-box voodoo of modern machine learning. The Prometheus hooks were green. Why? Because the Python wrapper around the model was technically “healthy.” It was accepting requests. It was returning responses. It just happened to be taking twelve seconds to calculate a dot product that should take microseconds.
# Prometheus Scraping Log - 04:20:12
http_request_duration_seconds_bucket{le="0.5", service="reco-engine", status="200"} 0
http_request_duration_seconds_bucket{le="1.0", service="reco-engine", status="200"} 0
http_request_duration_seconds_bucket{le="10.0", service="reco-engine", status="200"} 2
http_request_duration_seconds_bucket{le="+Inf", service="reco-engine", status="200"} 4502
The health check endpoint was a static {"status": "ok"}. This is useless. If the model is blocked on a GIL lock or waiting for a GPU kernel to finish a mismanaged calculation, the health check needs to reflect that. We saw 100% CPU usage across 64 cores, yet the load balancer kept shoving traffic into the meat grinder.
The lesson here is that machine learning services require deep-health checks. We need to monitor the inference loop itself. If the time between “Request Received” and “Tensor Input Ready” exceeds a threshold, the node should be marked as tainted. We relied on “logic” that assumed the code would fail if it was broken. In the world of tensors, code doesn’t fail; it just slows down until the heat death of the universe.
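A deep-health check along these lines fits in a few dozen lines of Python. Everything below is a sketch with hypothetical names and thresholds, not our actual implementation; the point is only that the health endpoint should consult the inference loop's own progress instead of returning a static string.

```python
import time
import threading


class DeepHealthCheck:
    """Sketch of a health check that watches the inference loop itself,
    not just HTTP liveness. The node is marked unhealthy ("tainted") when
    an in-flight request has been stuck longer than the stall threshold.
    """

    def __init__(self, stall_threshold_s=2.0):
        self.stall_threshold_s = stall_threshold_s
        self._inflight_started = None   # monotonic time of current request
        self._lock = threading.Lock()

    def request_started(self):
        # Call this the moment a request is received.
        with self._lock:
            self._inflight_started = time.monotonic()

    def request_finished(self):
        # Call this when the tensor output is ready and the response is sent.
        with self._lock:
            self._inflight_started = None

    def is_healthy(self):
        """False if the current request has been stuck past the threshold."""
        with self._lock:
            if self._inflight_started is None:
                return True
            return (time.monotonic() - self._inflight_started) < self.stall_threshold_s
```

Wired into the serving wrapper, this would have marked our nodes tainted within seconds of the first twelve-second dot product, instead of letting the load balancer keep feeding them.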
We also found that our alerting was suppressed because the “error rate” was low. The model wasn’t throwing 500s. It was returning empty arrays [] at a record pace. To the load balancer, an empty array is a success. To the business, it’s a catastrophe.
2. Dependency Hell: When pip install Becomes a Suicide Note
I spent six hours tracing a segmentation fault that only happened on the A100 nodes. It turns out the “Data Science” team didn’t pin their requirements. Their requirements.txt had a line that just said torch. No version. No hash. No dignity.
When the build server ran at 02:00 AM, it pulled torch==2.0.0+cu117 (a nightly/early release) instead of the stable 1.13.1 we had validated in staging. This “magic” update decided to change how it interacted with the shared memory (/dev/shm) on our Kubernetes nodes.
# Stderr from reco-engine-7f4d9b8-x2z
ImportError: /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so:
undefined symbol: _ZNK3c104Type14isSubtypeOfExtESt10shared_ptrIS0_EPSt6vectorIS2_IS0_ESaIS5_EE
[CRITICAL] Worker process 42 exited with code 139 (SIGSEGV)
We were running Python 3.9.12. The container, however, had updated itself to a different patch version because the base image was tagged as python:3.9. Never use floating tags. Never. If you don’t pin the SHA256 hash of your base image, you aren’t practicing engineering; you’re gambling with my sleep schedule.
The “machine learning” ecosystem is a house of cards built on top of C++ binaries that hate each other. We found three different versions of numpy in the site-packages because of transitive dependencies. One library wanted 1.21, another wanted 1.23. Pip just picked one and hoped for the best. It failed. We need a locked, frozen, and audited poetry.lock or requirements.txt with hashes. If a single byte changes in the dependency tree, the build must die.
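A CI gate for this doesn't need to be sophisticated. The sketch below is illustrative, not our actual gate: it assumes pip-style requirement lines with `--hash` pins, rejects anything that isn't an exact version plus a SHA-256 hash, and bans nightly/`.dev` builds outright. The regex is deliberately strict.

```python
import re

# A line passes only if it pins an exact version AND a hash, e.g.:
#   torch==1.13.1 --hash=sha256:<64 hex chars>
PINNED = re.compile(
    r"^[A-Za-z0-9_.\-\[\]]+==[0-9][^ ]*\s+--hash=sha256:[0-9a-f]{64}"
)


def unpinned_requirements(lines):
    """Return the requirement lines that should fail the CI gate."""
    bad = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blanks and comments are fine
        if ".dev" in line or "nightly" in line.lower():
            bad.append(line)  # nightly builds are banned outright
        elif not PINNED.match(line):
            bad.append(line)  # no exact pin + hash -> reject
    return bad
```

If this list is non-empty, the build dies. A bare `torch` never reaches a build server again.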
3. Data Drift and the Silent Death of Accuracy
By Wednesday morning, the latency was under control, but the model was hallucinating. It was recommending winter coats to users in the middle of a Sahara heatwave. Why? Because the feature engineering pipeline had “drifted.”
The model was trained on a dataset where the user_location field was an ISO country code (e.g., “US”, “GB”). Some “genius” upstream decided to change the feature store to emit full names (e.g., “United States”, “Great Britain”). The model didn’t crash. It just looked at “United States,” saw a string it didn’t recognize, assigned it a default weight of 0.0000001, and started outputting garbage.
# The "Logic" found in the feature_extractor.py
def get_country_code(val):
    # No validation, no logging, just vibes.
    return mapping.get(val, 0)
This is the fundamental horror of machine learning. In a standard CRUD app, if you send a string instead of an int, the database screams. In machine learning, the tensors just absorb the wrong data and produce the wrong answers silently. There were no logs indicating that 99% of the inputs were hitting the default case in the mapping dictionary.
We need runtime schema validation. If the input distribution shifts by more than two standard deviations, I want an alarm. I want the system to shut down. I would rather the site be down than have it lie to our customers. We are now implementing Great Expectations or a similar validation layer on the ingress of every model. No more silent failures.
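The two-standard-deviation alarm can be approximated with nothing but the standard library. This is a sketch, not our production detector (the threshold and batch handling are simplified): it compares the mean of a live feature batch against a training-time baseline and fires when the shift exceeds the threshold.

```python
import statistics


def drift_alarm(baseline, live, sigma_threshold=2.0):
    """Fire when the live batch mean drifts more than `sigma_threshold`
    standard deviations away from the training-time baseline.

    baseline: numeric feature values sampled at training time
    live:     the same feature observed in production
    """
    base_mean = statistics.mean(baseline)
    base_std = statistics.stdev(baseline)
    if base_std == 0:
        # Constant feature at training time: any change at all is drift.
        return statistics.mean(live) != base_mean
    z = abs(statistics.mean(live) - base_mean) / base_std
    return z > sigma_threshold
```

Run per feature on a sliding window; when it fires, page a human and stop serving, per the rule above.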
4. The GPU OOM Crisis: Hardware Doesn’t Care About Your Math
At T+24, we thought we had it. Then the NVIDIA A100s started dropping like flies. We saw the dreaded RuntimeError: CUDA out of memory.
# Kernel Log from Node-04
[124092.45] nvidia-nvlink: Internal error: 0x12
[124092.46] NVRM: Xid (PCI:0000:01:00): 31, Ch 0000001f, ptr 00007000, envp 00000000
[124092.47] reco-engine invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0
The model was supposedly 2GB. The A100 has 80GB of VRAM. How do you blow 80GB? You do it by failing to manage the cache. The “Data Science” team had implemented a custom attention mechanism that didn’t use torch.no_grad() during inference. It was building a computational graph for every single request, holding onto every intermediate tensor, waiting for a backpropagation step that was never going to come.
It was a slow-motion car crash. Each request ate another 50MB of VRAM. After 1,600 requests, the card was full. Because we use a shared memory space for the multi-process inference server, one worker dying took out the entire pod.
We also found that the local NVMe caching was misconfigured. The model was trying to swap weights from the NVMe to GPU memory every time a specific “rare” feature was triggered. This caused a massive I/O wait inside a native extension that was still holding the GIL, which caused the Prometheus health checks to time out, which caused Kubernetes to restart the pod.
Hardware is not an abstraction. You cannot “cloud” your way out of memory management. If you are writing machine learning code, you are writing systems code. Act like it.
5. Observability is Not an Option: Logging the Latent Space
When I asked the team for logs, they pointed me to a dashboard that showed “Model Accuracy.” I don’t care about accuracy when the service is 503ing. I need to know what is happening inside the black box.
We had zero visibility into the latent space. We had no idea what the tensors looked like before they hit the final Softmax layer. We were debugging in the dark.
# What I wanted to see:
[DEBUG] RequestID: 882-af | Input Tensor Shape: [1, 512] | Mean: 0.002 | Std: 1.01
[DEBUG] RequestID: 882-af | Layer 12 Output | NaN detected: False
# What I actually saw:
[INFO] Processing request...
[INFO] Processing request...
[INFO] Processing request...
Segmentation fault (core dumped)
We need to log the “internal health” of the model. This doesn’t mean logging every weight—that’s insane. It means logging the statistics of the tensors. If the mean of our embeddings suddenly jumps from 0.05 to 500.0, something is wrong. If we start seeing NaN or Inf values, we need an immediate circuit breaker.
Furthermore, we need to correlate request IDs with model versions. We had three different versions of the model running across different shards because of a botched rolling update, and we couldn’t tell which model produced which error. Every response header must include the model’s Git hash and the weights’ S3 URI. No exceptions.
6. The New Standard: Hard Rules for Machine Learning Deployment
We are not doing this again. I’ve been awake for three days, and I’ve reached a level of clarity that only comes from pure, unadulterated spite. Here are the new rules for deploying anything that involves a matrix multiplication.
Rule 1: Hermetic Builds.
If it’s not in a locked Poetry file with SHA256 hashes, it doesn’t go to prod. If you use a base image like python:latest, I will revoke your SSH access. We use specific, versioned, and scanned images.
Rule 2: The “No-Grad” Mandate.
All inference code must be wrapped in a context manager that explicitly disables gradient calculation. We will also implement a memory-limit watchdog that kills any process exceeding its allocated VRAM before it can trigger a kernel-level OOM.
Rule 3: Schema or Death.
Every feature used by the model must have a strictly defined schema. We will use a validation layer (like Pydantic or Pandera) to check every input. If the input is “United States” and we expect “US”, the service returns a 400 Bad Request immediately. Do not pass go, do not pollute the tensors.
Rule 4: Canary with Brains.
Canary deployments will no longer just check for HTTP 200. They will run a “Golden Set” of 100 queries and compare the output distribution against the current production model. If the Kullback–Leibler (KL) divergence is too high, the deployment is automatically aborted.
Rule 5: Mandatory Instrumentation.
Every model must export internal metrics: tensor means, standard deviations, and the count of NaN/Inf values. We are integrating these into our standard Grafana dashboards.
Permanent Remediation
- Automated Dependency Auditing: We are implementing a CI gate that checks for unpinned dependencies and nightly builds.
- GPU Memory Guardrails: We are moving to a per-process memory limit using torch.cuda.set_per_process_memory_fraction.
- Feature Store Versioning: The feature store and the model are now atomically linked. You cannot update one without the other.
- SRE Training for Data Science: The “Data Science” team is being enrolled in a mandatory “Production Systems 101” course. They will learn what a SIGSEGV is and why it’s their fault.
- Decommissioning “The Oracle”: The model that caused this has been deleted. We are reverting to a simple, explainable heuristic until the team can prove they can handle the complexity of machine learning without setting the data center on fire.
I’m going home. Don’t page me unless the building is literally burning. And even then, check the logs first.
Signed,
Lead SRE, Infrastructure Recovery Team
Sent from my terminal at 04:30 AM