```text
[2023-10-14 03:14:22.891] ERROR: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.50 GiB (GPU 0; 15.78 GiB total capacity; 11.20 GiB already allocated; 2.45 GiB free; 12.10 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-10-14 03:14:22.892] TRACEBACK:
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/app/models/transformer_v4_final_FINAL_v2.py", line 442, in forward
    x = self.attention(x)
  File "/app/utils/dirty_hacks.py", line 12, in attention
    return torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
[2023-10-14 03:14:23.004] CRITICAL: Worker process (PID 4421) exited with code 1.
[2023-10-14 03:14:23.005] MONITORING: Alert 'Inference_Service_Down' fired. Severity: P0.
[2023-10-14 03:14:23.005] LOG: Attempting to dump local variables to /tmp/crash_dump_031422.json...
[2023-10-14 03:14:23.006] ERROR: [Errno 28] No space left on device: '/tmp/crash_dump_031422.json'
```
I didn't get a call at 3:14 AM because the system was "intelligent." I got a call because the system was a bloated, unmonitored corpse of a project that finally stopped twitching.
For six months, the "Research Team" had been patting themselves on the back for their "machine learning" breakthroughs. They had a Jupyter Notebook that produced a beautiful ROC curve. They had a slide deck that promised 99.2% accuracy on fraud detection. What they didn't have was a single line of production-ready code, a stable environment, or any understanding of how data actually moves through a network.
This is the post-mortem of Project Sentinel. It’s a story about how "machine learning" is 10% math and 90% plumbing—and how, if you ignore the plumbing, you end up drowning in technical debt and broken CUDA kernels.
## The Dependency Hell of Python 3.8.10
The nightmare started with the environment. Or rather, the lack of one. When I was handed the repository, the `README.md` was a single sentence: "Run the notebook." There was no `requirements.txt`, no `pyproject.toml`, and certainly no `Dockerfile`.
I spent the first forty-eight hours trying to reconstruct the environment. The researchers had been using a mix of `conda` and `pip` on their local MacBooks, installing packages at random. When I finally managed to extract a `pip freeze` from one of their machines, it looked like a suicide note.
```text
# Partial output from the 'research_env_v1'
numpy==1.21.0
pandas==1.3.5
torch==1.10.0+cu111
torchvision==0.11.1+cu111
scikit-learn==0.24.2
scipy==1.7.3
matplotlib==3.4.3
# Why is this here?
tensorflow-gpu==2.6.0
# And this?
protobuf==3.19.1
# This version of urllib3 has a known CVE
urllib3==1.26.7
```

The first thing I did was burn it down. You cannot build a reliable “machine learning” pipeline on Python 3.8.10 in 2023. We migrated to Python 3.11.4. Why? Because the performance improvements in the runtime are non-negotiable when you’re processing 50,000 events per second.
We moved to PyTorch 2.1.0 to leverage the torch.compile feature, which promised a 20% speedup on our inference kernels. But moving versions isn’t just about changing a number in a file. It’s about the cascading failures of every sub-dependency. protobuf 3.20.3 doesn’t like tensorflow 2.14.0 unless you pin the specific C++ implementation. scikit-learn 1.3.0 changed the way it handles certain array inputs, breaking the preprocessing scripts that the researchers had “carefully” crafted.
If you aren’t pinning your versions to the third decimal point, you aren’t doing “machine learning”; you’re playing Russian Roulette with a fully loaded cylinder.
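One cheap way to make the pinning enforceable rather than aspirational is a startup guard that compares the running environment against the lockfile and refuses to boot on drift. Here is a minimal sketch (the script name and the idea of reading `requirements.lock` directly are illustrative, not something the original stack did):

```python
# check_pins.py -- a minimal sketch of an environment-drift guard.
# It compares installed package versions against the lockfile and exits
# non-zero if anything differs.
from importlib.metadata import version, PackageNotFoundError
import sys


def verify_pins(lockfile: str = "requirements.lock") -> None:
    mismatches = []
    with open(lockfile) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "==" not in line:
                continue
            name, pinned = line.split("==", 1)
            pinned = pinned.split("+")[0]  # drop CUDA build tags like "+cu121"
            try:
                installed = version(name).split("+")[0]
            except PackageNotFoundError:
                mismatches.append(f"{name}: pinned {pinned}, not installed")
                continue
            if installed != pinned:
                mismatches.append(f"{name}: pinned {pinned}, installed {installed}")
    if mismatches:
        sys.exit("Environment drift detected:\n" + "\n".join(mismatches))


if __name__ == "__main__":
    verify_pins()
```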
## Silent Failures in the Feature Store
Once the environment was stable enough to actually run a script without a ModuleNotFoundError, we hit the data. This is where the “math” people usually check out. They assume the data is a static CSV file that exists in a vacuum.
In reality, the data was a streaming mess of JSON blobs coming from a Kafka 3.6.0 cluster. The researchers had trained their model on a “cleaned” dataset. When I looked at the cleaning script, I found the smoking gun: data leakage. They were using the target_variable to calculate a rolling mean of the transaction_amount before the train-test split.
The model wasn’t learning to detect fraud. It was learning to read the future.
We had to implement a robust feature store using Redis 7.2.1 for low-latency lookups. Every feature had to be versioned. Every transformation had to be idempotent. We implemented a schema validation layer using Pydantic 2.4.2 to ensure that if a field changed from an int to a float in the upstream API, the pipeline would fail loudly and immediately rather than silently corrupting the model’s weights.
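A minimal sketch of what that Pydantic 2.x layer looks like (the field names mirror the metadata log below; the `txn_count_24h` field is illustrative, added here only to show the int-versus-float failure mode):

```python
# feature_schema.py -- a sketch of the strict validation layer.
from pydantic import BaseModel, ConfigDict, ValidationError


class TransactionFeatures(BaseModel):
    # strict=True disables implicit coercion: an upstream count that quietly
    # starts arriving as a float raises instead of being truncated to int.
    model_config = ConfigDict(strict=True, extra="forbid")

    avg_amount_1h: float
    geo_velocity: float
    is_vpn_proxy: bool
    txn_count_24h: int  # illustrative field, not in the metadata log below


def validate_event(raw: dict) -> TransactionFeatures:
    try:
        return TransactionFeatures.model_validate(raw)
    except ValidationError as exc:
        # Fail loudly: a schema break should page someone, not corrupt features.
        raise RuntimeError(f"Feature schema violation: {exc}") from exc
```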
Here is what our feature metadata log looked like after we fixed the ingestion:
```json
{
  "feature_set_version": "2.4.1",
  "timestamp": "2023-10-14T03:10:00Z",
  "source_kafka_topic": "transactions_v3",
  "schema_hash": "a1b2c3d4e5f6",
  "features": [
    {"name": "avg_amount_1h", "type": "float64", "null_count": 0},
    {"name": "geo_velocity", "type": "float64", "null_count": 12},
    {"name": "is_vpn_proxy", "type": "bool", "null_count": 0}
  ],
  "validation_status": "SUCCESS"
}
```
Without this metadata, you are flying blind. If your “machine learning” model starts acting up, the first question isn’t “did the weights drift?” It’s “did the definition of ‘average_spend’ change in the upstream SQL query?”
## The Pickling Nightmare and Serialization Debt
The researchers loved pickle. They pickled their models, they pickled their scalers, they even pickled their custom dictionary of hyperparameters.
pickle is a security disaster and a compatibility nightmare. When we tried to move the model from the training environment (Python 3.11.4) to the inference environment (a slimmed-down Debian Bookworm image), the unpickle operation failed. Why? Because one of the custom classes in the model architecture had been moved from utils/models.py to core/arch.py.
pickle doesn’t store the code; it stores a reference to the module path. If you change your folder structure, your model is a brick.
We spent three weeks refactoring the entire export process to use ONNX (Open Neural Network Exchange). We used torch.onnx.export to convert the PyTorch 2.1.0 models into a format that could be run by onnxruntime 1.16.1. This decoupled the model from the Python runtime entirely.
The “math” didn’t change. The weights were the same. But the plumbing changed from a fragile, path-dependent mess to a portable, high-performance artifact. We could now run inference in a C++ environment if we wanted to, bypassing the Python Global Interpreter Lock (GIL) and saving us 15ms of latency per request.
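On the serving side, loading the artifact takes a handful of lines. A sketch of the onnxruntime path (the file and tensor names match the export script shown later in this post; the provider list is an assumption based on the T4 nodes):

```python
# A sketch of the onnxruntime serving path.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "production_model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)


def predict(features: np.ndarray) -> np.ndarray:
    # features: shape (batch_size, 128), float32 -- the shape pinned at export time.
    outputs = session.run(["output"], {"input": features.astype(np.float32)})
    return outputs[0]
```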
## Infrastructure as an Afterthought: NVIDIA Driver 535.104.05
You haven’t known pain until you’ve debugged a CUDA kernel panic at 4:00 AM.
The Research Team had developed the model on local RTX 3090s. Production was running on Tesla T4s in the cloud. They assumed that because it was “all NVIDIA,” it would just work. It didn’t.
The production nodes were running NVIDIA Driver 535.104.05 with CUDA 12.2. The model had been compiled against CUDA 11.8. Usually, there’s backward compatibility, but a specific optimization in the attention mechanism—a custom Triton kernel—was throwing an invalid device function error.
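The cheap insurance here is an assertion at process start instead of a stack trace at 4:00 AM. A sketch of that check (the expected version string is the one we later standardized on; how strict the comparison should be depends on your compatibility policy):

```python
# A sketch of a startup assertion on the CUDA toolchain.
import torch

EXPECTED_CUDA = "12.2"


def assert_cuda_stack() -> None:
    if not torch.cuda.is_available():
        raise RuntimeError("No CUDA device visible to PyTorch")
    built_against = torch.version.cuda        # toolkit PyTorch was compiled against
    device = torch.cuda.get_device_name(0)    # e.g. "Tesla T4"
    if built_against != EXPECTED_CUDA:
        raise RuntimeError(
            f"PyTorch was built against CUDA {built_against}, "
            f"expected {EXPECTED_CUDA} (device: {device})"
        )


if __name__ == "__main__":
    assert_cuda_stack()
```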
We had to standardize the entire stack. Every developer, every CI runner, and every production node had to be synchronized. We moved to a base Docker image: nvidia/cuda:12.2.0-base-ubuntu22.04.
We also had to deal with the “OOM” (Out of Memory) errors. The researchers had set the batch size to 512 because “it worked on their 24GB cards.” The T4s only have 16GB. Instead of just lowering the batch size and killing throughput, we had to implement gradient accumulation and mixed-precision training using torch.cuda.amp.
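A sketch of what that training loop looks like (the model and data here are stand-ins; an accumulation factor of 4 with a per-step batch of 128 reproduces the researchers’ effective batch of 512 while fitting on a 16GB T4):

```python
# Gradient accumulation + mixed precision with torch.cuda.amp (sketch).
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 2)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler(enabled=device == "cuda")

accumulation_steps = 4
per_step_batch = 128

for step in range(accumulation_steps * 8):  # a few synthetic steps
    inputs = torch.randn(per_step_batch, 128, device=device)
    targets = torch.randint(0, 2, (per_step_batch,), device=device)

    with torch.cuda.amp.autocast(enabled=device == "cuda"):
        # Divide so the accumulated gradient matches a single big batch.
        loss = loss_fn(model(inputs), targets) / accumulation_steps

    scaler.scale(loss).backward()            # FP16-safe backward pass
    if (step + 1) % accumulation_steps == 0:
        scaler.step(optimizer)               # unscales grads, then optimizer.step()
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```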
This is the reality of “machine learning”. It’s not about the elegance of the loss function; it’s about whether your LD_LIBRARY_PATH is correctly pointing to libcudnn.so.8.
## The Logging Void and the Death of Reproducibility
When the system crashed (see the log at the top of this post), I went to check the logs. There were none. Or rather, there were millions of lines of print(df.shape) and print("here1") scattered across various stdout streams, but nothing useful.
No one was tracking the hyperparameters. No one was tracking the data version. No one was tracking the system metrics.
We implemented a three-tier logging strategy.
1. Application Logs: Using the standard Python logging module, configured to output JSON for ingestion by an ELK stack (a minimal sketch follows this list).
2. Experiment Tracking: Using MLflow 2.8.1. Every training run was required to log its git commit hash, its dataset hash, and its specific version of scikit-learn 1.3.0 (see the sketch after the alert rules below).
3. System Metrics: Using Prometheus and Grafana. We exported custom metrics using the prometheus_client 0.17.1.
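For the first tier, a minimal sketch of the JSON logging setup (the formatter here is hand-rolled for illustration; any structured-logging library that emits one JSON object per line works for ELK ingestion):

```python
# A hand-rolled JSON formatter, purely for illustration.
import json
import logging


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exc"] = self.formatException(record.exc_info)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("sentinel.inference").info("model loaded")
```

And the Prometheus alert rules backing the third tier: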
```yaml
# prometheus_rules.yml
groups:
  - name: ml_model_alerts
    rules:
      - alert: HighInferenceLatency
        expr: histogram_quantile(0.99, sum(rate(inference_latency_seconds_bucket[5m])) by (le)) > 0.5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Inference latency is too high on {{ $labels.instance }}"
      - alert: PredictionDrift
        expr: model_prediction_drift_score > 0.15
        for: 10m
        labels:
          severity: critical
```
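For the second tier, a sketch of how a training entry point can satisfy the provenance requirement (the tag names are a convention I am using here, not anything MLflow mandates):

```python
# A sketch of the provenance tags attached to every training run.
import hashlib
import subprocess

import mlflow
import sklearn


def start_tracked_run(dataset_path: str, params: dict) -> None:
    git_sha = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    with open(dataset_path, "rb") as fh:
        dataset_hash = hashlib.sha256(fh.read()).hexdigest()

    with mlflow.start_run():
        mlflow.set_tag("git_commit", git_sha)
        mlflow.set_tag("dataset_sha256", dataset_hash)
        mlflow.set_tag("sklearn_version", sklearn.__version__)
        mlflow.log_params(params)
        # ... training happens here; metrics go through mlflow.log_metric(...)
```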
If you can’t reproduce a result from six months ago using the same code and the same data, you aren’t doing science. You’re doing alchemy. And alchemy doesn’t belong in a production financial system.
## Post-Mortem Remediation: The Plumbing Manifesto
After the “Great Crash of October,” I was given carte blanche to fix the department. I didn’t hire more researchers. I hired two Site Reliability Engineers (SREs) who knew their way around a Linux kernel and a network switch.
We established the “Plumbing Manifesto” for all “machine learning” projects:
- Isolation is Mandatory: No more local development. Everything happens inside a devcontainer that mirrors the production environment (Python 3.11.4, CUDA 12.2).
- Validation is Continuous: We implemented a CI/CD pipeline using GitHub Actions that doesn’t just run unit tests. It runs “data tests.” It checks for null values, distribution shifts, and schema violations before a single weight is updated.
- Logging is a First-Class Citizen: If a model isn’t logging its confidence scores and its input feature hashes to a centralized store, it doesn’t get deployed.
- No More Pickles: All models must be exported to ONNX or TorchScript. No exceptions.
Here is the CI configuration we used to enforce this:
```yaml
name: ML_Pipeline_Validation
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    container:
      image: my-registry/ml-base:py311-cuda12.2
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: |
          pip install --upgrade pip
          pip install -r requirements.lock
      - name: Lint with Flake8
        run: flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
      - name: Run Data Validation
        run: python scripts/validate_data_schema.py --input data/sample.parquet
      - name: Test Model Export
        run: python scripts/test_onnx_export.py --model_path models/latest.pt
```
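The `validate_data_schema.py` step is deliberately boring. A sketch of the kind of checks it runs (the column names, dtypes, and thresholds here are illustrative, not the production list):

```python
# scripts/validate_data_schema.py -- a sketch of the "data test" the CI runs.
import argparse
import sys

import pandas as pd

EXPECTED_DTYPES = {
    "avg_amount_1h": "float64",
    "geo_velocity": "float64",
    "is_vpn_proxy": "bool",
}
MAX_NULL_FRACTION = 0.01


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)
    args = parser.parse_args()

    df = pd.read_parquet(args.input)
    errors = []

    for column, dtype in EXPECTED_DTYPES.items():
        if column not in df.columns:
            errors.append(f"missing column: {column}")
        elif str(df[column].dtype) != dtype:
            errors.append(f"{column}: expected {dtype}, got {df[column].dtype}")
        elif df[column].isna().mean() > MAX_NULL_FRACTION:
            errors.append(f"{column}: null fraction {df[column].isna().mean():.3f} too high")

    if errors:
        print("\n".join(errors), file=sys.stderr)
        return 1
    print("schema validation passed")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```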
We also had to address the “black box” problem. When the model flagged a transaction as fraud, the customer support team needed to know why. We integrated SHAP (SHapley Additive exPlanations) 0.43.0 into the inference pipeline. This added 40ms of latency, but it saved hundreds of hours in manual reviews. Again, this was a plumbing challenge—how to calculate SHAP values in a high-throughput environment without crashing the worker nodes.
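A sketch of one way to wire SHAP into the serving path, treating the ONNX model as a black box (the real integration needed batching, caching, and only explaining flagged transactions to keep the overhead near the 40ms figure above; nothing here is the exact production code):

```python
# Explaining flagged transactions with SHAP over the ONNX model (sketch).
import numpy as np
import onnxruntime as ort
import shap

session = ort.InferenceSession("production_model.onnx",
                               providers=["CPUExecutionProvider"])


def predict_proba(features: np.ndarray) -> np.ndarray:
    return session.run(["output"], {"input": features.astype(np.float32)})[0]


# Small background sample; in production this came from recent legitimate traffic.
background = np.random.randn(50, 128).astype(np.float32)
explainer = shap.KernelExplainer(predict_proba, background)


def explain(flagged: np.ndarray) -> np.ndarray:
    # nsamples bounds the number of model evaluations per explanation.
    return explainer.shap_values(flagged, nsamples=100)
```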
## The Nuances of ONNX Exports and Runtime Optimization
Let’s talk about why we chose onnxruntime 1.16.1 over just running the raw PyTorch model. When you run model(input) in PyTorch, you’re invoking a massive amount of Python overhead. For a single inference request, that might not matter. But when you’re scaling to thousands of requests per second, that overhead becomes a bottleneck.
The ONNX export process forces you to define your input shapes. This is a good thing. It prevents the “dynamic shape” bugs that plague “machine learning” models in production.
```python
import torch
import torch.onnx

from models.sentinel import FraudModel


def export_to_onnx():
    model = FraudModel()
    model.load_state_dict(torch.load("weights/best_v4.pt"))
    model.eval()

    dummy_input = torch.randn(1, 128, requires_grad=True)

    torch.onnx.export(
        model,
        dummy_input,
        "production_model.onnx",
        export_params=True,
        opset_version=17,
        do_constant_folding=True,
        input_names=['input'],
        output_names=['output'],
        dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}}
    )


if __name__ == "__main__":
    export_to_onnx()
```
By using opset_version=17, we gained access to more efficient implementations of the LayerNorm and Softmax operations. We then used the onnxruntime quantization tools to convert the model from FP32 to INT8. This reduced the model size from 450MB to 115MB and increased our throughput by 3.5x.
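Whether you calibrate (static quantization) or let the runtime quantize activations on the fly (dynamic), the entry point lives in `onnxruntime.quantization`. A minimal sketch of the dynamic variant, shown here only because it is the smallest working example:

```python
# A minimal sketch of the onnxruntime INT8 quantization entry point.
# A calibrated static pass is the more careful path for accuracy.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="production_model.onnx",
    model_output="production_model.int8.onnx",
    weight_type=QuantType.QInt8,  # weights stored as INT8
)
```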
The researchers complained that INT8 quantization might drop the accuracy by 0.01%. I told them that a model that is 0.01% more accurate but crashes the server is 100% useless.
## Managing the GPU Driver Nightmare
The final piece of the puzzle was the hardware interface. NVIDIA Driver 535.104.05 introduced several changes to how memory is managed between the host and the device. We found that our older monitoring scripts were reporting incorrect GPU utilization because they were relying on an outdated version of nvidia-ml-py.
We had to update our monitoring stack to use dcgm-exporter (Data Center GPU Manager) to get accurate, per-process memory and temperature readings. We integrated this into our Kubernetes 1.28 cluster using the NVIDIA Device Plugin.
This allowed us to implement “Taint and Toleration” logic. If a node’s GPU temperature exceeded 80°C, the orchestrator would stop scheduling new inference jobs to that node and drain the existing ones. This prevented the “silent throttling” that had been causing our p99 latency to spike randomly in the afternoons.
## Conclusion: The Math is the Easy Part
“Machine learning” is a discipline that has been hijacked by people who love algorithms but hate systems. They want to talk about stochastic gradient descent and transformer architectures, but they don’t want to talk about why their requirements.txt is missing scipy.
The project that almost cost me my sanity didn’t fail because the math was wrong. It failed because the plumbing was non-existent. It failed because of a pickle version mismatch, a missing ldconfig entry, and a silent data drift that no one was monitoring.
If you want to build a “machine learning” system that actually works, stop looking at the ROC curve for five minutes and look at your logs. Check your pip freeze. Verify your CUDA version. Test your ONNX export.
Because when the system fails at 3:00 AM, the math won’t save you. Only your plumbing will.
```text
[2023-11-01 10:00:00.000] INFO: System Status: HEALTHY
[2023-11-01 10:00:00.001] INFO: Model Version: 2.5.0-onnx-int8
[2023-11-01 10:00:00.002] INFO: Python Version: 3.11.4
[2023-11-01 10:00:00.003] INFO: CUDA Version: 12.2
[2023-11-01 10:00:00.004] INFO: GPU Utilization: 42%
[2023-11-01 10:00:00.005] INFO: Inference Latency p99: 12ms
[2023-11-01 10:00:00.006] INFO: All systems nominal. Go back to sleep.
```