10 Essential Machine Learning Best Practices for Success

POST-MORTEM INCIDENT REPORT #ML-FAIL-0909
DATE: October 14, 2023
INCIDENT DURATION: 48 Hours, 12 Minutes
STATUS: Critical / Post-Recovery
AUTHOR: Senior SRE (Infrastructure & Operations)

I’ve spent the last 48 hours staring at a terminal buffer that smells like burnt silicon and hubris. While the rest of the “Data Science” team was likely dreaming of neural networks and venture capital, the infrastructure team was manually rebuilding the prod_users_v4 database from a cold storage snapshot that was six hours out of date. Why? Because someone decided that a “machine learning” model should have direct, unthrottled write access to our production environment to “optimize storage lifecycle management.”

The incident began at 03:14 UTC when the “Auto-Janitor” service—a black-box machine learning model built on a stack of unpinned dependencies—decided that 98% of our active user records were “anomalous noise.” It didn’t just flag them. It didn’t just move them to a bucket. It executed a DROP TABLE sequence across three shards because the “machine learning” logic concluded that deleting data was the most efficient way to reduce storage costs.

Here is the log that greeted me while I was still trying to find my first cup of coffee:

2023-10-12 03:14:02 [INFO] Auto-Janitor: Starting optimization pass...
2023-10-12 03:14:15 [DEBUG] Feature vectorization complete. Shape: (4500000, 128)
2023-10-12 03:14:20 [WARNING] Model inference returned high anomaly score for 4,412,000 rows.
2023-10-12 03:14:20 [CRITICAL] Policy 'Aggressive-Purge' active. Executing cleanup...
2023-10-12 03:14:21 [ERROR] sqlalchemy.exc.InternalError: (psycopg2.errors.DependentObjectsStillExist) 
cannot drop table users because other objects depend on it
2023-10-12 03:14:21 [INFO] Auto-Janitor: Retrying with CASCADE...
2023-10-12 03:14:25 [SUCCESS] Table 'users' dropped successfully. Storage optimized by 1.4TB.

The “machine learning” model did exactly what it was told to do: it optimized a metric. It just happened to destroy the company’s primary asset to do it. We are now in the “Never Again” phase. If you want to keep your sudo access, read this carefully.

1. The 3 AM Wake-up Call: Anatomy of a Model Failure

The failure wasn’t just a “bug.” It was a systemic collapse of engineering rigor. The model in question was a gradient-boosted decision tree that had been “fine-tuned” by an intern who left three months ago. It was running on a patched-together Python 3.11.4 environment that nobody bothered to containerize properly.

When we looked at the inference logs, the model was receiving null values for the last_login_timestamp feature because of a change in the upstream API. Instead of failing gracefully, the machine learning pipeline interpreted these nulls as “infinity,” which pushed the anomaly score into the 99th percentile. Because there were no sanity checks—no “human-in-the-loop” for destructive actions—the model proceeded to wipe the database.

The underlying issue is that the team treated this machine learning component as a magical oracle rather than a piece of software. Software requires unit tests. Software requires boundary checks. Machine learning requires all of that, plus a healthy dose of paranoia regarding data distribution shifts. We didn’t have paranoia; we had “innovation.”

2. Dependency Hell and the Myth of “Latest” Versions

We found that the production environment was pulling “latest” versions of several critical libraries at runtime. This is amateur hour. The “Auto-Janitor” was running on Python 3.11.4, but the local development environments were still on 3.9. When the environment was rebuilt last week, it pulled pandas 2.1.1 and scikit-learn 1.3.0.

The transition to pandas 2.1.1 changed how Copy-on-Write (CoW) behaves in subtle ways. Our feature engineering script was relying on implicit in-place modifications of dataframes, and under CoW a write made through a derived object no longer reaches the original frame. Because of the version mismatch, the transformed columns never landed in the frame handed to the machine learning model. It was scoring stale values or zeros, leading to the catastrophic misclassification of our user data.

Look at this stack trace from the failed retraining job we found on the Jenkins runner:

Traceback (most recent call last):
  File "train_model.py", line 42, in <module>
    X_train = preprocessor.fit_transform(df)
  File "/usr/local/lib/python3.11/site-packages/sklearn/utils/_set_output.py", line 140, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/sklearn/compose/_column_transformer.py", line 727, in fit_transform
    return self._hstack(list(Xs))
  File "/usr/local/lib/python3.11/site-packages/sklearn/compose/_column_transformer.py", line 843, in _hstack
    return np.hstack(Xs) if any(sparse.issparse(f) for f in Xs) else np.column_stack(Xs)
MemoryError: Unable to allocate 63.3 GiB for an array with shape (5000000, 1700) and data type float64

The machine learning pipeline failed to train, but instead of stopping the deployment, the CI/CD script—written by someone who clearly thinks try/except: pass is a valid design pattern—simply promoted the previous version of the model. That “previous version” was incompatible with the new data schema. We were running a 2022 model on 2023 data with 2024 dependencies. It’s a miracle it didn’t explode sooner.
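For the record, a promotion step that refuses to swallow errors is not hard to write. The following is a minimal sketch, not our actual CI/CD script: the names deploy and DeploymentBlocked, and the idea that a model artifact carries a feature_names list, are assumptions made for illustration.

```python
from typing import Callable, Sequence


class DeploymentBlocked(RuntimeError):
    """Raised when a candidate model must not be promoted."""


def deploy(
    train: Callable[[], dict],
    live_schema: Sequence[str],
    promote: Callable[[dict], None],
) -> dict:
    """Train, check schema compatibility, then promote. Any failure stops the run."""
    try:
        model = train()
    except Exception as exc:
        # Re-raise instead of swallowing: a failed training job fails the pipeline.
        raise DeploymentBlocked(f"training failed: {exc!r}") from exc

    # Hypothetical convention: the artifact records the features it was trained on.
    expected = list(model["feature_names"])
    if expected != list(live_schema):
        raise DeploymentBlocked(
            f"schema mismatch: model expects {expected}, "
            f"pipeline provides {list(live_schema)}"
        )

    promote(model)
    return model
```

The point of the design is that there is no code path where a failure ends in a promotion. If training dies or the schema does not match, nothing gets promoted and the pipeline goes red.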

3. Data Validation: Because Garbage In is Still Garbage Out

Machine learning is not a vacuum. It is a pipe, and if you pump sewage into one end, you get high-velocity sewage out of the other. We had zero validation on the input features. No schema enforcement, no range checks, and no null-handling strategy.

In the case of Incident #ML-FAIL-0909, the feature engineering step for our machine learning model was calculating a “user engagement score.” This involved a division operation where the denominator was days_since_signup. For a new batch of users, this value was 0, and a vectorized floating-point division by zero in pandas yields inf instead of raising. scikit-learn 1.3.0’s input validation (check_array) would normally reject inf, but someone had switched the finiteness check off to “improve performance.”
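A guarded version of that calculation might look like the sketch below. The exact formula is an assumption (events divided by days since signup), as is the decision to count a same-day signup as one elapsed day. The function name and the total_events parameter are hypothetical.

```python
import math
from typing import Optional


def engagement_score(
    total_events: Optional[float], days_since_signup: Optional[float]
) -> float:
    """Events per day since signup. Raises instead of returning inf or NaN."""
    if total_events is None or days_since_signup is None:
        raise ValueError("engagement_score: missing input")
    if total_events < 0 or days_since_signup < 0:
        raise ValueError("engagement_score: negative input")

    # Assumption: a user who signed up today counts as one elapsed day,
    # which keeps the denominator away from zero.
    days = max(days_since_signup, 1.0)
    score = total_events / days

    # Catches NaN or inf that slipped in through the inputs.
    if not math.isfinite(score):
        raise ValueError("engagement_score: non-finite result")
    return score
```

Whether a zero-day user should score as "one day" or be excluded entirely is a product decision. What is not negotiable is that the function returns a finite number or raises.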

We need unit testing for feature engineering. If you are writing a function that transforms raw SQL rows into a feature vector for a machine learning model, that function needs to be tested against:
1. Null values in every column.
2. Categorical values that weren’t in the training set.
3. Numerical outliers (e.g., a user with a signup date in the year 1970 or 2099).
4. Empty datasets.

If your machine learning code can’t handle a NaN, it shouldn’t be within a mile of a production database. We will be implementing mandatory schema validation using Pydantic or Pandera for every single model input. No exceptions.
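Until the Pandera or Pydantic schemas exist, here is a dependency-free sketch of the four checks listed above. It is a stand-in, not the final implementation. The plan column, the KNOWN_PLANS set, and the EARLIEST_SIGNUP bound are hypothetical; only last_login_timestamp and the signup date come from the incident.

```python
from datetime import datetime, timezone

# Hypothetical values: the real categories and bounds come from the training set.
KNOWN_PLANS = {"free", "pro", "enterprise"}
EARLIEST_SIGNUP = datetime(2005, 1, 1, tzinfo=timezone.utc)
REQUIRED = ("signup_date", "last_login_timestamp", "plan")


def validate_batch(rows, now):
    """Return the batch untouched if every row is sane; raise ValueError otherwise."""
    if not rows:
        raise ValueError("empty batch")
    for i, row in enumerate(rows):
        # 1. Nulls in any required column.
        for col in REQUIRED:
            if row.get(col) is None:
                raise ValueError(f"row {i}: null in {col}")
        # 2. Categories the model never saw during training.
        if row["plan"] not in KNOWN_PLANS:
            raise ValueError(f"row {i}: unseen category {row['plan']!r}")
        # 3. Outliers such as signup dates in 1970 or 2099.
        if not EARLIEST_SIGNUP <= row["signup_date"] <= now:
            raise ValueError(f"row {i}: signup_date out of range")
    return rows
```

Each branch maps to one item in the list, so each one gets its own unit test. The function raises on the first bad row on purpose: a batch that is partly garbage is garbage.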

4. The Silent Killer: Feature Drift and Monitoring Gaps

The most infuriating part of this 48-hour hellscape was that the model had been degrading for weeks. But because we were only monitoring “system metrics” (CPU, RAM, Disk I/O), we were blind to the “model metrics.” The CPU usage was fine. The memory was stable—until it wasn’t.

Machine learning requires monitoring the distribution of the data. We should have had alerts for feature drift. If the mean value of “engagement_score” shifts by more than two standard deviations over a 24-hour period, the system should automatically disable the model’s write capabilities.
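The two-sigma rule is simple enough to sketch with the standard library. This assumes "two standard deviations" means the spread of the previous 24-hour window, which is one reasonable reading, not the only one. The function name and the max_sigmas parameter are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def drift_exceeded(
    baseline: Sequence[float], current: Sequence[float], max_sigmas: float = 2.0
) -> bool:
    """True if the current window's mean is more than max_sigmas baseline
    standard deviations away from the baseline mean."""
    base_mean = mean(baseline)
    base_std = stdev(baseline)  # needs at least two baseline points
    shift = abs(mean(current) - base_mean)
    if base_std == 0:
        # A perfectly flat baseline: any movement at all counts as drift.
        return shift > 0
    return shift > max_sigmas * base_std


# Usage: gate the model's write path on the check.
yesterday = [0.10, 0.12, 0.14, 0.11, 0.13]
today = [0.97, 0.98, 0.99]
writes_enabled = not drift_exceeded(yesterday, today)
```

With numbers like the ones we saw in the days before the wipe, writes_enabled comes out False and the model gets demoted to read-only.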

Instead, we had this:

# Inference Log - Oct 10
[INFO] Avg Anomaly Score: 0.12
# Inference Log - Oct 11
[INFO] Avg Anomaly Score: 0.45
# Inference Log - Oct 12
[INFO] Avg Anomaly Score: 0.98
[CRITICAL] Threshold exceeded. Deleting the world.

We saw the average anomaly score climb from 0.12 to 0.98 in two days, an increase of roughly 700%, and our monitoring system said “Everything is green!” because the Python process was still healthy. We need Prometheus exporters for model-specific metrics: prediction distributions, confidence scores, and feature importance shifts. If you’re deploying a machine learning model, you’re deploying a dynamic system. Start treating it like one.

Furthermore, we hit a massive CUDA error on the inference node right before the crash, which should have been our final warning. The GPU memory was fragmented because of a leak in the model-serving wrapper.

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 11.17 GiB total capacity; 10.45 GiB already allocated; 14.56 MiB free; 10.50 GiB reserved in total by PyTorch) 
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. 
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The system tried to failover to CPU inference, which was 100x slower, causing a massive backlog in the message queue. The “Auto-Janitor” then tried to “catch up” by processing records in massive, unvalidated batches, which led to the final database wipe.

5. Reproducibility is Not Optional: Docker, DVC, and Sanity

I asked the lead data scientist for the training script and the dataset used for the current production model. He pointed me to a Jupyter notebook on a decommissioned dev server and a CSV file named data_final_v2_REALLY_FINAL.csv.

This is why we are in this mess. Machine learning models must be reproducible. This means:
1. Data Versioning (DVC): If I cannot pull the exact byte-for-byte dataset used to train a model, that model does not exist. We are implementing DVC (Data Version Control) immediately. Every model artifact in our registry must be linked to a DVC hash.
2. Model Registry: We are using MLflow, but we’re using it wrong. People are tagging models as “production” manually. From now on, the “production” tag is only applied by a CI/CD pipeline after passing a battery of integration tests.
3. Docker: If I see one more “requirements.txt” without version pins, I’m revoking SSH access. Every machine learning model will be shipped in a Docker container with a locked-down image. We are using specific base images, not python:latest.

The “it worked on my laptop” excuse died today. Your laptop doesn’t have a T4 GPU and 128GB of RAM with a specific version of the CUDA driver. If it doesn’t run in the container, it doesn’t run in production.

6. The “Never Again” Manifesto for Production Machine Learning

We are not an “AI-first” company if “AI-first” means “Engineering-last.” From this moment forward, the following rules are written in blood (and the 48 hours of sleep I’ll never get back):

Rule 1: No Destructive Autonomy. No machine learning model will ever have the permission to execute DELETE, DROP, or TRUNCATE commands directly. Models may output “recommendations” to a dead-letter queue or a review table, but a human—or at least a very rigid, non-probabilistic script—must gatekeep the actual execution.
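A rough sketch of what Rule 1 looks like in practice, using sqlite3 as a stand-in for the real database. The purge_review table, the function names, and the 1% ceiling are all hypothetical. The real limit will be set by humans, in a meeting, with minutes.

```python
import sqlite3
from typing import Iterable

MAX_PURGE_FRACTION = 0.01  # hypothetical ceiling on any single cleanup pass


def record_recommendations(conn: sqlite3.Connection, user_ids: Iterable[int]) -> None:
    """The model's only write path: append row ids to a review table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purge_review ("
        "user_id INTEGER PRIMARY KEY, approved INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO purge_review (user_id) VALUES (?)",
        [(uid,) for uid in user_ids],
    )
    conn.commit()


def purge_is_plausible(recommended: int, total_rows: int) -> bool:
    """Deterministic gate: reject any pass that wants more than the ceiling."""
    if total_rows <= 0:
        return False
    return recommended / total_rows <= MAX_PURGE_FRACTION
```

Note what is missing: there is no DELETE anywhere. Rows land with approved = 0, and a request to purge 4,412,000 of 4,500,000 rows fails the gate before a human ever has to look at it.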

Rule 2: Mandatory Model Versioning. Every model in production must have a traceable lineage. This includes the Git hash of the training code, the DVC hash of the training data, the specific versions of libraries (Python 3.11.4, scikit-learn 1.3.0, pandas 2.1.1), and the hyperparameter log. If any of these are missing, the model is considered rogue and will be killed.
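One way to make "traceable lineage" checkable is a record that the registry refuses to accept when any field is empty. A minimal sketch follows. The ModelLineage class, its field names, and the missing_lineage helper are assumptions for illustration, not an MLflow API.

```python
from dataclasses import dataclass, fields
from typing import Dict, List


@dataclass(frozen=True)
class ModelLineage:
    git_commit: str                   # hash of the training code
    dvc_hash: str                     # hash of the training data
    python_version: str               # e.g. "3.11.4"
    library_versions: Dict[str, str]  # e.g. {"pandas": "2.1.1"}
    hyperparameters: Dict[str, float]


def missing_lineage(lineage: ModelLineage) -> List[str]:
    """Names of lineage fields that are empty. A non-empty result means 'rogue'."""
    return [f.name for f in fields(lineage) if not getattr(lineage, f.name)]
```

A CI step can call missing_lineage on every candidate and fail the build if the list is not empty.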

Rule 3: Circuit Breakers for Inference. We are implementing circuit breakers at the infrastructure level. If a model’s output distribution shifts beyond defined thresholds (e.g., if it starts flagging 90% of traffic as malicious or 90% of users as “anomalous”), the inference service will automatically trip and revert to a safe, heuristic-based “fallback” mode.
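The breaker itself can be very small. This is a sketch under assumptions: the class and function names are hypothetical, the 90% threshold comes from the rule above, and the minimum batch size of 100 is an invented guard against tripping on tiny batches.

```python
from typing import Callable, List, Sequence


class OutputCircuitBreaker:
    """Trips when a batch flags too large a share of records; stays open until reset."""

    def __init__(self, max_flag_rate: float = 0.9, min_batch: int = 100) -> None:
        self.max_flag_rate = max_flag_rate
        self.min_batch = min_batch
        self.tripped = False

    def check(self, flags: Sequence[bool]) -> bool:
        """Return True if the model's output for this batch may be used."""
        if not self.tripped and len(flags) >= self.min_batch:
            if sum(flags) / len(flags) >= self.max_flag_rate:
                self.tripped = True
        return not self.tripped

    def reset(self) -> None:
        """Manual reset, meant to be called by a human after investigation."""
        self.tripped = False


def guarded_flags(
    model_flags: Sequence[bool],
    breaker: OutputCircuitBreaker,
    fallback: Callable[[int], List[bool]],
) -> List[bool]:
    """Use model output while the breaker is closed, otherwise the heuristic fallback."""
    if breaker.check(model_flags):
        return list(model_flags)
    return fallback(len(model_flags))
```

The breaker does not close itself. Once it trips, every batch goes through the fallback until someone investigates and calls reset, which is the whole point.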

Rule 4: Feature Engineering is Production Code. Stop treating your preprocessing scripts like throwaway research code. They require docstrings, type hints, and comprehensive unit tests. If your feature engineering pipeline fails on a single malformed row, the whole pipeline must fail loudly and immediately, not silently propagate NaN values into the model.

Rule 5: Real-time Monitoring of Model Health. We will monitor more than just latency. We will monitor prediction drift, label drift, and feature importance. If the “most important feature” for a model suddenly changes from user_activity to random_id, we need to know within minutes, not after the database is gone.

Rule 6: Dependency Lockdown. All machine learning projects must use a lockfile (e.g., poetry.lock, or a requirements.txt generated by pip-compile from pip-tools). If I see a setup.py with install_requires=['pandas'], I will personally delete the repository. We pin to the minor version at a minimum.
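A pre-merge check for unpinned requirements could be as simple as the sketch below. It is deliberately stricter than the rule, since it only accepts exact == pins, and deliberately naive: it skips option lines such as -r or --hash and does not understand environment markers or line continuations. The function name is hypothetical.

```python
import re
from typing import List

# name, optional [extras], then an exact '==' pin
PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?==[^=\s]+$")


def unpinned_requirements(requirements_text: str) -> List[str]:
    """Return every requirement line that is not pinned with '=='."""
    offenders = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):
            continue  # blank lines and pip options
        if not PINNED.match(line):
            offenders.append(line)
    return offenders


# Usage with a small inline file:
sample = "pandas==2.1.1\nscikit-learn==1.3.0\nnumpy>=1.24\nrequests\n# a comment\n"
offenders = unpinned_requirements(sample)
```

For the sample above, the offenders are the numpy and requests lines. A CI job fails the build whenever that list is non-empty.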

Rule 7: Hyperparameter Logging. No more “magic numbers.” Every hyperparameter must be logged in the model registry. We found that the failing model had a learning_rate of 0.0000001 because of a typo in a config file. That left the model effectively untrained after a “hot-fix” retraining session, so its outputs carried no real signal.
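Logging catches the typo after the fact. A bounds check catches it before training starts. A minimal sketch, where the BOUNDS table and its ranges are hypothetical placeholders that each model owner would have to set deliberately:

```python
from typing import Dict, List, Tuple

# Hypothetical sanity ranges; not tuned recommendations.
BOUNDS: Dict[str, Tuple[float, float]] = {
    "learning_rate": (1e-3, 1.0),
    "n_estimators": (10, 5000),
    "max_depth": (1, 32),
}


def check_hyperparameters(params: Dict[str, float]) -> List[str]:
    """Return a list of problems; an empty list means the config may be used."""
    problems = []
    for name, (low, high) in BOUNDS.items():
        if name not in params:
            problems.append(f"{name}: missing")
        elif not low <= params[name] <= high:
            problems.append(f"{name}: {params[name]} outside [{low}, {high}]")
    return problems
```

A learning_rate of 0.0000001 fails this check immediately, which is considerably cheaper than finding out from a dropped table.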

The era of “Machine Learning Wild West” is over. We are engineers. We build reliable systems. If your model is a “black box” that you don’t understand and can’t control, keep it on your local machine. If you try to push unvalidated, unversioned, or unmonitored “machine learning” junk into my production environment again, I will not only revert your commit; I will revoke your sudo access and move your desk to the basement.

Go home. Get some sleep. On Monday, we start fixing this disaster properly.

Sign-off:
Grizzled SRE
Infrastructure Lead
(Sudo access revocation list is currently being drafted)
