# Top Machine Learning Best Practices for Better Models

```text
[2024-10-14 03:14:22,891] ERROR:worker:Process 'Icarus-Inference-Engine-7' terminated with signal 9 (SIGKILL)
[2024-10-14 03:14:22,892] CRITICAL:kernel:Out of Memory (OOM) killer invoked. Victim: python3.11 (pid 4402)
[2024-10-14 03:14:23,004] TRACEBACK:
  File "/app/inference/model_loader.py", line 142, in predict
    features = feature_store.get_latest(user_id)
  File "/app/data/feature_store.py", line 89, in get_latest
    df = pd.read_sql(query, engine)
  File "/usr/local/lib/python3.11/site-packages/pandas/io/sql.py", line 1561, in read_sql
    return pandas_sql.read_query(sql, index_col=index_col, params=params)
MemoryError: Unable to allocate 14.2 GiB for an array with shape (1890221, 1024) and data type float64
[2024-10-14 03:14:25,112] SYSTEM: IPO-Launch-Dashboard status: CRITICAL. Conversion rate: 0.00%.
```

I’m sitting in a dark room at 4:00 AM, staring at a flickering cursor. My eyes feel like they’ve been rubbed with sandpaper. Outside, the sun is threatening to rise on what was supposed to be the most important day in the history of this company—the day we went public. Instead, the S-1 filing is a joke, the bankers are pulling out, and the "Project Icarus" recommendation engine is currently eating itself alive in a Kubernetes cluster that’s hemorrhaging money.

We didn't just fail. We committed technical suicide.

I was the Lead Data Scientist. I’m the one who signed off on the "optimized" weights. I’m the one who told the CTO that we could scale. I lied. Not because I wanted to, but because we had spent eighteen months shoveling technical debt into a furnace to keep the hype train moving. This isn't a post about "learnings" or "growth mindsets." This is an autopsy of a carcass.

## The Dependency Hell We Ignored

It started with the environments. In the early days, we were "agile." Agile is just a corporate euphemism for "we don't have time to do it right." Every researcher on the team was running their own local version of the stack. We had people on M1 Macs, people on Linux workstations, and one guy still trying to make Windows Subsystem for Linux work with CUDA 11.8.

We never locked our versions. We used `pip install` like it was a candy dispenser. By the time we tried to containerize the model for production, the `requirements.txt` was a bloated, contradictory mess of conflicting binaries. We had `scikit-learn 1.4.2` trying to talk to a feature engineering script written for `scikit-learn 0.24`.

When we finally pushed to the staging environment, the entire thing collapsed because `protobuf` decided to break backward compatibility for the tenth time that year. We "fixed" it by pinning versions at random until the errors stopped. We didn't solve the problem; we just buried it under a layer of brittle hacks.
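
For the record, a check like the one below would have flagged the drift long before staging. It is a minimal sketch, not anything we actually ran: it assumes a plain `name==version` requirements file, and `parse_pins` and `find_mismatches` are names I made up for illustration. It only compares pins against what `importlib.metadata` reports for the current environment; it does not resolve conflicts between packages.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Parse 'name==version' lines, ignoring comments and blank lines."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "==" not in line:
            continue  # unpinned or empty lines are skipped, not guessed at
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins, installed=None):
    """Return {name: (pinned, actual_or_None)} for every pin that is not met.

    `installed` is an optional {name: version} dict (handy for tests);
    when omitted, versions are looked up in the running environment.
    """
    mismatches = {}
    for name, pinned in pins.items():
        if installed is not None:
            actual = installed.get(name)
        else:
            try:
                actual = metadata.version(name)
            except metadata.PackageNotFoundError:
                actual = None
        if actual != pinned:
            mismatches[name] = (pinned, actual)
    return mismatches
```

Run inside the production image as a CI step, a non-empty result is a reason to fail the build instead of pinning at random until the errors stop.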

```text
# The "Final" pip freeze that killed us
numpy==1.26.4
pandas==2.2.1
scikit-learn==1.4.2
torch==2.2.0+cu121
tensorflow==2.16.1
protobuf==4.25.3
grpcio==1.62.1
# Why is this here? Nobody knows.
joblib==1.3.2
# This version conflicts with the transformer layer but we forced it anyway
transformers==4.38.2
```

## Feature Store or Feature Swamp?

The marketing team called it our “Real-Time Intelligence Layer.” Internally, we called it the Swamp. We didn’t have a feature store. We had a collection of unoptimized SQL queries and a series of CSV files sitting in an S3 bucket that were updated by “cron jobs” that failed 30% of the time.

The biggest sin was the data leakage. We were predicting user churn. Our model had an AUC of 0.98 in training. We were heroes. The CEO was showing the charts to investors, bragging about our “predictive moat.”

The reality? One of the junior engineers had included `is_active_customer` as a feature in the training set. We were literally using the answer to predict the question. Because we didn’t have a formal pipeline or a feature registry, this “leak” stayed in the code for six months. When we finally caught it and removed the feature, the AUC dropped to 0.52. We were basically flipping a coin, but the IPO roadshow had already started. We couldn’t tell the truth then. We just started “tuning” (read: over-fitting) until the numbers looked respectable again.
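
A leak this blatant is cheap to catch. The sketch below scores every feature on its own and flags any that nearly separates the labels single-handedly. It is an illustration under assumptions, not our pipeline: the row format (a list of dicts with numeric features and a 0/1 label), the `suspicious_features` name, and the 0.95 threshold are all mine. The pairwise AUC is quadratic, so it is meant for a sample, not the full table.

```python
def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def suspicious_features(rows, label_key, threshold=0.95):
    """Flag features whose standalone AUC is suspiciously close to 0 or 1."""
    labels = [row[label_key] for row in rows]
    flagged = {}
    for name in rows[0]:
        if name == label_key:
            continue
        score = auc([row[name] for row in rows], labels)
        # A perfectly anti-correlated feature leaks just as much as a
        # perfectly correlated one, so look at the distance from 0.5.
        if max(score, 1.0 - score) >= threshold:
            flagged[name] = score
    return flagged


rows = [
    {"is_active_customer": 0, "sessions": 3, "churned": 1},
    {"is_active_customer": 1, "sessions": 5, "churned": 0},
    {"is_active_customer": 0, "sessions": 4, "churned": 1},
    {"is_active_customer": 1, "sessions": 2, "churned": 0},
]
print(suspicious_features(rows, "churned"))
```

On this toy data only `is_active_customer` is flagged. A flagged feature is not proof of a leak, but it is exactly the kind of thing somebody should have had to justify in review.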

We ignored the basic hygiene of machine learning. We didn’t version our data. We didn’t have a schema for our features. We just kept dumping raw JSON into a Snowflake instance and hoped the dbt models would sort it out. They didn’t. They just made the garbage more expensive to store.

## The Reproducibility Crisis: We Built a Ghost

Six weeks before the IPO attempt, the “Golden Model” stopped working. This was the specific iteration of our XGBoost model (using `xgboost 2.0.3`) that had the best performance on the holdout set. We needed to retrain it on the latest data.

But we couldn’t.

The original researcher had left for an HFT firm in Chicago. He didn’t leave a Dockerfile. He didn’t leave a random seed. He didn’t even leave the original preprocessing script. He had done some “manual cleaning” in a Jupyter Notebook that was now a 404 error on a decommissioned server.

We spent three weeks trying to replicate his results. We matched the hyperparameters. We matched the data splits. We even tried to match the exact pandas version he had been using, `pandas 1.5.3`. It didn’t matter. The model was a ghost. Every time we retrained, the weights shifted. The performance fluctuated. We were chasing a phantom because we treated our models like artisanal crafts instead of engineered products.

In ML, if you can’t reproduce it, it doesn’t exist. We had a production environment running a binary that no one in the company knew how to recreate. That is the definition of a ticking time bomb.
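
The bare minimum we skipped looks something like this: fix the seed, fingerprint the exact training bytes, and write both down next to the parameters. This is a sketch with made-up names (`build_manifest`, `train_stub`), and `train_stub` is only a stand-in for a real training run. Real frameworks have their own seed settings, and GPU training can stay nondeterministic even with a fixed seed, so treat this as the floor, not the ceiling.

```python
import hashlib
import json
import platform
import random


def fingerprint(data_bytes):
    """SHA-256 of the exact bytes the model was trained on."""
    return hashlib.sha256(data_bytes).hexdigest()


def build_manifest(seed, data_bytes, params):
    """Everything needed to attempt the same run again, as plain JSON types."""
    return {
        "seed": seed,
        "data_sha256": fingerprint(data_bytes),
        "params": params,
        "python": platform.python_version(),
    }


def train_stub(seed, n=5):
    """Stand-in for training: all randomness comes from one seeded generator."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


manifest = build_manifest(42, b"user_id,churned\n1,0\n", {"max_depth": 6})
print(json.dumps(manifest, indent=2))
```

Committing that JSON alongside the model artifact costs nothing, and it is the difference between retraining a model and holding a séance for one.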

## Data Leakage: The False Prophet of 99% Accuracy

The pressure to perform for the board led to a culture of “metric hacking.” When the model didn’t hit the targets, we didn’t look for better features; we looked for ways to make the test set easier.

We had temporal leakage that would make a freshman CS student weep. We were using future information to predict past events because our “point-in-time” joins were broken. Our SQL joins were missing the `WHERE event_timestamp < prediction_timestamp` clause in half the queries.
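
Here is the difference that one condition makes, as a self-contained sketch using an in-memory SQLite database. The `events` and `predictions` tables and their contents are invented for the example; only the `event_timestamp < prediction_timestamp` condition comes from our actual mess. I put the condition in the `ON` clause of a `LEFT JOIN` instead of a `WHERE`, which is a design choice: it keeps users with no prior events in the result with a count of zero.

```python
import sqlite3

# Correct: only events strictly before the prediction time are visible.
POINT_IN_TIME_SQL = """
SELECT p.user_id, COUNT(e.event_id) AS events_seen
FROM predictions AS p
LEFT JOIN events AS e
  ON e.user_id = p.user_id
 AND e.event_timestamp < p.prediction_timestamp
GROUP BY p.user_id
ORDER BY p.user_id
"""

# Leaky: the timestamp condition is missing, so future events are counted.
LEAKY_SQL = """
SELECT p.user_id, COUNT(e.event_id) AS events_seen
FROM predictions AS p
LEFT JOIN events AS e
  ON e.user_id = p.user_id
GROUP BY p.user_id
ORDER BY p.user_id
"""


def build_demo_db():
    """Create a tiny in-memory database with hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE events (
            event_id INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL,
            event_timestamp TEXT NOT NULL
        );
        CREATE TABLE predictions (
            user_id INTEGER NOT NULL,
            prediction_timestamp TEXT NOT NULL
        );
        """
    )
    # ISO-8601 date strings sort correctly as text.
    conn.executemany(
        "INSERT INTO events (user_id, event_timestamp) VALUES (?, ?)",
        [(1, "2024-01-01"), (1, "2024-02-01"), (1, "2024-03-15"), (2, "2024-04-01")],
    )
    conn.executemany(
        "INSERT INTO predictions (user_id, prediction_timestamp) VALUES (?, ?)",
        [(1, "2024-03-01"), (2, "2024-03-01")],
    )
    return conn


def events_seen(conn, sql):
    return conn.execute(sql).fetchall()
```

With the demo data, the point-in-time query returns `[(1, 2), (2, 0)]` while the leaky one returns `[(1, 3), (2, 1)]`. Every extra event in the second result is information the model could not have had at prediction time.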

The result was a model that looked like a god in the lab and acted like a drunkard in the field. When we went live for the “Beta Launch” during the IPO week, the model started recommending winter coats to people in Florida during a heatwave. Why? Because the training data was heavily skewed by a botched data migration from three years ago that we never bothered to clean up. We just “shoveled” more data into the model, thinking volume would compensate for quality. It’s the classic mistake: thinking that a bigger pile of trash will eventually turn into gold.

## The Production Meat Grinder: OOM Kills and Zombie Processes

Then came the actual deployment. We didn’t have an MLOps team. We had “DevOps” guys who hated Python and “Data Scientists” who didn’t know what a pointer was.

We deployed our PyTorch models (version 2.2.0) inside a Flask wrapper. Flask. For a high-throughput inference engine. It was like trying to power a cruise ship with a lawnmower engine.

The memory leaks were catastrophic. Python’s garbage collection couldn’t keep up with the massive tensors we were throwing around. We didn’t use a model server like TorchServe or NVIDIA Triton. No, we wrote a custom wrapper because we thought we were clever.

At 3:00 AM, the OOM kills started. The system would spin up a new pod, it would load the 8GB model into VRAM, handle three requests, and then the memory would spike until the Linux kernel stepped in and executed the process.

```text
# Log from the Load Balancer during the crash
[03:14:10] GET /v1/predict/user/88219 - 200 OK (450ms)
[03:14:12] GET /v1/predict/user/99102 - 500 Internal Server Error (12000ms)
[03:14:15] GET /v1/predict/user/10223 - 503 Service Unavailable
[03:14:18] WARNING: High Latency detected on all nodes.
[03:14:20] CRITICAL: Node-4 has entered a CrashLoopBackOff state.
```

The CEO was screaming in the #war-room Slack channel. “Why is the site down? The bankers are watching!” I was trying to explain the difference between CUDA 12.1 and 11.8 drivers to a man who thinks “The Cloud” is an actual cloud. He didn’t care about the technical debt. He cared about the stock price that was currently evaporating.

We had zombie processes everywhere. We were using `multiprocessing` in Python to try and handle the load, but we weren’t cleaning up the child processes. Each one was holding onto a slice of the GPU memory. Eventually, the entire cluster was just a graveyard of hung processes, and no amount of `kubectl delete pod` could save us.

## The IPO Death Spiral: Why Technical Debt is a High-Interest Loan

The failure of Project Icarus wasn’t a single bug. It was the cumulative weight of a thousand “we’ll fix it later” decisions.

- We didn’t have unit tests for our data.
- We didn’t have monitoring for model drift.
- We didn’t have a standardized environment.
- We didn’t have a way to roll back a bad deployment.
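
Drift monitoring in particular does not need a platform. One common approach is the Population Stability Index, which compares the binned distribution of a feature in production against a reference sample. The version below is a minimal sketch under my own assumptions: equal-width bins taken from the reference sample, live values clamped into the outer bins, and a small `eps` so empty bins don’t blow up the logarithm. A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift, but that cutoff is a convention, not a law.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference and a live sample."""
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("reference sample has no spread")
    width = (hi - lo) / bins

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range values into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor each proportion at eps so log() stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


reference = list(range(100))
shifted = [v + 50 for v in reference]
print(psi(reference, reference), psi(reference, shifted))
```

Identical samples score exactly zero, and the shifted sample scores far above any sensible alert threshold. Run per feature on a schedule, that is already more monitoring than we had.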

When the model started failing, we didn’t even know it was failing for the first four hours. Our “monitoring” was just a dashboard that checked if the HTTP status was 200. It didn’t check if the predictions made any sense. The model was returning NaN for 40% of the requests, and our system was just passing those NaNs right to the front end. The UI broke. The checkout button disappeared. The IPO was dead in the water.
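
The missing guard is embarrassingly small. The sketch below assumes the model returns a single probability-like float per request, which is my simplification; `PredictionGuard`, `respond`, and the fallback list are hypothetical names. The point is that an invalid score never reaches the front end, and the invalid rate becomes a number a dashboard can alert on instead of something hidden behind a 200.

```python
import math


class PredictionGuard:
    """Counts how many model outputs were unusable."""

    def __init__(self):
        self.total = 0
        self.invalid = 0

    def check(self, score):
        """Return True if the score is a finite float in [0, 1]."""
        self.total += 1
        ok = (
            isinstance(score, float)
            and math.isfinite(score)
            and 0.0 <= score <= 1.0
        )
        if not ok:
            self.invalid += 1
        return ok

    @property
    def invalid_rate(self):
        return self.invalid / self.total if self.total else 0.0


def respond(guard, score, fallback_items):
    """Build a (status, body) pair; fall back instead of shipping NaN."""
    if guard.check(score):
        return 200, {"score": score, "degraded": False}
    return 200, {"items": fallback_items, "degraded": True}


guard = PredictionGuard()
print(respond(guard, 0.7, ["popular-item"]))
print(respond(guard, float("nan"), ["popular-item"]))
print(guard.invalid_rate)
```

Whether a degraded response should still be a 200 is debatable. What is not debatable is that `invalid_rate` sitting at 40% should page someone within minutes, not four hours.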

The bankers saw the “technical instability” and the “lack of scalable infrastructure” and they did what bankers do: they protected their own skin. They lowered the valuation, then they delayed the offering, and then they just stopped calling.

## The Hyperparameter Tuning Rabbit Hole

In the final weeks, we became obsessed with hyperparameter tuning as a Hail Mary. We thought if we could just squeeze another 0.5% of accuracy out of the model, the “business value” would magically fix the broken infrastructure.

We ran thousands of Optuna (version 3.6.1) trials. We burned through $50,000 of AWS credits in a weekend. We were searching for the perfect combination of `learning_rate`, `max_depth`, and `subsample`.

It was a distraction. We were polishing the brass on the Titanic. It didn’t matter if the `learning_rate` was 0.001 or 0.0005 if the input data was corrupted by a race condition in the feature pipeline. We found a “winning” set of parameters, but when we tried to deploy them, we realized the training script had a hard-coded path to a local directory on a laptop that had been wiped.
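
The hard-coded path is the easiest of these to avoid. A minimal sketch, assuming the data location is passed through an environment variable: `ICARUS_DATA_DIR` and `resolve_data_dir` are names I invented for the example, not anything in our repository. The idea is to fail loudly at startup instead of silently depending on somebody’s laptop.

```python
import os
from pathlib import Path


def resolve_data_dir(env=None, var="ICARUS_DATA_DIR"):
    """Read the data directory from the environment and fail fast if unusable."""
    env = os.environ if env is None else env
    raw = env.get(var)
    if not raw:
        raise RuntimeError(f"{var} is not set; refusing to guess a local path")
    path = Path(raw)
    if not path.is_dir():
        raise RuntimeError(f"{var}={raw!r} is not an existing directory")
    return path
```

Passing `env` explicitly makes the function testable without touching the real environment.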

This is the reality of “Burnout Data Science.” You spend your time fighting the tools instead of solving the problems. You use `pickle` to save your models because it’s easy, then you realize you can’t load them because the production environment has a slightly different version of a dependency, and `pickle` is a security nightmare anyway. We should have used ONNX. We should have used MLflow. We should have used our brains.

## The Aftermath: Shoveling the Wreckage

I quit yesterday. I didn’t even give two weeks’ notice. I just sent an email with the root password to the production cluster and a link to the documentation that I never finished writing.

Project Icarus is being “sunsetted.” The company is looking for a buyer, probably some legacy firm that wants to “acquire the AI talent.” They don’t realize the talent is spent. We are all shells of ourselves, haunted by the sound of Slack notifications and the sight of a red Grafana dashboard.

If you are a Data Scientist reading this, take it as a warning. When they tell you to “move fast and break things,” remember that you are the one who has to fix them at 3:00 AM. When they tell you that “data quality is a secondary concern,” they are lying. When they tell you that “we don’t need a formal deployment pipeline,” they are setting you on fire.

Machine learning is not a magic wand. It is a high-maintenance, temperamental engine that requires rigorous engineering, absolute transparency, and a level of discipline that most “fast-growing” startups are unwilling to provide.

We flew too close to the sun with wings made of unversioned Python scripts and “good enough” data. Now, we’re just another crater in the history of the tech bubble.

```text
# Final state of the Icarus repository
$ git log --oneline -n 5
f3a2b1c (HEAD -> master) fix: try to stop the OOM kills by reducing batch size to 1
a4d5e6f hotfix: remove the leaked feature again (who put this back?)
b7c8d9e emergency: bypass the feature store and read directly from S3
c0d1e2f docs: update README (just kidding, it's still empty)
e3f4g5h initial commit: Project Icarus - The Future of Churn Prediction
```

The future is currently a 503 Service Unavailable error. I’m going to go sleep for a month. Don’t call me. Don’t even send an automated email. I’ve had enough of “intelligence” for one lifetime. I’m going to go find a job where the only “models” I deal with are made of wood and don’t require CUDA 12.1 to function.

The technical debt has been called in. And we are all bankrupt.

## Technical Appendix: The Tools That Failed Us (Because We Failed Them)

- Scikit-learn 1.4.2: Used for the initial prototyping. Great tool, but we treated it like a black box. We didn’t understand the underlying algorithms, so when the `RandomForest` started over-fitting, we just added more trees until the memory exploded.
- Pandas 2.2.1: The backbone of our feature engineering. We used it for everything, including things that should have been done in SQL. We were loading 20GB dataframes into 16GB of RAM and wondering why the kernel was killing our processes.
- PyTorch 2.2.0+cu121: Our deep learning framework. We didn’t use Lightning or any abstraction. We wrote raw training loops that were full of logic errors and didn’t handle device placement correctly.
- TensorFlow 2.16.1: Used by the “legacy” team. We tried to bridge the two frameworks using a custom bridge that was basically just `numpy` arrays passed through a socket. The latency was astronomical.
- Docker: We used it, but we didn’t understand it. Our images were 12GB because we kept installing `nvidia-cuda-toolkit` inside the container instead of using the base images correctly.
- Kubernetes: The “orchestrator” that mostly just orchestrated our demise. We didn’t set resource limits or requests correctly, so one runaway model would starve the entire cluster of resources.
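
On the Pandas point, the arithmetic we never did fits in two functions. This is a back-of-the-envelope sketch: it estimates the size of a dense numeric array only, ignores index and object-column overhead, and the 50% `headroom` default is my own assumption to leave room for the copies that dataframe operations tend to make.

```python
def estimate_gib(rows, cols, bytes_per_value=8):
    """Dense array size in GiB; 8 bytes per value corresponds to float64."""
    return rows * cols * bytes_per_value / 2**30


def fits_in_memory(rows, cols, available_gib, bytes_per_value=8, headroom=0.5):
    """True if the array needs at most `headroom` of the available memory."""
    return estimate_gib(rows, cols, bytes_per_value) <= available_gib * headroom


# 2**20 rows of 1024 float64 features against a 16 GiB machine.
print(estimate_gib(2**20, 1024))        # 8.0
print(fits_in_memory(2**20, 1024, 16))  # True, but with nothing to spare
print(fits_in_memory(2**20, 1024, 8))   # False
```

If the answer is no, the options are the boring ones: push the aggregation into SQL, read in chunks, or use a narrower dtype. Hoping the kernel is feeling generous is not on the list.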

This is not a “how-to” guide. This is a “how-not-to.” If you see your team doing any of this, run. Or at least, make sure your resume is updated and your LinkedIn is set to “Open to Work.” Because the meltdown is coming, and no amount of “vibrant” corporate culture can stop a SIGKILL.
