Machine Learning Best Practices – Guide

Your Machine Learning Model is a Memory Leak Waiting to Happen

I once spent seventy-two hours straight debugging a “ghost in the machine” that was costing a fintech client $4,200 every hour. We had just pushed a new credit-scoring model to production. On paper, the metrics were flawless—98% precision, great recall, and the data scientists were high-fiving in the Slack channel. But thirty minutes after the kubectl apply finished, the nodes started screaming. The Kubelet was OOM-killing the inference pods, but the memory usage reported by the Python process didn’t explain why. We were using a standard pickle load of a 2GB Random Forest model, and for some reason, the resident set size (RSS) was ballooning to 12GB per pod.

The culprit? A combination of copy-on-write behavior after the workers forked and a massive feature matrix that was being duplicated in every worker process. We hadn’t accounted for how the gunicorn workers pre-loaded the model: memory we assumed was shared was being copied into each worker. I ended up rewriting the loading logic to use numpy.memmap so the workers could read the model weights directly from disk without sucking the RAM dry. It was a messy, low-level fix for a “high-level” technology. That’s the reality of machine learning in production: it’s 10% math and 90% fighting with Linux primitives and garbage collection.
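
For illustration, here is a minimal sketch of the memory-mapping idea, assuming the weights can be stored as a raw float32 array. The file name, the helper names, and the tiny matrix are all hypothetical stand-ins, not the code from that incident.

```python
import os
import tempfile

import numpy as np


def save_weights(path, weights):
    """Write a float32 array to disk as raw bytes and return its shape."""
    arr = np.ascontiguousarray(weights, dtype=np.float32)
    arr.tofile(path)
    return arr.shape


def open_weights(path, shape):
    """Map the file read-only; pages are loaded lazily and backed by the OS page cache."""
    return np.memmap(path, dtype=np.float32, mode="r", shape=shape)


# Hypothetical stand-in for a large weight matrix.
weights = np.arange(12, dtype=np.float32).reshape(3, 4)
weights_path = os.path.join(tempfile.mkdtemp(), "weights.bin")
shape = save_weights(weights_path, weights)

mapped = open_weights(weights_path, shape)
row_sum = float(mapped[1].sum())  # only the pages that are touched get read
```

Because the mapping is read-only and file-backed, every worker that opens the same file can be served from the same cached pages instead of holding a private copy.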

The Environment: Why Your Dockerfile is a Liability

Most machine learning tutorials tell you to start with FROM python:3.9 or, god forbid, FROM alpine. If you use Alpine for ML, you are signing up for a world of hurt. Machine learning libraries like numpy, scipy, and pandas rely heavily on C extensions. Alpine uses musl instead of glibc, so the standard manylinux wheels on PyPI don’t apply. Some projects now publish musllinux wheels, but coverage across the ML ecosystem is uneven, and the first dependency without one sends pip back to building from source. Your CI/CD pipeline will spend forty minutes compiling C and C++ code, only to fail because some obscure header file is missing. Use python:3.11-slim-bookworm. It’s Debian-based, it has glibc, and the image is small enough not to choke your container registry.

Then there’s the dependency hell. If I see one more requirements.txt with scikit-learn>=1.0, I’m going to retire and become a carpenter. In ML, a minor version bump in a dependency can change the default behavior of an optimizer, silently breaking your model’s predictions without throwing a single error. You need deterministic builds. Use poetry.lock or pip-compile from pip-tools. You need to know exactly which version of threadpoolctl is running in your container.

Pro-tip: Set OMP_NUM_THREADS explicitly in your Dockerfile environment variables, and start with 1 when you run several workers per container. Many ML libraries parallelize operations using OpenMP and size their thread pool from the CPU count they detect. If your container is restricted to 2 CPUs by Kubernetes but the library sees 64 cores on the host, it will spawn 64 threads, leading to massive context-switching overhead and degraded performance.

# A sane Dockerfile for ML inference
FROM python:3.11-slim-bookworm

RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    libgomp1 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY pyproject.toml poetry.lock /app/
# --no-root: install dependencies only; the project source is copied in a later layer
RUN pip install --no-cache-dir poetry && \
    poetry config virtualenvs.create false && \
    poetry install --only main --no-root --no-interaction --no-ansi

COPY ./src /app/src
ENV OMP_NUM_THREADS=1
ENV PYTHONUNBUFFERED=1

CMD ["gunicorn", "-k", "uvicorn.workers.UvicornWorker", "--workers", "4", "--bind", "0.0.0.0:8000", "src.main:app"]
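
If you would rather derive the thread count from the container’s CPU limit than hard-code it, something like the following sketch could run at the top of your entrypoint. It assumes cgroup v2, where the limit is exposed in /sys/fs/cgroup/cpu.max as a quota and a period in microseconds; the function names are mine.

```python
import math
import os


def threads_from_cpu_max(cpu_max_text, host_cpus):
    """Derive a thread count from the contents of a cgroup v2 cpu.max file.

    The file holds "<quota> <period>" in microseconds, or "max <period>"
    when no limit is set.
    """
    quota, period = cpu_max_text.split()
    if quota == "max":
        return host_cpus
    return max(1, min(host_cpus, math.floor(int(quota) / int(period))))


def detect_thread_count(path="/sys/fs/cgroup/cpu.max"):
    """Read the limit if the file exists; otherwise fall back to the host CPU count."""
    host_cpus = os.cpu_count() or 1
    try:
        with open(path) as f:
            return threads_from_cpu_max(f.read(), host_cpus)
    except (OSError, ValueError):
        return host_cpus


# Set this before importing numpy, scipy, or torch; a value already present
# in the environment (for example from the Dockerfile) is left untouched.
os.environ.setdefault("OMP_NUM_THREADS", str(detect_thread_count()))
```

With a 2-CPU limit the file reads "200000 100000", which yields 2 threads regardless of how many cores the host has. Divide further if you run several workers in the same container.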

The Data Pipeline: Versioning is Not Just for Code

In traditional software, Git is enough. In machine learning, Git is a joke. You cannot commit a 50GB CSV to a repository, and you certainly shouldn’t be pulling data from an S3 bucket using a “latest” tag. If you can’t recreate the exact dataset used to train model v2.4.1, you don’t have a production system; you have a science project. I’ve seen teams lose weeks of work because they “cleaned” a table in Snowflake and didn’t realize it changed the distribution of a feature used by a live model.

  • DVC (Data Version Control): Treat your data like code. DVC creates metadata files that you can commit to Git, which point to specific versions of files in S3 or GCS. It’s the only way to maintain sanity.
  • Feature Stores: If you’re calculating “average spend in last 24 hours” in a SQL query for training and in a Python loop for inference, you’ve already failed. The logic will diverge. Use a feature store like Feast or just a shared library that handles the transformation for both paths.
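
The shared-library option can be as small as one function that both paths import. A minimal sketch, assuming transactions arrive as (timestamp, amount) pairs; the function and field names are hypothetical.

```python
from datetime import datetime, timedelta


def average_spend_last_24h(transactions, now):
    """Mean transaction amount in the 24 hours up to `now` (0.0 if there are none)."""
    cutoff = now - timedelta(hours=24)
    recent = [amount for ts, amount in transactions if cutoff < ts <= now]
    return sum(recent) / len(recent) if recent else 0.0


def build_training_row(transactions, label, as_of):
    """Training path: compute the feature as of a historical timestamp."""
    return {"avg_spend_24h": average_spend_last_24h(transactions, as_of), "label": label}


def build_serving_features(transactions, now):
    """Serving path: the same function, evaluated at request time."""
    return {"avg_spend_24h": average_spend_last_24h(transactions, now)}


now = datetime(2024, 1, 2, 12, 0)
history = [
    (datetime(2024, 1, 1, 9, 0), 500.0),   # older than 24 hours, excluded
    (datetime(2024, 1, 1, 18, 0), 20.0),
    (datetime(2024, 1, 2, 11, 0), 40.0),
]
train_row = build_training_row(history, label=0, as_of=now)
serve_row = build_serving_features(history, now)
```

The point is not the arithmetic; it is that there is exactly one definition of the window boundaries and the empty-window default, so the two paths cannot drift apart.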

The “Training-Serving Skew” is the silent killer. You train on a snapshot of a database where nulls were filled with the mean. In production, the API receives a null, and your code throws a ValueError because the production environment doesn’t have the “mean” value from three months ago cached anywhere. You must bundle your preprocessing parameters (the scalers, the encoders) with the model itself. If you use scikit-learn, use a Pipeline object. Don’t export the model and the scaler as two separate files. They are a single unit of execution.
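
A minimal sketch of that bundling with scikit-learn, using a toy dataset and a LogisticRegression as a stand-in for whatever estimator you actually train:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy training data with one missing value in the second column.
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
y_train = np.array([0, 0, 1, 1])

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # learns the training means
    ("scale", StandardScaler()),                 # learns the training mean and variance
    ("model", LogisticRegression()),
])
pipeline.fit(X_train, y_train)

# At serving time a null is filled with the mean learned during training,
# because that mean travels inside the fitted pipeline object.
prediction = pipeline.predict(np.array([[2.5, np.nan]]))
learned_mean = pipeline.named_steps["impute"].statistics_[1]
```

Whatever serialization format you choose, you serialize the fitted pipeline, not the bare estimator.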

Serialization: Pickle is a Security Risk and a Versioning Nightmare

We need to talk about pickle. It is the default way to save Python objects, and it is terrible for machine learning. First, it’s insecure. Loading a pickle file can execute arbitrary code. If someone compromises your S3 bucket and replaces your model with a malicious pickle, they have RCE (Remote Code Execution) on your inference nodes. Second, pickle is tied to the Python class structure. If you rename a class in src/models/classifier.py, your old models won’t load anymore.
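
If the “arbitrary code” part sounds abstract, here is a deliberately harmless demonstration. The class name is made up, and the callable it smuggles in is just os.getcwd.

```python
import os
import pickle


class NotAModel:
    """On load, pickle calls whatever callable __reduce__ names."""

    def __reduce__(self):
        # Harmless here; an attacker would name something like os.system instead.
        return (os.getcwd, ())


payload = pickle.dumps(NotAModel())

# Loading the bytes does not rebuild a NotAModel; it runs os.getcwd()
# and hands back the return value.
loaded = pickle.loads(payload)
```

Nothing in the loading code asked for a function call. The file decided that, which is exactly why a writable model bucket is an attack surface.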

Instead, look at ONNX (Open Neural Network Exchange). It’s a cross-platform format. You can train a model in PyTorch and run it in a high-performance C++ runtime or even in the browser. It forces you to define your inputs and outputs strictly. If you must stay in Python-land, use joblib with mmap_mode='r' for large arrays, but be aware of the versioning constraints. For deep learning, Safetensors from the Hugging Face team is the current gold standard because it prevents the RCE risks of pickle and is incredibly fast at loading.

# Example of exporting to ONNX to avoid pickle hell
import torch
import torch.onnx

def export_model(model, dummy_input, path="model.onnx"):
    model.eval()
    torch.onnx.export(
        model, 
        dummy_input, 
        path, 
        export_params=True, 
        opset_version=12, 
        do_constant_folding=True, 
        input_names=['input'], 
        output_names=['output'],
        dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}}
    )
    print(f"Model exported to {path}")

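# In production, use onnxruntime
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")
# input_numpy_array: a NumPy array matching the exported input,
# typically float32 with the batch dimension first
results = session.run(None, {"input": input_numpy_array})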

The API Layer: FastAPI and the Pydantic Tax

Everyone uses FastAPI now. It’s great. It’s fast. But people misuse it in machine learning contexts. They define these massive Pydantic models for the request body, which is fine for a CRUD app, but when you’re sending a 1000-element vector as a JSON list, Pydantic’s validation becomes a massive bottleneck. I’ve seen Pydantic validation take longer than the actual model inference.

If you are dealing with high-throughput machine learning, stop sending raw JSON arrays. Use Protobuf or even just a binary blob if you can. If you must use JSON, use an orjson- or ujson-backed response class (FastAPI ships ORJSONResponse and UJSONResponse). Also, for the love of all that is holy, do not run your model inference directly in the async def endpoint. Most ML libraries are CPU-bound and do not play nice with Python’s asyncio loop. They will block the loop, and each worker of your “high-performance” API will handle exactly one request at a time.

Use starlette.concurrency.run_in_threadpool or just define your endpoint with def instead of async def so FastAPI runs it in a separate thread. Better yet, use a dedicated inference server like NVIDIA Triton or TorchServe if you’re at scale. They handle batching, model versioning, and GPU memory management much better than a custom FastAPI wrapper ever will.
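
The principle is easy to see with nothing but the standard library. This sketch uses asyncio.to_thread (Python 3.9+) as a stand-in for run_in_threadpool, and time.sleep as a stand-in for a blocking inference call; both substitutions are mine.

```python
import asyncio
import time


def blocking_inference(x):
    """Stand-in for a blocking model call; time.sleep holds the calling thread."""
    time.sleep(0.2)
    return x * 2


async def handle_request(x):
    # Offload the blocking call so the event loop stays free for other requests.
    return await asyncio.to_thread(blocking_inference, x)


async def serve_batch(inputs):
    return await asyncio.gather(*(handle_request(x) for x in inputs))


start = time.perf_counter()
outputs = asyncio.run(serve_batch([1, 2, 3, 4]))
elapsed = time.perf_counter() - start
```

Four 0.2-second calls finish in roughly the time of one, because none of them sits on the event loop. Call blocking_inference directly inside handle_request and they run back to back. Threads only help when the underlying library releases the GIL during the heavy work, which native numeric libraries typically do; pure-Python number crunching will not speed up this way.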

Note to self: When using FastAPI with Gunicorn, remember that --workers should usually be (2 x $num_cores) + 1, but for ML, this is often too many. ML models are heavy. If each worker loads a 4GB model, and you have 16 cores, you’ll need 132GB of RAM just for the workers. Calculate your memory overhead before you scale the worker count.
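
The arithmetic is worth keeping next to your deployment config. A small sketch, assuming each worker holds its own full copy of the model; the 64GB budget and 8GB reserve are made-up numbers.

```python
def default_worker_count(num_cores):
    """The usual gunicorn rule of thumb: (2 x cores) + 1."""
    return 2 * num_cores + 1


def memory_needed_gb(workers, model_gb):
    """Memory for the model copies alone, one per worker."""
    return workers * model_gb


def max_workers_for_budget(ram_budget_gb, model_gb, reserve_gb=0.0):
    """How many workers fit once some RAM is reserved for everything else."""
    return max(0, int((ram_budget_gb - reserve_gb) // model_gb))


workers = default_worker_count(16)                         # 33
needed = memory_needed_gb(workers, 4)                      # 132
affordable = max_workers_for_budget(64, 4, reserve_gb=8)   # 14
```

On a 64GB node the rule of thumb asks for more than twice the workers you can actually afford.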

Monitoring: 200 OK Does Not Mean the Model is Working

In SRE, we care about the “Golden Signals”: Latency, Traffic, Errors, and Saturation. In machine learning, you can have perfect Golden Signals while your model is spitting out absolute garbage. This is called “Silent Failure.” Your API returns a 200 OK, the latency is a crisp 40ms, but the model is predicting that every single customer is a fraudster because the input distribution shifted.

You need to monitor Data Drift (often loosely called model drift) and Concept Drift. Data drift happens when the data your model sees in production is different from the data it was trained on. Concept drift happens when the relationship between the input and the output changes (e.g., consumer behavior changes after a global pandemic).

Don’t just log the prediction; log the features that led to the prediction. Use a background task to ship those records to a log store such as an ELK stack, and expose aggregate metrics for Prometheus to scrape. Prometheus is built for numeric time series, not per-request payloads, but its histograms are a good fit for tracking the distribution of your prediction values. If the mean of your predictions moves by more than two standard deviations over an hour, you need an alert.

from prometheus_client import Histogram, Counter

PREDICTION_VALUE = Histogram(
    'model_prediction_output', 
    'Distribution of model predictions',
    buckets=[0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
)
PREDICTION_COUNTER = Counter('model_requests_total', 'Total requests to the model')

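def predict(data):
    # model is assumed to be loaded at startup; predict() returns one score per row
    predictions = model.predict(data)
    for value in predictions:
        PREDICTION_VALUE.observe(float(value))  # observe() takes a single number
    PREDICTION_COUNTER.inc()
    return predictions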

But Prometheus isn’t enough for deep statistical analysis. You need a system that compares the serving distribution to the training distribution using something like the Kolmogorov-Smirnov test. If you’re not doing this, you’re just guessing. I’ve seen a model’s accuracy drop from 90% to 45% over a weekend because a frontend change started sending a country_code as “US” instead of “United States,” and the model had only been trained on the latter. The system was “healthy” according to every dashboard, but the business was losing money.
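
To make the comparison concrete, here is a hand-rolled sketch of the two-sample Kolmogorov-Smirnov statistic using only the standard library. The sample values and the 0.3 alert threshold are arbitrary assumptions; in practice a library routine such as scipy.stats.ks_2samp also gives you a p-value.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref = sorted(reference)
    cur = sorted(current)
    if not ref or not cur:
        raise ValueError("both samples must be non-empty")
    max_gap = 0.0
    # The gap between two step functions can only peak at an observed value.
    for x in set(ref) | set(cur):
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


training_sample = [0.1, 0.2, 0.3, 0.4, 0.5]
serving_same = [0.1, 0.2, 0.3, 0.4, 0.5]
serving_shifted = [0.6, 0.7, 0.8, 0.9, 1.0]

ALERT_THRESHOLD = 0.3  # hypothetical; tune it per feature
no_drift = ks_statistic(training_sample, serving_same)
full_drift = ks_statistic(training_sample, serving_shifted)
should_alert = full_drift > ALERT_THRESHOLD
```

Run this per feature on a schedule against a frozen sample of the training data. For categorical features like country_code, a plain check for unseen values catches that incident far sooner than any statistical test.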

The GPU Tax: Why Your Infrastructure is Crying

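If you’re running machine learning on GPUs in Kubernetes, you’ve entered a specific kind of YAML-hell. First, there’s the driver versioning. Your host OS needs an NVIDIA driver new enough for the CUDA Toolkit version in your Docker image, and your Python library (like torch) needs to be built against that CUDA version. If they don’t line up, you get errors such as “CUDA driver version is insufficient for CUDA runtime version” or “no kernel image is available for execution on the device.” The more familiar RuntimeError: CUDA error: device-side assert triggered and CUDA out of memory usually point at your code or your memory budget rather than at a version mismatch, so don’t let them send you down the wrong path.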

GPU memory is not like CPU memory. It doesn’t swap. When it’s full, it’s full. And Python’s memory management doesn’t help here. PyTorch and TensorFlow both use internal memory allocators that reserve large chunks of VRAM upfront. This makes monitoring “actual” usage difficult. You need the nvidia-device-plugin for K8s to even expose GPUs as resources. And please, use resources.limits.nvidia.com/gpu: 1. Do not try to share GPUs between pods unless you are using NVIDIA’s Multi-Instance GPU (MIG) technology. Trying to do software-level GPU sharing is a recipe for non-deterministic latency and random crashes.

Also, consider the cost. A single p3.2xlarge instance on AWS costs about $3/hour on demand. If you have a cluster of 10 of these running 24/7 for an inference service that only gets 5 requests per second, you are burning over $20,000 a month. Look into “Serverless” GPU options or, better yet, optimize your model to run on CPUs using OpenVINO or ONNX Runtime. You’d be surprised how often a well-optimized, quantized (INT8) model running on CPUs can match or even outperform an FP32 model on a GPU for inference tasks.
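
Putting the numbers from that example into a unit your finance team understands is a two-line calculation. A sketch, using the rounded $3/hour figure and a 30-day month:

```python
def cost_per_thousand_requests(instances, hourly_rate_usd, requests_per_second):
    """Infrastructure cost attributed to every 1,000 requests served."""
    hourly_cost = instances * hourly_rate_usd
    requests_per_hour = requests_per_second * 3600
    return hourly_cost / requests_per_hour * 1000


def monthly_cost(instances, hourly_rate_usd, hours=24 * 30):
    """Cost of running the fleet around the clock for a 30-day month."""
    return instances * hourly_rate_usd * hours


per_thousand = cost_per_thousand_requests(10, 3.0, 5)   # about 1.67 USD
per_month = monthly_cost(10, 3.0)                       # 21600.0 USD
```

If a CPU fleet can serve the same traffic, rerun the same two functions with its instance count and rate and compare.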

The “Real World” Gotcha: The Cold Start and the Health Check

Here is a mistake I see every senior engineer make at least once when they move into machine learning. They set up a standard Kubernetes livenessProbe and readinessProbe. The readinessProbe hits /health, which returns 200 OK. But the model takes 45 seconds to load from S3 into memory. The pod starts, the API is “up,” but the first 10 requests fail because the model_object is still None.

Or worse: the livenessProbe is too aggressive. If the model is performing a heavy inference that takes 2 seconds, and your livenessProbe timeout is 1 second, Kubernetes will think the pod is dead and kill it while it’s working. This leads to a death spiral where pods are killed, restarted, load the model (taking 45 seconds), handle one request, and get killed again.

# Kubernetes probe configuration for a heavy ML model
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60  # Give the model time to load
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 90
  timeoutSeconds: 5        # Don't be too aggressive
  failureThreshold: 3

And then there’s the “Zombie Process” issue. If you’re using multiprocessing for data loading (common in PyTorch DataLoaders), and your main process crashes, sometimes the child processes don’t die. Strictly speaking these are orphans rather than zombies: they are still alive, stay in memory, and hold onto GPU handles. Eventually, your node is full of stray processes, and no new pods can start. Always use a proper init system like tini as PID 1 in your Docker containers so that signals are forwarded to children and dead processes are reaped.

CI/CD for ML: Shadow Deployments are Mandatory

You cannot “unit test” a machine learning model’s accuracy. You can test the code that calls the model, but the model’s behavior is probabilistic. This is why Shadow Deployments (or “Dark Launches”) are non-negotiable for machine learning. When you have a new model, you don’t replace the old one. You deploy the new model alongside the old one. Your API sends the request to both models, returns the old model’s result to the user, but logs the new model’s result to a database.

After a week, you compare the results. Did the new model predict “fraud” on the same cases? Was its latency acceptable? Did it crash on edge cases the old model handled? Only after you’ve analyzed the “shadow” data do you flip the switch. This is the only way to sleep at night when you’re deploying machine learning to a system that handles real money.
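
In code, the core of a shadow deployment is small. A minimal sketch, assuming both models are plain callables and that record is whatever writes to your comparison store; all names here are hypothetical.

```python
import logging
import time

logger = logging.getLogger("shadow")


def predict_with_shadow(features, primary, shadow, record):
    """Serve the primary model's answer; record the shadow's answer on the side."""
    result = primary(features)
    try:
        started = time.perf_counter()
        shadow_result = shadow(features)
        record({
            "features": features,
            "primary": result,
            "shadow": shadow_result,
            "shadow_latency_s": time.perf_counter() - started,
        })
    except Exception:
        # A failing shadow must never affect the user-facing response.
        logger.exception("shadow model failed")
    return result


comparison_log = []
old_model = lambda features: "legit"
new_model = lambda features: "fraud"
served = predict_with_shadow({"amount": 10}, old_model, new_model, comparison_log.append)
```

In a real service you would run the shadow call off the request path, for example in a background task, so its latency never reaches the user.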

I once skipped this step for a “minor” update to a recommendation engine. We thought it was a safe change. It turned out the new model had a bias toward suggesting high-margin items that were out of stock. We didn’t catch it in offline testing because our test dataset only included in-stock items. In production, the conversion rate plummeted. If we had run it in shadow mode, we would have seen the discrepancy in five minutes.

Quantization and the Fallacy of Precision

Data scientists love 32-bit floats, and some reach for 64-bit. They want maximum precision. In production, that precision is usually wasted space. Most models can be quantized from FP32 to 16-bit (FP16) or even 8-bit (INT8) with negligible loss in accuracy. Relative to FP32, this reduces your memory footprint by 50-75% and speeds up inference significantly on modern hardware.

But quantization isn’t a “free lunch.” If you quantize a model without “quantization-aware training,” you might introduce weird artifacts. For example, a model that predicts a probability might suddenly only output values like 0.0, 0.2, 0.4, etc., because the underlying weights don’t have the resolution to represent the nuances. You need to validate the quantized model against a “Golden Dataset” before you even think about pushing it to a staging environment.

# Simple example of post-training quantization with PyTorch
import torch

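# Load your FP32 model (MyModel is a placeholder for your own nn.Module subclass)
model_fp32 = MyModel()
# weights_only=True restricts unpickling to tensors and plain containers
model_fp32.load_state_dict(torch.load("model.pth", map_location="cpu", weights_only=True))
model_fp32.eval()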

# Quantize to INT8
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, 
    {torch.nn.Linear}, 
    dtype=torch.qint8
)

# Save the much smaller model
torch.save(model_int8.state_dict(), "model_int8.pth")

The difference in file size can be the difference between a 500MB model and a 120MB model. That’s less time pulling from S3, less time in the readinessProbe, and more pods you can fit on a single node. In the world of machine learning, efficiency is a feature, not an afterthought.
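
The “Golden Dataset” gate mentioned earlier does not need to be elaborate. A sketch that compares the scores of the original and quantized models on the same inputs; the tolerances and the score lists are made-up examples.

```python
def compare_on_golden_set(reference_scores, quantized_scores, decision_threshold=0.5):
    """Summarize how far the quantized model's scores stray from the reference."""
    if len(reference_scores) != len(quantized_scores) or not reference_scores:
        raise ValueError("score lists must be non-empty and the same length")
    pairs = list(zip(reference_scores, quantized_scores))
    max_abs_diff = max(abs(r - q) for r, q in pairs)
    # Count the cases where both models land on the same side of the threshold.
    agree = sum((r >= decision_threshold) == (q >= decision_threshold) for r, q in pairs)
    return {"max_abs_diff": max_abs_diff, "decision_agreement": agree / len(pairs)}


def passes_gate(report, max_diff=0.05, min_agreement=0.99):
    """Hypothetical release gate; pick tolerances that match your business risk."""
    return report["max_abs_diff"] <= max_diff and report["decision_agreement"] >= min_agreement


fp32_scores = [0.10, 0.48, 0.52, 0.91]
int8_scores = [0.11, 0.47, 0.49, 0.90]
report = compare_on_golden_set(fp32_scores, int8_scores)
release_ok = passes_gate(report)
```

Here the scores never differ by more than about 0.03, yet one of four decisions flips because it sat right at the threshold. That is the kind of artifact a score-only comparison would miss.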

The Real World: Handling “Out of Distribution” Inputs

Your model is a black box that expects a specific shape of data. What happens when it gets something else? Most ML code just crashes. A NaN in a feature vector can propagate through a neural network until the final output is NaN. If your API then tries to cast that NaN to an integer or use it in a database query, everything breaks.

You need a “Sanity Layer” before the model. This layer checks for:

  • Missing values (and fills them with a safe default).
  • Extreme outliers (e.g., an “age” feature of 10,000).
  • Invalid categorical values (e.g., a “country” code of “ZZ”).

Don’t trust the upstream service to send clean data. They won’t. They’ll change their schema, their validation logic will fail, or a bug in the frontend will start sending strings instead of floats. Your inference service must be defensive. Log the “Out of Distribution” (OOD) events. If 10% of your traffic is OOD, your model is essentially guessing, and you need to alert the team.
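
A sanity layer along those lines might look like the following sketch. The schema, the defaults, and the choice to clip outliers rather than reject the request are all assumptions you should replace with your own.

```python
import math

# Hypothetical schema: defaults, allowed ranges, and known categories.
NUMERIC_FEATURES = {"age": {"default": 35.0, "min": 0.0, "max": 120.0}}
CATEGORICAL_FEATURES = {"country": {"default": "UNKNOWN", "allowed": {"US", "GB", "DE"}}}


def sanitize(raw):
    """Return (clean_features, issues); issues lists every OOD event found."""
    clean, issues = {}, []
    for name, spec in NUMERIC_FEATURES.items():
        value = raw.get(name)
        try:
            value = float(value)
        except (TypeError, ValueError):
            value = math.nan
        if math.isnan(value):
            issues.append(f"{name}: missing or non-numeric")
            value = spec["default"]
        elif not spec["min"] <= value <= spec["max"]:
            issues.append(f"{name}: out of range")
            value = min(max(value, spec["min"]), spec["max"])
        clean[name] = value
    for name, spec in CATEGORICAL_FEATURES.items():
        value = raw.get(name)
        if not isinstance(value, str) or value not in spec["allowed"]:
            issues.append(f"{name}: unknown category")
            value = spec["default"]
        clean[name] = value
    return clean, issues


clean, issues = sanitize({"age": 10000, "country": "ZZ"})
```

Feed the length of issues into a counter and you have the OOD rate to alert on. For some features, rejecting the request outright is safer than silently substituting a default.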

Stop treating machine learning like a magical black box and start treating it like what it actually is: a brittle, resource-hungry, non-deterministic binary that requires more monitoring than any other part of your stack. If you can’t explain how your model fails, you shouldn’t be running it in production. Period.
