{"id":4823,"date":"2026-06-24T22:44:10","date_gmt":"2026-06-24T17:14:10","guid":{"rendered":"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-guide-2\/"},"modified":"2026-06-24T22:44:10","modified_gmt":"2026-06-24T17:14:10","slug":"machine-learning-best-practices-guide-2","status":"publish","type":"post","link":"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-guide-2\/","title":{"rendered":"Machine Learning Best Practices &#8211; Guide"},"content":{"rendered":"<h2><span class=\"ez-toc-section\" id=\"Your_Machine_Learning_Model_is_a_Memory_Leak_Waiting_to_Happen\"><\/span>Your Machine Learning Model is a Memory Leak Waiting to Happen<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>I once spent seventy-two hours straight debugging a &#8220;ghost in the machine&#8221; that was costing a fintech client $4,200 every hour. We had just pushed a new credit-scoring model to production. 
On paper, the metrics were flawless: 98% precision, great recall, and the data scientists were high-fiving in the Slack channel. But thirty minutes after the <code>kubectl apply<\/code> finished, the nodes started screaming. The Kubelet was OOM-killing the inference pods, but the memory usage reported by the Python process didn&#8217;t explain why. We were using a standard <code>pickle<\/code> load of a 2GB Random Forest model, and for some reason, the resident set size (RSS) was ballooning to 12GB per pod.<\/p>\n<p>The culprit? A combination of copy-on-write behavior after <code>fork()<\/code> during multiprocessing (Python&#8217;s reference counting writes to every object it touches, so &#8220;shared&#8221; pages get copied anyway) and a massive feature matrix that was being duplicated in every worker process. We hadn&#8217;t accounted for the fact that the <code>gunicorn<\/code> workers were not actually sharing the pre-loaded model. I ended up rewriting the loading logic to use <code>numpy.memmap<\/code> so the workers could read the model weights directly from disk without sucking the RAM dry. It was a messy, low-level fix for a &#8220;high-level&#8221; technology. That\u2019s the reality of <b>machine learning<\/b> in production: it\u2019s 10% math and 90% fighting with Linux primitives and garbage collection.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_Environment_Why_Your_Dockerfile_is_a_Liability\"><\/span>The Environment: Why Your Dockerfile is a Liability<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Most machine learning tutorials tell you to start with <code>FROM python:3.9<\/code> or, god forbid, <code>FROM alpine<\/code>. If you use Alpine for ML, you are signing up for a world of hurt. Machine learning libraries like <code>numpy<\/code>, <code>scipy<\/code>, and <code>pandas<\/code> rely heavily on C extensions. Alpine uses <code>musl<\/code> instead of <code>glibc<\/code>. When you <code>pip install<\/code> these libraries on Alpine, you can&#8217;t use the standard <code>manylinux<\/code> wheels, and any package that doesn&#8217;t publish a <code>musllinux<\/code> wheel has to be built from source. 
Your CI\/CD pipeline will spend forty minutes compiling C++ code from source, only to fail because some obscure header file is missing. Use <code>python:3.11-slim-bookworm<\/code>. It\u2019s Debian-based, it has <code>glibc<\/code>, and the image size is small enough to not choke your container registry.<\/p>\n<p>Then there\u2019s the dependency hell. If I see one more <code>requirements.txt<\/code> with <code>scikit-learn>=1.0<\/code>, I\u2019m going to retire and become a carpenter. In ML, a minor version bump in a dependency can change the default behavior of an optimizer, silently breaking your model&#8217;s predictions without throwing a single error. You need deterministic builds. Use <code>poetry.lock<\/code> or <code>pip-compile<\/code> from <code>pip-tools<\/code>. You need to know exactly which version of <code>threadpoolctl<\/code> is running in your container.<\/p>\n<blockquote><p>\n    <strong>Pro-tip:<\/strong> Always set <code>OMP_NUM_THREADS=1<\/code> in your Dockerfile environment variables. Many ML libraries try to parallelize operations using OpenMP. 
If your container is restricted to 2 CPUs by Kubernetes but the library sees 64 cores on the host, it will spawn 64 threads, leading to massive context-switching overhead and degraded performance.\n<\/p><\/blockquote>\n<pre><code># A sane Dockerfile for ML inference\nFROM python:3.11-slim-bookworm\n\nRUN apt-get update && apt-get install -y --no-install-recommends \\\n    build-essential \\\n    libgomp1 \\\n    && rm -rf \/var\/lib\/apt\/lists\/*\n\nWORKDIR \/app\nCOPY pyproject.toml poetry.lock \/app\/\nRUN pip install --no-cache-dir poetry && \\\n    poetry config virtualenvs.create false && \\\n    poetry install --only main --no-interaction --no-ansi\n\nCOPY .\/src \/app\/src\nENV OMP_NUM_THREADS=1\nENV PYTHONUNBUFFERED=1\n\nCMD [\"gunicorn\", \"-k\", \"uvicorn.workers.UvicornWorker\", \"--workers\", \"4\", \"--bind\", \"0.0.0.0:8000\", \"src.main:app\"]\n<\/code><\/pre>\n<h2><span class=\"ez-toc-section\" id=\"The_Data_Pipeline_Versioning_is_Not_Just_for_Code\"><\/span>The Data Pipeline: Versioning is Not Just for Code<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>In traditional software, Git is enough. In machine learning, Git is a joke. You cannot commit a 50GB CSV to a repository, and you certainly shouldn&#8217;t be pulling data from an <code>S3<\/code> bucket using a &#8220;latest&#8221; tag. If you can&#8217;t recreate the exact dataset used to train model <code>v2.4.1<\/code>, you don&#8217;t have a production system; you have a science project. I\u2019ve seen teams lose weeks of work because they &#8220;cleaned&#8221; a table in Snowflake and didn&#8217;t realize it changed the distribution of a feature used by a live model.<\/p>\n<ul>\n<li><b>DVC (Data Version Control):<\/b> Treat your data like code. DVC creates metadata files that you <i>can<\/i> commit to Git, which point to specific versions of files in S3 or GCS. 
It\u2019s the only way to maintain sanity.<\/li>\n<li><b>Feature Stores:<\/b> If you\u2019re calculating &#8220;average spend in last 24 hours&#8221; in a SQL query for training and in a Python loop for inference, you\u2019ve already failed. The logic will diverge. Use a feature store like Feast or just a shared library that handles the transformation for both paths.<\/li>\n<\/ul>\n<p>The &#8220;Training-Serving Skew&#8221; is the silent killer. You train on a snapshot of a database where nulls were filled with the mean. In production, the API receives a <code>null<\/code>, and your code throws a <code>ValueError<\/code> because the production environment doesn&#8217;t have the &#8220;mean&#8221; value from three months ago cached anywhere. You must bundle your preprocessing parameters (the scalers, the encoders) with the model itself. If you use <code>scikit-learn<\/code>, use a <code>Pipeline<\/code> object. Don&#8217;t export the model and the scaler as two separate files. They are a single unit of execution.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Serialization_Pickle_is_a_Security_Risk_and_a_Versioning_Nightmare\"><\/span>Serialization: Pickle is a Security Risk and a Versioning Nightmare<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>We need to talk about <code>pickle<\/code>. It is the default way to save Python objects, and it is terrible for machine learning. First, it\u2019s insecure. Loading a pickle file can execute arbitrary code. If someone compromises your S3 bucket and replaces your model with a malicious pickle, they have RCE (Remote Code Execution) on your inference nodes. Second, pickle is tied to the Python class structure. If you rename a class in <code>src\/models\/classifier.py<\/code>, your old models won&#8217;t load anymore.<\/p>\n<p>Instead, look at <code>ONNX<\/code> (Open Neural Network Exchange). It\u2019s a cross-platform format. 
You can train a model in PyTorch and run it in a high-performance C++ runtime or even in the browser. It forces you to define your inputs and outputs strictly. If you must stay in Python-land, use <code>joblib<\/code> with <code>mmap_mode='r'<\/code> for large arrays, but be aware of the versioning constraints. For deep learning, <code>Safetensors<\/code> from the Hugging Face team is the current gold standard because it prevents the RCE risks of pickle and is incredibly fast at loading.<\/p>\n<pre><code># Example of exporting to ONNX to avoid pickle hell\nimport torch\nimport torch.onnx\n\ndef export_model(model, dummy_input, path=\"model.onnx\"):\n    model.eval()\n    torch.onnx.export(\n        model, \n        dummy_input, \n        path, \n        export_params=True, \n        opset_version=12, \n        do_constant_folding=True, \n        input_names=['input'], \n        output_names=['output'],\n        dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}}\n    )\n    print(f\"Model exported to {path}\")\n\n# In production, use onnxruntime\nimport onnxruntime as ort\nsession = ort.InferenceSession(\"model.onnx\")\nresults = session.run(None, {\"input\": input_numpy_array})\n<\/code><\/pre>\n<h2><span class=\"ez-toc-section\" id=\"The_API_Layer_FastAPI_and_the_Pydantic_Tax\"><\/span>The API Layer: FastAPI and the Pydantic Tax<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Everyone uses FastAPI now. It\u2019s great. It\u2019s fast. But people misuse it in machine learning contexts. They define these massive Pydantic models for the request body, which is fine for a CRUD app, but when you\u2019re sending a 1000-element vector as a JSON list, Pydantic\u2019s validation becomes a massive bottleneck. I\u2019ve seen Pydantic validation take longer than the actual model inference.<\/p>\n<p>If you are dealing with high-throughput machine learning, stop sending raw JSON arrays. Use Protobuf or even just a binary blob if you can. 
If you must use JSON, use <code>ujson<\/code> or <code>orjson<\/code> as the response class. Also, for the love of all that is holy, do not run your model inference directly in the <code>async def<\/code> endpoint. Most ML libraries are CPU-bound and do not play nice with Python&#8217;s <code>asyncio<\/code> loop. They will block the loop, and your &#8220;high-performance&#8221; API will handle exactly one request at a time.<\/p>\n<p>Use <code>starlette.concurrency.run_in_threadpool<\/code> or just define your endpoint with <code>def<\/code> instead of <code>async def<\/code> so FastAPI runs it in a separate thread. Better yet, use a dedicated inference server like NVIDIA Triton or TorchServe if you\u2019re at scale. They handle batching, model versioning, and GPU memory management much better than a custom FastAPI wrapper ever will.<\/p>\n<blockquote><p>\n    <strong>Note to self:<\/strong> When using FastAPI with Gunicorn, remember that <code>--workers<\/code> should usually be <code>(2 x $num_cores) + 1<\/code>, but for ML, this is often too many. ML models are heavy. If each worker loads a 4GB model, and you have 16 cores, you&#8217;ll need 132GB of RAM just for the workers. Calculate your memory overhead before you scale the worker count.\n<\/p><\/blockquote>\n<h2><span class=\"ez-toc-section\" id=\"Monitoring_200_OK_Does_Not_Mean_the_Model_is_Working\"><\/span>Monitoring: 200 OK Does Not Mean the Model is Working<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>In SRE, we care about the &#8220;Golden Signals&#8221;: Latency, Traffic, Errors, and Saturation. In machine learning, you can have perfect Golden Signals while your model is spitting out absolute garbage. This is called &#8220;Silent Failure.&#8221; Your API returns a 200 OK, the latency is a crisp 40ms, but the model is predicting that every single customer is a fraudster because the input distribution shifted.<\/p>\n<p>You need to monitor <b>Model Drift<\/b> and <b>Concept Drift<\/b>. 
Model drift, more commonly called data drift, happens when the data your model sees in production is different from the data it was trained on. Concept drift happens when the relationship between the input and the output changes (e.g., consumer behavior changes after a global pandemic).<\/p>\n<p>Don&#8217;t just log the prediction; log the features that led to the prediction. Use a background task to ship the raw features to a log store such as an ELK stack, and expose aggregate metrics for Prometheus to scrape. For Prometheus, you can use histograms to track the distribution of your prediction values. If the mean of your predictions moves by more than two standard deviations over an hour, you need an alert.<\/p>\n<pre><code>from prometheus_client import Histogram, Counter\n\nPREDICTION_VALUE = Histogram(\n    'model_prediction_output', \n    'Distribution of model predictions',\n    buckets=[0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]\n)\nPREDICTION_COUNTER = Counter('model_requests_total', 'Total requests to the model')\n\ndef predict(data):\n    # `model` is assumed to be loaded once at startup; `data` is a batch of rows\n    predictions = model.predict(data)\n    for value in predictions:\n        # observe() takes a single number, not an array\n        PREDICTION_VALUE.observe(float(value))\n    PREDICTION_COUNTER.inc()\n    return predictions\n<\/code><\/pre>\n<p>But Prometheus isn&#8217;t enough for deep statistical analysis. You need a system that compares the serving distribution to the training distribution using something like the Kolmogorov-Smirnov test. If you\u2019re not doing this, you\u2019re just guessing. I\u2019ve seen a model&#8217;s accuracy drop from 90% to 45% over a weekend because a frontend change started sending a <code>country_code<\/code> as &#8220;US&#8221; instead of &#8220;United States,&#8221; and the model had only been trained on the latter. 
The system was &#8220;healthy&#8221; according to every dashboard, but the business was losing money.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_GPU_Tax_Why_Your_Infrastructure_is_Crying\"><\/span>The GPU Tax: Why Your Infrastructure is Crying<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>If you&#8217;re running machine learning on GPUs in Kubernetes, you&#8217;ve entered a specific kind of YAML-hell. First, there\u2019s the driver versioning. Your host OS needs an NVIDIA driver new enough for the CUDA version in your image, your Docker image needs a specific version of the CUDA Toolkit, and your Python library (like <code>torch<\/code>) needs to be built for that CUDA version. If they don&#8217;t match, you get errors like <code>CUDA driver version is insufficient for CUDA runtime version<\/code> or <code>no kernel image is available for execution on the device<\/code>.<\/p>\n<p>GPU memory is not like CPU memory. It doesn&#8217;t swap. When it&#8217;s full, it\u2019s full. And Python\u2019s memory management doesn&#8217;t help here. PyTorch and TensorFlow both use internal memory allocators that reserve large chunks of VRAM upfront. This makes monitoring &#8220;actual&#8221; usage difficult. You need the <code>nvidia-device-plugin<\/code> for K8s to even expose GPUs as resources. And please, use <code>resources.limits.nvidia.com\/gpu: 1<\/code>. Do not try to share GPUs between pods unless you are using NVIDIA&#8217;s Multi-Instance GPU (MIG) technology. Trying to do software-level GPU sharing is a recipe for non-deterministic latency and random crashes.<\/p>\n<p>Also, consider the cost. A single <code>p3.2xlarge<\/code> instance on AWS costs about $3\/hour. If you have a cluster of 10 of these running 24\/7 for an inference service that only gets 5 requests per second, you are burning money. Look into &#8220;Serverless&#8221; GPU options or, better yet, optimize your model to run on CPUs using OpenVINO or ONNX Runtime. 
You\u2019d be surprised how often a well-optimized quantized model (INT8) running on CPUs can compete with an FP32 model on a GPU for inference tasks.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_%E2%80%9CReal_World%E2%80%9D_Gotcha_The_Cold_Start_and_the_Health_Check\"><\/span>The &#8220;Real World&#8221; Gotcha: The Cold Start and the Health Check<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Here is a mistake I see every senior engineer make at least once when they move into machine learning. They set up a standard Kubernetes <code>livenessProbe<\/code> and <code>readinessProbe<\/code>. The <code>readinessProbe<\/code> hits <code>\/health<\/code>, which returns 200 OK. But the model takes 45 seconds to load from S3 into memory. The pod starts, the API is &#8220;up,&#8221; but the first 10 requests fail because the <code>model_object<\/code> is still <code>None<\/code>.<\/p>\n<p>Or worse: the <code>livenessProbe<\/code> is too aggressive. If the model is performing a heavy inference that takes 2 seconds, and your <code>livenessProbe<\/code> timeout is 1 second, Kubernetes will think the pod is dead and kill it <i>while it&#8217;s working<\/i>. This leads to a death spiral where pods are killed, restarted, load the model (taking 45 seconds), handle one request, and get killed again.<\/p>\n<pre><code># Kubernetes probe configuration for a heavy ML model\nreadinessProbe:\n  httpGet:\n    path: \/health\n    port: 8000\n  initialDelaySeconds: 60  # Give the model time to load\n  periodSeconds: 10\nlivenessProbe:\n  httpGet:\n    path: \/health\n    port: 8000\n  initialDelaySeconds: 90\n  timeoutSeconds: 5        # Don't be too aggressive\n  failureThreshold: 3\n<\/code><\/pre>\n<p>And then there&#8217;s the &#8220;Zombie Process&#8221; issue. If you&#8217;re using <code>multiprocessing<\/code> for data loading (common with PyTorch&#8217;s <code>DataLoader<\/code>), and your main process crashes, sometimes the child processes don&#8217;t die. 
They stay in memory, holding onto GPU handles. Eventually, your node is full of zombie processes, and no new pods can start. Always use a proper init system like <code>tini<\/code> in your Docker containers to reap these zombies.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"CICD_for_ML_Shadow_Deployments_are_Mandatory\"><\/span>CI\/CD for ML: Shadow Deployments are Mandatory<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>You cannot &#8220;unit test&#8221; a machine learning model&#8217;s accuracy. You can test the code that calls the model, but the model&#8217;s behavior is probabilistic. This is why <b>Shadow Deployments<\/b> (or &#8220;Dark Launches&#8221;) are non-negotiable for machine learning. When you have a new model, you don&#8217;t replace the old one. You deploy the new model alongside the old one. Your API sends the request to both models, returns the old model&#8217;s result to the user, but logs the new model&#8217;s result to a database.<\/p>\n<p>After a week, you compare the results. Did the new model predict &#8220;fraud&#8221; on the same cases? Was its latency acceptable? Did it crash on edge cases the old model handled? Only after you&#8217;ve analyzed the &#8220;shadow&#8221; data do you flip the switch. This is the only way to sleep at night when you&#8217;re deploying <b>machine learning<\/b> to a system that handles real money.<\/p>\n<p>I once skipped this step for a &#8220;minor&#8221; update to a recommendation engine. We thought it was a safe change. It turned out the new model had a bias toward suggesting high-margin items that were out of stock. We didn&#8217;t catch it in offline testing because our test dataset only included in-stock items. In production, the conversion rate plummeted. 
If we had run it in shadow mode, we would have seen the discrepancy in five minutes.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Quantization_and_the_Fallacy_of_Precision\"><\/span>Quantization and the Fallacy of Precision<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Data scientists love full-precision floats. They want maximum precision. In production, 32-bit floats (FP32) are often a waste of space. Most models can be quantized to 16-bit (FP16) or even 8-bit (INT8) with negligible loss in accuracy. This reduces your memory footprint by 50-75% relative to FP32 and speeds up inference significantly on modern hardware.<\/p>\n<p>But quantization isn&#8217;t a &#8220;free lunch.&#8221; If you quantize a model without &#8220;quantization-aware training,&#8221; you might introduce weird artifacts. For example, a model that predicts a probability might suddenly only output values like 0.0, 0.2, 0.4, etc., because the underlying weights don&#8217;t have the resolution to represent the nuances. You need to validate the quantized model against a &#8220;Golden Dataset&#8221; before you even think about pushing it to a staging environment.<\/p>\n<pre><code># Simple example of post-training quantization with PyTorch\nimport torch\n\n# Load your FP32 model (MyModel is a placeholder for your own torch.nn.Module)\nmodel_fp32 = MyModel()\nmodel_fp32.load_state_dict(torch.load(\"model.pth\", map_location=\"cpu\"))\nmodel_fp32.eval()\n\n# Dynamically quantize the weights of the Linear layers to INT8\nmodel_int8 = torch.quantization.quantize_dynamic(\n    model_fp32, \n    {torch.nn.Linear}, \n    dtype=torch.qint8\n)\n\n# Save the much smaller model\ntorch.save(model_int8.state_dict(), \"model_int8.pth\")\n<\/code><\/pre>\n<p>The difference in file size can be the difference between a 500MB model and a 120MB model. That&#8217;s less time pulling from S3, less time in the <code>readinessProbe<\/code>, and more pods you can fit on a single node. 
In the world of <b>machine learning<\/b>, efficiency is a feature, not an afterthought.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_Real_World_Handling_%E2%80%9COut_of_Distribution%E2%80%9D_Inputs\"><\/span>The Real World: Handling &#8220;Out of Distribution&#8221; Inputs<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Your model is a black box that expects a specific shape of data. What happens when it gets something else? Most ML code just crashes. A <code>NaN<\/code> in a feature vector can propagate through a neural network until the final output is <code>NaN<\/code>. If your API then tries to cast that <code>NaN<\/code> to an integer or use it in a database query, everything breaks.<\/p>\n<p>You need a &#8220;Sanity Layer&#8221; before the model. This layer checks for:<\/p>\n<ul>\n<li>Missing values (and fills them with a safe default).<\/li>\n<li>Extreme outliers (e.g., an &#8220;age&#8221; feature of 10,000).<\/li>\n<li>Invalid categorical values (e.g., a &#8220;country&#8221; code of &#8220;ZZ&#8221;).<\/li>\n<\/ul>\n<p>Don&#8217;t trust the upstream service to send clean data. They won&#8217;t. They&#8217;ll change their schema, their validation logic will fail, or a bug in the frontend will start sending strings instead of floats. Your inference service must be defensive. Log the &#8220;Out of Distribution&#8221; (OOD) events. If 10% of your traffic is OOD, your model is essentially guessing, and you need to alert the team.<\/p>\n<p>Stop treating machine learning like a magical black box and start treating it like what it actually is: a brittle, resource-hungry, non-deterministic binary that requires more monitoring than any other part of your stack. If you can&#8217;t explain how your model fails, you shouldn&#8217;t be running it in production. 
Period.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Related_Articles\"><\/span>Related Articles<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Explore more insights and best practices:<\/p>\n<ul>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/10-devops-best-practices-to-streamline-your-workflow\/\">10 Devops Best Practices To Streamline Your Workflow<\/a><\/li>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/what-is-a-docker-container-a-complete-guide-for-beginners\/\">What Is A Docker Container A Complete Guide For Beginners<\/a><\/li>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/what-is-devops-definition-benefits-and-best-practices\/\">What Is Devops Definition Benefits And Best Practices<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Your Machine Learning Model is a Memory Leak Waiting to Happen I once spent seventy-two hours straight debugging a &#8220;ghost in the machine&#8221; that was costing a fintech client $4,200 every hour. We had just pushed a new credit-scoring model to production. 
On paper, the metrics were flawless\u201498% precision, great recall, and the data scientists &#8230; <a title=\"Machine Learning Best Practices &#8211; Guide\" class=\"read-more\" href=\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-guide-2\/\" aria-label=\"Read more  on Machine Learning Best Practices &#8211; Guide\">Read more<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-4823","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.0 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Machine Learning Best Practices - Guide - ITSupportWale<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-guide-2\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Machine Learning Best Practices - Guide - ITSupportWale\" \/>\n<meta property=\"og:description\" content=\"Your Machine Learning Model is a Memory Leak Waiting to Happen I once spent seventy-two hours straight debugging a &#8220;ghost in the machine&#8221; that was costing a fintech client $4,200 every hour. We had just pushed a new credit-scoring model to production. On paper, the metrics were flawless\u201498% precision, great recall, and the data scientists ... 
Read more\" \/>\n<meta property=\"og:url\" content=\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-guide-2\/\" \/>\n<meta property=\"og:site_name\" content=\"ITSupportWale\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\" \/>\n<meta property=\"article:published_time\" content=\"2026-06-24T17:14:10+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2021\/05\/android-chrome-512x512-1.png\" \/>\n\t<meta property=\"og:image:width\" content=\"512\" \/>\n\t<meta property=\"og:image:height\" content=\"512\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Techie\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Techie\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"14 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-guide-2\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-guide-2\/\"},\"author\":{\"name\":\"Techie\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\"},\"headline\":\"Machine Learning Best Practices &#8211; 
Guide\",\"datePublished\":\"2026-06-24T17:14:10+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-guide-2\/\"},\"wordCount\":2552,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-guide-2\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-guide-2\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-guide-2\/\",\"name\":\"Machine Learning Best Practices - Guide - ITSupportWale\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\"},\"datePublished\":\"2026-06-24T17:14:10+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-guide-2\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-guide-2\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-guide-2\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/itsupportwale.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Machine Learning Best Practices &#8211; Guide\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"name\":\"ITSupportWale\",\"description\":\"Tips, Tricks, Fixed-Errors, Tutorials &amp; 
Guides\",\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/itsupportwale.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\",\"name\":\"itsupportwale\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png\",\"contentUrl\":\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png\",\"width\":1119,\"height\":144,\"caption\":\"itsupportwale\"},\"image\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\",\"name\":\"Techie\",\"sameAs\":[\"https:\/\/itsupportwale.com\",\"iswblogadmin\"],\"url\":\"https:\/\/itsupportwale.com\/blog\/author\/iswblogadmin\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Machine Learning Best Practices - Guide - ITSupportWale","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-guide-2\/","og_locale":"en_US","og_type":"article","og_title":"Machine Learning Best Practices - Guide - ITSupportWale","og_description":"Your Machine Learning Model is a Memory Leak Waiting to Happen I once spent seventy-two hours straight debugging a &#8220;ghost in the machine&#8221; that was costing a fintech client $4,200 every hour. We had just pushed a new credit-scoring model to production. On paper, the metrics were flawless\u201498% precision, great recall, and the data scientists ... Read more","og_url":"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-guide-2\/","og_site_name":"ITSupportWale","article_publisher":"https:\/\/www.facebook.com\/Itsupportwale-298547177495978","article_published_time":"2026-06-24T17:14:10+00:00","og_image":[{"width":512,"height":512,"url":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2021\/05\/android-chrome-512x512-1.png","type":"image\/png"}],"author":"Techie","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Techie","Est. 
reading time":"14 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-guide-2\/#article","isPartOf":{"@id":"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-guide-2\/"},"author":{"name":"Techie","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d"},"headline":"Machine Learning Best Practices &#8211; Guide","datePublished":"2026-06-24T17:14:10+00:00","mainEntityOfPage":{"@id":"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-guide-2\/"},"wordCount":2552,"commentCount":0,"publisher":{"@id":"https:\/\/itsupportwale.com\/blog\/#organization"},"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-guide-2\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-guide-2\/","url":"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-guide-2\/","name":"Machine Learning Best Practices - Guide - ITSupportWale","isPartOf":{"@id":"https:\/\/itsupportwale.com\/blog\/#website"},"datePublished":"2026-06-24T17:14:10+00:00","breadcrumb":{"@id":"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-guide-2\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-guide-2\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-guide-2\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/itsupportwale.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Machine Learning Best Practices &#8211; 
Guide"}]},{"@type":"WebSite","@id":"https:\/\/itsupportwale.com\/blog\/#website","url":"https:\/\/itsupportwale.com\/blog\/","name":"ITSupportWale","description":"Tips, Tricks, Fixed-Errors, Tutorials &amp; Guides","publisher":{"@id":"https:\/\/itsupportwale.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/itsupportwale.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/itsupportwale.com\/blog\/#organization","name":"itsupportwale","url":"https:\/\/itsupportwale.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png","contentUrl":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png","width":1119,"height":144,"caption":"itsupportwale"},"image":{"@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/Itsupportwale-298547177495978"]},{"@type":"Person","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d","name":"Techie","sameAs":["https:\/\/itsupportwale.com","iswblogadmin"],"url":"https:\/\/itsupportwale.com\/blog\/author\/iswblogadmin\/"}]}},"_links":{"self":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts\/4823","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\
/wp\/v2\/comments?post=4823"}],"version-history":[{"count":0,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts\/4823\/revisions"}],"wp:attachment":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/media?parent=4823"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/categories?post=4823"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/tags?post=4823"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}