{"id":4787,"date":"2026-05-12T22:49:42","date_gmt":"2026-05-12T17:19:42","guid":{"rendered":"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-a-guide-to-success\/"},"modified":"2026-05-12T22:49:42","modified_gmt":"2026-05-12T17:19:42","slug":"machine-learning-best-practices-a-guide-to-success","status":"publish","type":"post","link":"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-a-guide-to-success\/","title":{"rendered":"Machine Learning Best Practices: A Guide to Success"},"content":{"rendered":"<p>[2023-10-14 03:14:22.891] ERROR: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.50 GiB (GPU 0; 15.78 GiB total capacity; 11.20 GiB already allocated; 2.45 GiB free; 12.10 GiB reserved in total by PyTorch) If reserved memory is &gt;&gt; allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF<br \/>\n[2023-10-14 03:14:22.892] TRACEBACK:<br \/>\n  File &#8220;\/opt\/venv\/lib\/python3.10\/site-packages\/torch\/nn\/modules\/module.py&#8221;, line 1501, in _call_impl<br \/>\n    return forward_call(*args, **kwargs)<br \/>\n  File &#8220;\/app\/models\/transformer_v4_final_FINAL_v2.py&#8221;, line 442, in forward<br \/>\n    x = self.attention(x)<br \/>\n  File &#8220;\/app\/utils\/dirty_hacks.py&#8221;, line 12, in attention<br \/>\n    return torch.matmul(q, k.transpose(-2, -1)) \/ math.sqrt(d_k)<br \/>\n[2023-10-14 03:14:23.004] CRITICAL: Worker process (PID 4421) exited with code 1.<br \/>\n[2023-10-14 03:14:23.005] MONITORING: Alert &#8216;Inference_Service_Down&#8217; fired. Severity: P0.<br \/>\n[2023-10-14 03:14:23.005] LOG: Attempting to dump local variables to \/tmp\/crash_dump_031422.json&#8230;<br \/>\n[2023-10-14 03:14:23.006] ERROR: [Errno 28] No space left on device: &#8216;\/tmp\/crash_dump_031422.json&#8217;<\/p>\n<p>I didn&#8217;t get a call at 3:14 AM because the system was &#8220;intelligent.&#8221; I got a call because the system was a bloated, unmonitored corpse of a project that finally stopped twitching.<\/p>\n<p>For six months, the &#8220;Research Team&#8221; had been patting themselves on the back for their &#8220;machine learning&#8221; breakthroughs. They had a Jupyter Notebook that produced a beautiful ROC curve. They had a slide deck that promised 99.2% accuracy on fraud detection. What they didn&#8217;t have was a single line of production-ready code, a stable environment, or any understanding of how data actually moves through a network.<\/p>\n<p>This is the post-mortem of Project Sentinel. It&#8217;s a story about how &#8220;machine learning&#8221; is 10% math and 90% plumbing&#8212;and how, if you ignore the plumbing, you end up drowning in technical debt and broken CUDA kernels.<\/p>\n<h2>The Dependency Hell of Python 3.8.10<\/h2>\n<p>The nightmare started with the environment. Or rather, the lack of one. When I was handed the repository, the <code>README.md<\/code> was a single sentence: &#8220;Run the notebook.&#8221; There was no <code>requirements.txt<\/code>, no <code>pyproject.toml<\/code>, and certainly no <code>Dockerfile<\/code>.<\/p>\n<p>I spent the first forty-eight hours trying to reconstruct the environment. The researchers had been using a mix of <code>conda<\/code> and <code>pip<\/code> on their local MacBooks, installing packages at random. When I finally managed to extract a <code>pip freeze<\/code> from one of their machines, it looked like a suicide note.<\/p>\n<pre class=\"codehilite\"><code class=\"language-text\"># Partial output from the 'research_env_v1'\nnumpy==1.21.0\npandas==1.3.5\ntorch==1.10.0+cu111\ntorchvision==0.11.1+cu111\nscikit-learn==0.24.2\nscipy==1.7.3\nmatplotlib==3.4.3\n# Why is this here?\ntensorflow-gpu==2.6.0\n# And this?\nprotobuf==3.19.1\n# This version of urllib3 has a known CVE\nurllib3==1.26.7\n<\/code><\/pre>\n<p>The first thing I did was burn it down. You cannot build a reliable &#8220;machine learning&#8221; pipeline on Python 3.8.10 in 2023. We migrated to Python 3.11.4. Why? Because the performance improvements in the runtime are non-negotiable when you\u2019re processing 50,000 events per second. <\/p>\n<p>We moved to PyTorch 2.1.0 to leverage the <code>torch.compile<\/code> feature, which promised a 20% speedup on our inference kernels. But moving versions isn&#8217;t just about changing a number in a file. It\u2019s about the cascading failures of every sub-dependency. <code>protobuf 3.20.3<\/code> doesn&#8217;t like <code>tensorflow 2.14.0<\/code> unless you pin the specific C++ implementation. 
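<\/p>
<p>Pinning only helps if drift is caught before it reaches production. One cheap enforcement trick is a startup guard that compares the running interpreter&#8217;s packages against the lockfile and refuses to boot on any mismatch. A minimal sketch (the package list and the <code>check_pins<\/code> helper are illustrative, not our actual tooling):<\/p>

```python
import importlib.metadata

# Illustrative pins only -- in practice, generate this mapping from the lockfile.
PINS = {"torch": "2.1.0", "onnxruntime": "1.16.1", "scikit-learn": "1.3.0"}

def check_pins(pins, get_version=importlib.metadata.version):
    """Return {name: (wanted, found)} for every missing or off-pin package."""
    drift = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except importlib.metadata.PackageNotFoundError:
            found = None  # package not installed at all
        if found != wanted:
            drift[name] = (wanted, found)
    return drift

# At service startup: raise SystemExit(f"environment drift: {check_pins(PINS)}")
```

<p>Wire this into the service entrypoint so a mismatched node dies at deploy time, not at 3:14 AM.<\/p>
<p>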
<code>scikit-learn 1.3.0<\/code> changed the way it handles certain array inputs, breaking the preprocessing scripts that the researchers had &#8220;carefully&#8221; crafted.<\/p>\n<p>If you aren&#8217;t pinning your versions to the third decimal point, you aren&#8217;t doing &#8220;machine learning&#8221;; you&#8217;re playing Russian Roulette with a fully loaded cylinder.<\/p>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_80 counter-hierarchy ez-toc-counter ez-toc-grey ez-toc-container-direction\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Table of Contents<\/p>\n<label for=\"ez-toc-cssicon-toggle-item-6a0366a9c9ad6\" class=\"ez-toc-cssicon-toggle-label\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #999;color:#999\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewBox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #999;color:#999\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewBox=\"0 0 24 24\" version=\"1.2\" baseProfile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/label><input type=\"checkbox\"  id=\"ez-toc-cssicon-toggle-item-6a0366a9c9ad6\"  aria-label=\"Toggle\" \/><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-a-guide-to-success\/#Silent_Failures_in_the_Feature_Store\" >Silent Failures in the Feature 
Store<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-a-guide-to-success\/#The_Pickling_Nightmare_and_Serialization_Debt\" >The Pickling Nightmare and Serialization Debt<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-a-guide-to-success\/#Infrastructure_as_an_Afterthought_NVIDIA_Driver_53510405\" >Infrastructure as an Afterthought: NVIDIA Driver 535.104.05<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-a-guide-to-success\/#The_Logging_Void_and_the_Death_of_Reproducibility\" >The Logging Void and the Death of Reproducibility<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-a-guide-to-success\/#Post-Mortem_Remediation_The_Plumbing_Manifesto\" >Post-Mortem Remediation: The Plumbing Manifesto<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-a-guide-to-success\/#The_Nuances_of_ONNX_Exports_and_Runtime_Optimization\" >The Nuances of ONNX Exports and Runtime Optimization<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-a-guide-to-success\/#Managing_the_GPU_Driver_Nightmare\" >Managing the GPU Driver Nightmare<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-8\" 
href=\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-a-guide-to-success\/#Conclusion_The_Math_is_the_Easy_Part\" >Conclusion: The Math is the Easy Part<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-a-guide-to-success\/#Related_Articles\" >Related Articles<\/a><\/li><\/ul><\/nav><\/div>\n<h2><span class=\"ez-toc-section\" id=\"Silent_Failures_in_the_Feature_Store\"><\/span>Silent Failures in the Feature Store<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Once the environment was stable enough to actually run a script without a <code>ModuleNotFoundError<\/code>, we hit the data. This is where the &#8220;math&#8221; people usually check out. They assume the data is a static CSV file that exists in a vacuum. <\/p>\n<p>In reality, the data was a streaming mess of JSON blobs coming from a Kafka 3.6.0 cluster. The researchers had trained their model on a &#8220;cleaned&#8221; dataset. When I looked at the cleaning script, I found the smoking gun: data leakage. They were using the <code>target_variable<\/code> to calculate a rolling mean of the <code>transaction_amount<\/code> <em>before<\/em> the train-test split. <\/p>\n<p>The model wasn&#8217;t learning to detect fraud. It was learning to read the future.<\/p>\n<p>We had to implement a robust feature store using Redis 7.2.1 for low-latency lookups. Every feature had to be versioned. Every transformation had to be idempotent. 
We implemented a schema validation layer using Pydantic 2.4.2 to ensure that if a field changed from an <code>int<\/code> to a <code>float<\/code> in the upstream API, the pipeline would fail loudly and immediately rather than silently corrupting the model&#8217;s weights.<\/p>\n<p>Here is what our feature metadata log looked like after we fixed the ingestion:<\/p>\n<pre class=\"codehilite\"><code class=\"language-json\">{\n  &quot;feature_set_version&quot;: &quot;2.4.1&quot;,\n  &quot;timestamp&quot;: &quot;2023-10-14T03:10:00Z&quot;,\n  &quot;source_kafka_topic&quot;: &quot;transactions_v3&quot;,\n  &quot;schema_hash&quot;: &quot;a1b2c3d4e5f6&quot;,\n  &quot;features&quot;: [\n    {&quot;name&quot;: &quot;avg_amount_1h&quot;, &quot;type&quot;: &quot;float64&quot;, &quot;null_count&quot;: 0},\n    {&quot;name&quot;: &quot;geo_velocity&quot;, &quot;type&quot;: &quot;float64&quot;, &quot;null_count&quot;: 12},\n    {&quot;name&quot;: &quot;is_vpn_proxy&quot;, &quot;type&quot;: &quot;bool&quot;, &quot;null_count&quot;: 0}\n  ],\n  &quot;validation_status&quot;: &quot;SUCCESS&quot;\n}\n<\/code><\/pre>\n<p>Without this metadata, you are flying blind. If your &#8220;machine learning&#8221; model starts acting up, the first question isn&#8217;t &#8220;did the weights drift?&#8221; It&#8217;s &#8220;did the definition of &#8216;average_spend&#8217; change in the upstream SQL query?&#8221;<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_Pickling_Nightmare_and_Serialization_Debt\"><\/span>The Pickling Nightmare and Serialization Debt<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The researchers loved <code>pickle<\/code>. They pickled their models, they pickled their scalers, they even pickled their custom dictionary of hyperparameters. <\/p>\n<p><code>pickle<\/code> is a security disaster and a compatibility nightmare. 
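<\/p>
<p>The compatibility problem is visible in the bytes themselves: a pickle stores a <em>reference<\/em> to the defining module and class name, not the class&#8217;s code. A quick demonstration (the <code>FraudScaler<\/code> class is a hypothetical stand-in for the researchers&#8217; pickled scaler):<\/p>

```python
import pickle

class FraudScaler:  # hypothetical stand-in for the pickled scaler
    def __init__(self):
        self.mean_, self.std_ = 0.0, 1.0

blob = pickle.dumps(FraudScaler())

# The defining module and class name are embedded in the payload as strings.
# pickle.loads() re-imports that exact path; rename or move the class and
# deserialization raises instead of returning your model.
assert b"FraudScaler" in blob
```

<p>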
When we tried to move the model from the training environment (Python 3.11.4) to the inference environment (a slimmed-down Debian Bookworm image), the <code>unpickle<\/code> operation failed. Why? Because one of the custom classes in the model architecture had been moved from <code>utils\/models.py<\/code> to <code>core\/arch.py<\/code>. <\/p>\n<p><code>pickle<\/code> doesn&#8217;t store the code; it stores a reference to the module path. If you change your folder structure, your model is a brick.<\/p>\n<p>We spent three weeks refactoring the entire export process to use ONNX (Open Neural Network Exchange). We used <code>torch.onnx.export<\/code> to convert the PyTorch 2.1.0 models into a format that could be run by <code>onnxruntime 1.16.1<\/code>. This decoupled the model from the Python runtime entirely. <\/p>\n<p>The &#8220;math&#8221; didn&#8217;t change. The weights were the same. But the plumbing changed from a fragile, path-dependent mess to a portable, high-performance artifact. We could now run inference in a C++ environment if we wanted to, bypassing the Python Global Interpreter Lock (GIL) and saving us 15ms of latency per request.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Infrastructure_as_an_Afterthought_NVIDIA_Driver_53510405\"><\/span>Infrastructure as an Afterthought: NVIDIA Driver 535.104.05<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>You haven&#8217;t known pain until you&#8217;ve debugged a CUDA kernel panic at 4:00 AM. <\/p>\n<p>The Research Team had developed the model on local RTX 3090s. Production was running on Tesla T4s in the cloud. They assumed that because it was &#8220;all NVIDIA,&#8221; it would just work. It didn&#8217;t. <\/p>\n<p>The production nodes were running NVIDIA Driver 535.104.05 with CUDA 12.2. The model had been compiled against CUDA 11.8. 
Usually, there&#8217;s backward compatibility, but a specific optimization in the attention mechanism\u2014a custom Triton kernel\u2014was throwing an <code>invalid device function<\/code> error.<\/p>\n<p>We had to standardize the entire stack. Every developer, every CI runner, and every production node had to be synchronized. We moved to a base Docker image: <code>nvidia\/cuda:12.2.0-base-ubuntu22.04<\/code>. <\/p>\n<p>We also had to deal with the &#8220;OOM&#8221; (Out of Memory) errors. The researchers had set the batch size to 512 because &#8220;it worked on their 24GB cards.&#8221; The T4s only have 16GB. Instead of just lowering the batch size and killing throughput, we had to implement gradient accumulation and mixed-precision training using <code>torch.cuda.amp<\/code>. <\/p>\n<p>This is the reality of &#8220;machine learning&#8221;. It\u2019s not about the elegance of the loss function; it\u2019s about whether your <code>LD_LIBRARY_PATH<\/code> is correctly pointing to <code>libcudnn.so.8<\/code>.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_Logging_Void_and_the_Death_of_Reproducibility\"><\/span>The Logging Void and the Death of Reproducibility<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>When the system crashed (see the log at the top of this post), I went to check the logs. There were none. Or rather, there were millions of lines of <code>print(df.shape)<\/code> and <code>print(\"here1\")<\/code> scattered across various <code>stdout<\/code> streams, but nothing useful.<\/p>\n<p>No one was tracking the hyperparameters. No one was tracking the data version. No one was tracking the system metrics.<\/p>\n<p>We implemented a three-tier logging strategy.<br \/>\n1. <strong>Application Logs:<\/strong> Using the standard Python <code>logging<\/code> module, configured to output JSON for ingestion by an ELK stack.<br \/>\n2. <strong>Experiment Tracking:<\/strong> Using MLflow 2.8.1. 
Every training run was required to log its git commit hash, its dataset hash, and its specific version of <code>scikit-learn 1.3.0<\/code>.<br \/>\n3. <strong>System Metrics:<\/strong> Using Prometheus and Grafana. We exported custom metrics using the <code>prometheus_client 0.17.1<\/code>.<\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\"># prometheus_rules.yml\ngroups:\n  - name: ml_model_alerts\n    rules:\n      - alert: HighInferenceLatency\n        expr: histogram_quantile(0.99, sum(rate(inference_latency_seconds_bucket[5m])) by (le)) &gt; 0.5\n        for: 2m\n        labels:\n          severity: warning\n        annotations:\n          summary: &quot;Inference latency is too high on {{ $labels.instance }}&quot;\n      - alert: PredictionDrift\n        expr: model_prediction_drift_score &gt; 0.15\n        for: 10m\n        labels:\n          severity: critical\n<\/code><\/pre>\n<p>If you can&#8217;t reproduce a result from six months ago using the same code and the same data, you aren&#8217;t doing science. You&#8217;re doing alchemy. And alchemy doesn&#8217;t belong in a production financial system.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Post-Mortem_Remediation_The_Plumbing_Manifesto\"><\/span>Post-Mortem Remediation: The Plumbing Manifesto<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>After the &#8220;Great Crash of October,&#8221; I was given carte blanche to fix the department. I didn&#8217;t hire more researchers. I hired two Site Reliability Engineers (SREs) who knew their way around a Linux kernel and a network switch. <\/p>\n<p>We established the &#8220;Plumbing Manifesto&#8221; for all &#8220;machine learning&#8221; projects:<\/p>\n<ol>\n<li><strong>Isolation is Mandatory:<\/strong> No more local development. 
Everything happens inside a devcontainer that mirrors the production environment (Python 3.11.4, CUDA 12.2).<\/li>\n<li><strong>Validation is Continuous:<\/strong> We implemented a CI\/CD pipeline using GitHub Actions that doesn&#8217;t just run unit tests. It runs &#8220;data tests.&#8221; It checks for null values, distribution shifts, and schema violations before a single weight is updated.<\/li>\n<li><strong>Logging is a First-Class Citizen:<\/strong> If a model isn&#8217;t logging its confidence scores and its input feature hashes to a centralized store, it doesn&#8217;t get deployed.<\/li>\n<li><strong>No More Pickles:<\/strong> All models must be exported to ONNX or TorchScript. No exceptions.<\/li>\n<\/ol>\n<p>Here is the CI configuration we used to enforce this:<\/p>\n<pre class=\"codehilite\"><code class=\"language-yaml\">name: ML_Pipeline_Validation\non: [push]\njobs:\n  test:\n    runs-on: ubuntu-latest\n    container:\n      image: my-registry\/ml-base:py311-cuda12.2\n    steps:\n      - uses: actions\/checkout@v4\n      - name: Install dependencies\n        run: |\n          pip install --upgrade pip\n          pip install -r requirements.lock\n      - name: Lint with Flake8\n        run: flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics\n      - name: Run Data Validation\n        run: python scripts\/validate_data_schema.py --input data\/sample.parquet\n      - name: Test Model Export\n        run: python scripts\/test_onnx_export.py --model_path models\/latest.pt\n<\/code><\/pre>\n<p>We also had to address the &#8220;black box&#8221; problem. When the model flagged a transaction as fraud, the customer support team needed to know why. We integrated SHAP (SHapley Additive exPlanations) 0.43.0 into the inference pipeline. This added 40ms of latency, but it saved hundreds of hours in manual reviews. 
Again, this was a plumbing challenge\u2014how to calculate SHAP values in a high-throughput environment without crashing the worker nodes.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_Nuances_of_ONNX_Exports_and_Runtime_Optimization\"><\/span>The Nuances of ONNX Exports and Runtime Optimization<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Let&#8217;s talk about why we chose <code>onnxruntime 1.16.1<\/code> over just running the raw PyTorch model. When you run <code>model(input)<\/code> in PyTorch, you&#8217;re invoking a massive amount of Python overhead. For a single inference request, that might not matter. But when you&#8217;re scaling to thousands of requests per second, that overhead becomes a bottleneck.<\/p>\n<p>The ONNX export process forces you to define your input shapes. This is a good thing. It prevents the &#8220;dynamic shape&#8221; bugs that plague &#8220;machine learning&#8221; models in production. <\/p>\n<pre class=\"codehilite\"><code class=\"language-python\">import torch\nimport torch.onnx\nfrom models.sentinel import FraudModel\n\ndef export_to_onnx():\n    model = FraudModel()\n    model.load_state_dict(torch.load(&quot;weights\/best_v4.pt&quot;))\n    model.eval()\n\n    dummy_input = torch.randn(1, 128, requires_grad=True)\n\n    torch.onnx.export(\n        model,\n        dummy_input,\n        &quot;production_model.onnx&quot;,\n        export_params=True,\n        opset_version=17,\n        do_constant_folding=True,\n        input_names=['input'],\n        output_names=['output'],\n        dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}}\n    )\n\nif __name__ == &quot;__main__&quot;:\n    export_to_onnx()\n<\/code><\/pre>\n<p>By using <code>opset_version=17<\/code>, we gained access to more efficient implementations of the LayerNorm and Softmax operations. We then used the <code>onnxruntime<\/code> quantization tools to convert the model from FP32 to INT8. 
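<\/p>
<p>Why INT8 gives up so little accuracy: each weight is snapped to the nearest of 255 signed levels, so the worst-case rounding error is half of one quantization step. A toy symmetric scheme makes the bound concrete (illustrative only; not onnxruntime&#8217;s actual algorithm):<\/p>

```python
# Toy symmetric per-tensor INT8 quantization, stdlib only.
def quantize_int8(weights):
    # One scale for the whole tensor; "or 1.0" guards the all-zero case.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [level * scale for level in q]

weights = [0.02, -1.5, 0.75, 3.14]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Worst-case reconstruction error is half a quantization step (scale / 2).
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

<p>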
This reduced the model size from 450MB to 115MB and increased our throughput by 3.5x. <\/p>\n<p>The researchers complained that INT8 quantization might drop the accuracy by 0.01%. I told them that a model that is 0.01% more accurate but crashes the server is 100% useless.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Managing_the_GPU_Driver_Nightmare\"><\/span>Managing the GPU Driver Nightmare<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The final piece of the puzzle was the hardware interface. NVIDIA Driver 535.104.05 introduced several changes to how memory is managed between the host and the device. We found that our older monitoring scripts were reporting incorrect GPU utilization because they were relying on an outdated version of <code>nvidia-ml-py<\/code>.<\/p>\n<p>We had to update our monitoring stack to use <code>dcgm-exporter<\/code> (Data Center GPU Manager) to get accurate, per-process memory and temperature readings. We integrated this into our Kubernetes 1.28 cluster using the NVIDIA Device Plugin.<\/p>\n<p>This allowed us to implement &#8220;Taint and Toleration&#8221; logic. If a node&#8217;s GPU temperature exceeded 80\u00b0C, the orchestrator would stop scheduling new inference jobs to that node and drain the existing ones. This prevented the &#8220;silent throttling&#8221; that had been causing our p99 latency to spike randomly in the afternoons.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Conclusion_The_Math_is_the_Easy_Part\"><\/span>Conclusion: The Math is the Easy Part<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>&#8220;Machine learning&#8221; is a discipline that has been hijacked by people who love algorithms but hate systems. They want to talk about stochastic gradient descent and transformer architectures, but they don&#8217;t want to talk about why their <code>requirements.txt<\/code> is missing <code>scipy<\/code>.<\/p>\n<p>The project that almost cost me my sanity didn&#8217;t fail because the math was wrong. 
It failed because the plumbing was non-existent. It failed because of a <code>pickle<\/code> version mismatch, a missing <code>ldconfig<\/code> entry, and a silent data drift that no one was monitoring.<\/p>\n<p>If you want to build a &#8220;machine learning&#8221; system that actually works, stop looking at the ROC curve for five minutes and look at your logs. Check your <code>pip freeze<\/code>. Verify your CUDA version. Test your ONNX export. <\/p>\n<p>Because when the system fails at 3:00 AM, the math won&#8217;t save you. Only your plumbing will.<\/p>\n<pre class=\"codehilite\"><code class=\"language-text\">[2023-11-01 10:00:00.000] INFO: System Status: HEALTHY\n[2023-11-01 10:00:00.001] INFO: Model Version: 2.5.0-onnx-int8\n[2023-11-01 10:00:00.002] INFO: Python Version: 3.11.4\n[2023-11-01 10:00:00.003] INFO: CUDA Version: 12.2\n[2023-11-01 10:00:00.004] INFO: GPU Utilization: 42%\n[2023-11-01 10:00:00.005] INFO: Inference Latency p99: 12ms\n[2023-11-01 10:00:00.006] INFO: All systems nominal. Go back to sleep.\n<\/code><\/pre>\n<h2><span class=\"ez-toc-section\" id=\"Related_Articles\"><\/span>Related Articles<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Explore more insights and best practices:<\/p>\n<ul>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-simple-guide-to-k8s-orchestration\/\">What Is Kubernetes A Simple Guide To K8S Orchestration<\/a><\/li>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/react-best-practices-guide\/\">React Best Practices Guide<\/a><\/li>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/what-is-javascript-a-complete-beginners-guide\/\">What Is Javascript A Complete Beginners Guide<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>[2023-10-14 03:14:22.891] ERROR: torch.cuda.OutOfMemoryError: CUDA out of memory. 
Tried to allocate 14.50 GiB (GPU 0; 15.78 GiB total capacity; 11.20 GiB already allocated; 2.45 GiB free; 12.10 GiB reserved in total by PyTorch) If reserved memory is &gt;&gt; allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF [2023-10-14 03:14:22.892] &#8230; <a title=\"Machine Learning Best Practices: A Guide to Success\" class=\"read-more\" href=\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-a-guide-to-success\/\" aria-label=\"Read more  on Machine Learning Best Practices: A Guide to Success\">Read more<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-4787","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.0 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Machine Learning Best Practices: A Guide to Success - ITSupportWale<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-a-guide-to-success\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Machine Learning Best Practices: A Guide to Success - ITSupportWale\" \/>\n<meta property=\"og:description\" content=\"[2023-10-14 03:14:22.891] ERROR: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.50 GiB (GPU 0; 15.78 GiB total capacity; 11.20 GiB already allocated; 2.45 GiB free; 12.10 GiB reserved in total by PyTorch) If reserved memory is &gt;&gt; allocated memory try setting max_split_size_mb to avoid fragmentation. 
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF [2023-10-14 03:14:22.892] ... Read more\" \/>\n<meta property=\"og:url\" content=\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-a-guide-to-success\/\" \/>\n<meta property=\"og:site_name\" content=\"ITSupportWale\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\" \/>\n<meta property=\"article:published_time\" content=\"2026-05-12T17:19:42+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2021\/05\/android-chrome-512x512-1.png\" \/>\n\t<meta property=\"og:image:width\" content=\"512\" \/>\n\t<meta property=\"og:image:height\" content=\"512\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Techie\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Techie\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"12 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-a-guide-to-success\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-a-guide-to-success\/\"},\"author\":{\"name\":\"Techie\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\"},\"headline\":\"Machine Learning Best Practices: A Guide to Success\",\"datePublished\":\"2026-05-12T17:19:42+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-a-guide-to-success\/\"},\"wordCount\":1834,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-a-guide-to-success\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-a-guide-to-success\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-a-guide-to-success\/\",\"name\":\"Machine Learning Best Practices: A Guide to Success - 
ITSupportWale\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\"},\"datePublished\":\"2026-05-12T17:19:42+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-a-guide-to-success\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-a-guide-to-success\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/machine-learning-best-practices-a-guide-to-success\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/itsupportwale.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Machine Learning Best Practices: A Guide to Success\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"name\":\"ITSupportWale\",\"description\":\"Tips, Tricks, Fixed-Errors, Tutorials &amp; 
Guides\",\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/itsupportwale.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\",\"name\":\"itsupportwale\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png\",\"contentUrl\":\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png\",\"width\":1119,\"height\":144,\"caption\":\"itsupportwale\"},\"image\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\",\"name\":\"Techie\",\"sameAs\":[\"https:\/\/itsupportwale.com\",\"iswblogadmin\"],\"url\":\"https:\/\/itsupportwale.com\/blog\/author\/iswblogadmin\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","_links":{"self":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts\/4787","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-jso
n\/wp\/v2\/comments?post=4787"}],"version-history":[{"count":0,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts\/4787\/revisions"}],"wp:attachment":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/media?parent=4787"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/categories?post=4787"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/tags?post=4787"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}