{"id":4796,"date":"2026-05-23T21:36:40","date_gmt":"2026-05-23T16:06:40","guid":{"rendered":"https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/"},"modified":"2026-05-23T21:36:40","modified_gmt":"2026-05-23T16:06:40","slug":"top-machine-learning-best-practices-for-better-models","status":"publish","type":"post","link":"https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/","title":{"rendered":"Top Machine Learning Best Practices for Better Models"},"content":{"rendered":"<pre class=\"codehilite\"><code class=\"language-text\">[2024-10-14 03:14:22,891] ERROR:worker:Process 'Icarus-Inference-Engine-7' terminated with signal 9 (SIGKILL)\n[2024-10-14 03:14:22,892] CRITICAL:kernel:Out of Memory (OOM) killer invoked. Victim: python3.11 (pid 4402)\n[2024-10-14 03:14:23,004] TRACEBACK:\n  File &quot;\/app\/inference\/model_loader.py&quot;, line 142, in predict\n    features = feature_store.get_latest(user_id)\n  File &quot;\/app\/data\/feature_store.py&quot;, line 89, in get_latest\n    df = pd.read_sql(query, engine)\n  File &quot;\/usr\/local\/lib\/python3.11\/site-packages\/pandas\/io\/sql.py&quot;, line 1561, in read_sql\n    return pandas_sql.read_query(sql, index_col=index_col, params=params)\nMemoryError: Unable to allocate 14.4 GiB for an array with shape (1890221, 1024) and data type float64\n[2024-10-14 03:14:25,112] SYSTEM: IPO-Launch-Dashboard status: CRITICAL. Conversion rate: 0.00%.\n<\/code><\/pre>\n<p>I&#8217;m sitting in a dark room at 4:00 AM, staring at a flickering cursor. My eyes feel like they&#8217;ve been rubbed with sandpaper. Outside, the sun is threatening to rise on what was supposed to be the most important day in the history of this company: the day we went public. 
Instead, the S-1 filing is a joke, the bankers are pulling out, and the &#8220;Project Icarus&#8221; recommendation engine is currently eating itself alive in a Kubernetes cluster that&#8217;s hemorrhaging money.<\/p>\n<p>We didn&#8217;t just fail. We committed technical suicide.<\/p>\n<p>I was the Lead Data Scientist. I&#8217;m the one who signed off on the &#8220;optimized&#8221; weights. I&#8217;m the one who told the CTO that we could scale. I lied. Not because I wanted to, but because we had spent eighteen months shoveling technical debt into a furnace to keep the hype train moving. This isn&#8217;t a post about &#8220;learnings&#8221; or &#8220;growth mindsets.&#8221; This is an autopsy of a carcass.<\/p>\n<h2>The Dependency Hell We Ignored<\/h2>\n<p>It started with the environments. In the early days, we were &#8220;agile.&#8221; Agile is just a corporate euphemism for &#8220;we don&#8217;t have time to do it right.&#8221; Every researcher on the team was running their own local version of the stack. We had people on M1 Macs, people on Linux workstations, and one guy still trying to make Windows Subsystem for Linux work with CUDA 11.8.<\/p>\n<p>We never locked our versions. We used <code>pip install<\/code> like it was a candy dispenser. By the time we tried to containerize the model for production, the <code>requirements.txt<\/code> was a bloated mess of contradictory pins. We had <code>scikit-learn 1.4.2<\/code> trying to talk to a feature engineering script written for <code>scikit-learn 0.24<\/code>.<\/p>\n<p>When we finally pushed to the staging environment, the entire thing collapsed because <code>protobuf<\/code> decided to break backward compatibility for the tenth time that year. We &#8220;fixed&#8221; it by pinning versions at random until the errors stopped. 
We didn&#8217;t solve the problem; we just buried it under a layer of brittle hacks.<\/p>\n<pre class=\"codehilite\"><code class=\"language-text\"># The &quot;Final&quot; pip freeze that killed us\nnumpy==1.26.4\npandas==2.2.1\nscikit-learn==1.4.2\ntorch==2.2.0+cu121\ntensorflow==2.16.1\nprotobuf==4.25.3\ngrpcio==1.62.1\n# Why is this here? 
Nobody knows.\njoblib==1.3.2\n# This version conflicts with the transformer layer but we forced it anyway\ntransformers==4.38.2\n<\/code><\/pre>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_80 counter-hierarchy ez-toc-counter ez-toc-grey ez-toc-container-direction\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Table of Contents<\/p>\n<label for=\"ez-toc-cssicon-toggle-item-6a6938877719a\" class=\"ez-toc-cssicon-toggle-label\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #999;color:#999\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewBox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #999;color:#999\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewBox=\"0 0 24 24\" version=\"1.2\" baseProfile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/label><input type=\"checkbox\"  id=\"ez-toc-cssicon-toggle-item-6a6938877719a\"  aria-label=\"Toggle\" \/><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/#Feature_Store_or_Feature_Swamp\" >Feature Store or Feature Swamp?<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/#The_Reproducibility_Crisis_We_Built_a_Ghost\" 
>The Reproducibility Crisis: We Built a Ghost<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/#Data_Leakage_The_False_Prophet_of_99_Accuracy\" >Data Leakage: The False Prophet of 99% Accuracy<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/#The_Production_Meat_Grinder_OOM_Kills_and_Zombie_Processes\" >The Production Meat Grinder: OOM Kills and Zombie Processes<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/#The_IPO_Death_Spiral_Why_Technical_Debt_is_a_High-Interest_Loan\" >The IPO Death Spiral: Why Technical Debt is a High-Interest Loan<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/#The_Hyperparameter_Tuning_Rabbit_Hole\" >The Hyperparameter Tuning Rabbit Hole<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/#The_Aftermath_Shoveling_the_Wreckage\" >The Aftermath: Shoveling the Wreckage<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/#Related_Articles\" >Related Articles<\/a><\/li><\/ul><\/nav><\/div>\n<h2><span class=\"ez-toc-section\" id=\"Feature_Store_or_Feature_Swamp\"><\/span>Feature Store or Feature Swamp?<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The marketing 
team called it our &#8220;Real-Time Intelligence Layer.&#8221; Internally, we called it the Swamp. We didn&#8217;t have a feature store. We had a collection of unoptimized SQL queries and a series of CSV files sitting in an S3 bucket that were updated by &#8220;cron jobs&#8221; that failed 30% of the time.<\/p>\n<p>The biggest sin was the data leakage. We were predicting user churn. Our model had an AUC of 0.98 in training. We were heroes. The CEO was showing the charts to investors, bragging about our &#8220;predictive moat.&#8221; <\/p>\n<p>The reality? One of the junior engineers had included <code>is_active_customer<\/code> as a feature in the training set. We were literally using the answer to predict the question. Because we didn&#8217;t have a formal pipeline or a feature registry, this &#8220;leak&#8221; stayed in the code for six months. When we finally caught it and removed the feature, the AUC dropped to 0.52. We were basically flipping a coin, but the IPO roadshow had already started. We couldn&#8217;t tell the truth then. We just started &#8220;tuning&#8221; (read: over-fitting) until the numbers looked respectable again.<\/p>\n<p>We ignored the basic hygiene of machine learning. We didn&#8217;t version our data. We didn&#8217;t have a schema for our features. We just kept dumping raw JSON into a Snowflake instance and hoped the <code>dbt<\/code> models would sort it out. They didn&#8217;t. They just made the garbage more expensive to store.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_Reproducibility_Crisis_We_Built_a_Ghost\"><\/span>The Reproducibility Crisis: We Built a Ghost<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Six weeks before the IPO attempt, the &#8220;Golden Model&#8221; stopped working. This was the specific iteration of our XGBoost model (using <code>xgboost 2.0.3<\/code>) that had the best performance on the holdout set. 
We needed to retrain it on the latest data.<\/p>\n<p>But we couldn&#8217;t.<\/p>\n<p>The original researcher had left for an HFT firm in Chicago. He didn&#8217;t leave a Dockerfile. He didn&#8217;t leave a random seed. He didn&#8217;t even leave the original preprocessing script. He had done some &#8220;manual cleaning&#8221; in a Jupyter Notebook that was now a 404 error on a decommissioned server.<\/p>\n<p>We spent three weeks trying to replicate his results. We matched the hyperparameters. We matched the data splits. We even tried to match the exact version of <code>pandas<\/code> he was using, <code>1.5.3<\/code>. It didn&#8217;t matter. The model was a ghost. Every time we retrained, the weights shifted. The performance fluctuated. We were chasing a phantom because we treated our models like artisanal crafts instead of engineered products.<\/p>\n<p>In ML, if you can&#8217;t reproduce it, it doesn&#8217;t exist. We had a production environment running a binary that no one in the company knew how to recreate. That is the definition of a ticking time bomb.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Data_Leakage_The_False_Prophet_of_99_Accuracy\"><\/span>Data Leakage: The False Prophet of 99% Accuracy<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The pressure to perform for the board led to a culture of &#8220;metric hacking.&#8221; When the model didn&#8217;t hit the targets, we didn&#8217;t look for better features; we looked for ways to make the test set easier.<\/p>\n<p>We had temporal leakage that would make a freshman CS student weep. We were using future information to predict past events because our &#8220;point-in-time&#8221; joins were broken. Our <code>SQL<\/code> joins were missing the <code>WHERE event_timestamp &lt; prediction_timestamp<\/code> clause in half the queries.<\/p>\n<p>The result was a model that looked like a god in the lab and acted like a drunkard in the field. 
When we went live for the &#8220;Beta Launch&#8221; during IPO week, the model started recommending winter coats to people in Florida during a heatwave. Why? Because the training data was heavily skewed by a botched data migration from three years ago that we never bothered to clean up. We just &#8220;shoveled&#8221; more data into the model, thinking volume would compensate for quality. It&#8217;s the classic mistake: thinking that a bigger pile of trash will eventually turn into gold.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_Production_Meat_Grinder_OOM_Kills_and_Zombie_Processes\"><\/span>The Production Meat Grinder: OOM Kills and Zombie Processes<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Then came the actual deployment. We didn&#8217;t have an MLOps team. We had &#8220;DevOps&#8221; guys who hated Python and &#8220;Data Scientists&#8221; who didn&#8217;t know what a pointer was.<\/p>\n<p>We deployed our PyTorch models (version 2.2.0) inside a Flask wrapper. Flask. For a high-throughput inference engine. It was like trying to power a cruise ship with a lawnmower engine.<\/p>\n<p>The memory leaks were catastrophic. Python&#8217;s garbage collection couldn&#8217;t keep up with the massive tensors we were throwing around. We didn&#8217;t use a model server like TorchServe or NVIDIA Triton. No, we wrote a custom wrapper because we thought we were clever.<\/p>\n<p>At 3:00 AM, the OOM kills started. The system would spin up a new pod, load the 8GB model into VRAM, and handle three requests. Then memory would spike until the Linux kernel stepped in and killed the process. 
<\/p>\n<pre class=\"codehilite\"><code class=\"language-text\"># Log from the Load Balancer during the crash\n[03:14:10] GET \/v1\/predict\/user\/88219 - 200 OK (450ms)\n[03:14:12] GET \/v1\/predict\/user\/99102 - 500 Internal Server Error (12000ms)\n[03:14:15] GET \/v1\/predict\/user\/10223 - 503 Service Unavailable\n[03:14:18] WARNING: High Latency detected on all nodes.\n[03:14:20] CRITICAL: Node-4 has entered a CrashLoopBackOff state.\n<\/code><\/pre>\n<p>The CEO was screaming in the <code>#war-room<\/code> Slack channel. &#8220;Why is the site down? The bankers are watching!&#8221; I was trying to explain the difference between <code>CUDA 12.1<\/code> and <code>11.8<\/code> drivers to a man who thinks &#8220;The Cloud&#8221; is an actual cloud. He didn&#8217;t care about the technical debt. He cared about the stock price that was currently evaporating.<\/p>\n<p>We had zombie processes everywhere. We were using <code>multiprocessing<\/code> in Python to try and handle the load, but we weren&#8217;t cleaning up the child processes. Each one was holding onto a slice of the GPU memory. Eventually, the entire cluster was just a graveyard of hung processes, and no amount of <code>kubectl delete pod<\/code> could save us.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_IPO_Death_Spiral_Why_Technical_Debt_is_a_High-Interest_Loan\"><\/span>The IPO Death Spiral: Why Technical Debt is a High-Interest Loan<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The failure of Project Icarus wasn&#8217;t a single bug. It was the cumulative weight of a thousand &#8220;we&#8217;ll fix it later&#8221; decisions. <\/p>\n<p>We didn&#8217;t have unit tests for our data.<br \/>\nWe didn&#8217;t have monitoring for model drift.<br \/>\nWe didn&#8217;t have a standardized environment.<br \/>\nWe didn&#8217;t have a way to roll back a bad deployment.<\/p>\n<p>When the model started failing, we didn&#8217;t even know it was failing for the first four hours. 
Our &#8220;monitoring&#8221; was just a dashboard that checked if the HTTP status was 200. It didn&#8217;t check if the <em>predictions<\/em> made any sense. The model was returning <code>NaN<\/code> for 40% of the requests, and our system was just passing those <code>NaNs<\/code> right to the front end. The UI broke. The checkout button disappeared. The IPO was dead in the water.<\/p>\n<p>The bankers saw the &#8220;technical instability&#8221; and the &#8220;lack of scalable infrastructure&#8221; and they did what bankers do: they protected their own skin. They lowered the valuation, then they delayed the offering, and then they just stopped calling.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_Hyperparameter_Tuning_Rabbit_Hole\"><\/span>The Hyperparameter Tuning Rabbit Hole<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>In the final weeks, we became obsessed with hyperparameter tuning as a Hail Mary. We thought if we could just squeeze another 0.5% of accuracy out of the model, the &#8220;business value&#8221; would magically fix the broken infrastructure. <\/p>\n<p>We ran thousands of Optuna (version 3.6.1) trials. We burned through $50,000 of AWS credits in a weekend. We were searching for the perfect combination of <code>learning_rate<\/code>, <code>max_depth<\/code>, and <code>subsample<\/code>. <\/p>\n<p>It was a distraction. We were polishing the brass on the Titanic. It didn&#8217;t matter if the <code>learning_rate<\/code> was 0.001 or 0.0005 if the input data was corrupted by a race condition in the feature pipeline. We found a &#8220;winning&#8221; set of parameters, but when we tried to deploy them, we realized the training script had a hard-coded path to a local directory on a laptop that had been wiped. <\/p>\n<p>This is the reality of &#8220;Burnout Data Science.&#8221; You spend your time fighting the tools instead of solving the problems. 
You use <code>pickle<\/code> to save your models because it&#8217;s easy, then you realize you can&#8217;t load them because the production environment has a slightly different version of a dependency, and <code>pickle<\/code> is a security nightmare anyway. We should have used ONNX. We should have used MLflow. We should have used our brains.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_Aftermath_Shoveling_the_Wreckage\"><\/span>The Aftermath: Shoveling the Wreckage<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>I quit yesterday. I didn&#8217;t even give two weeks&#8217; notice. I just sent an email with the root password to the production cluster and a link to the documentation that I never finished writing. <\/p>\n<p>Project Icarus is being &#8220;sunsetted.&#8221; The company is looking for a buyer, probably some legacy firm that wants to &#8220;acquire the AI talent.&#8221; They don&#8217;t realize the talent is spent. We are all shells of ourselves, haunted by the sound of Slack notifications and the sight of a red Grafana dashboard.<\/p>\n<p>If you are a Data Scientist reading this, take it as a warning. When they tell you to &#8220;move fast and break things,&#8221; remember that you are the one who has to fix them at 3:00 AM. When they tell you that &#8220;data quality is a secondary concern,&#8221; they are lying. When they tell you that &#8220;we don&#8217;t need a formal deployment pipeline,&#8221; they are setting you on fire.<\/p>\n<p>Machine learning is not a magic wand. It is a high-maintenance, temperamental engine that requires rigorous engineering, absolute transparency, and a level of discipline that most &#8220;fast-growing&#8221; startups are unwilling to provide. <\/p>\n<p>We flew too close to the sun with wings made of unversioned Python scripts and &#8220;good enough&#8221; data. 
Now, we\u2019re just another crater in the history of the tech bubble.<\/p>\n<pre class=\"codehilite\"><code class=\"language-text\"># Final state of the Icarus repository\n$ git log --oneline -n 5\nf3a2b1c (HEAD -&gt; master) fix: try to stop the OOM kills by reducing batch size to 1\na4d5e6f hotfix: remove the leaked feature again (who put this back?)\nb7c8d9e emergency: bypass the feature store and read directly from S3\nc0d1e2f docs: update README (just kidding, it's still empty)\ne3f4a5b initial commit: Project Icarus - The Future of Churn Prediction\n<\/code><\/pre>\n<p>The future is currently a <code>503 Service Unavailable<\/code> error. I\u2019m going to go sleep for a month. Don&#8217;t call me. Don&#8217;t even send an automated email. I\u2019ve had enough of &#8220;intelligence&#8221; for one lifetime. I\u2019m going to go find a job where the only &#8220;models&#8221; I deal with are made of wood and don&#8217;t require <code>CUDA 12.1<\/code> to function.<\/p>\n<p>The technical debt has been called in. And we are all bankrupt.<\/p>\n<hr \/>\n<p><strong>Technical Appendix: The Tools That Failed Us (Because We Failed Them)<\/strong><\/p>\n<ol>\n<li><strong>Scikit-learn 1.4.2<\/strong>: Used for the initial prototyping. Great tool, but we treated it like a black box. We didn&#8217;t understand the underlying algorithms, so when the <code>RandomForest<\/code> started overfitting, we just added more trees until we ran out of memory.<\/li>\n<li><strong>Pandas 2.2.1<\/strong>: The backbone of our feature engineering. We used it for everything, including things that should have been done in SQL. We were loading 20GB dataframes into 16GB of RAM and wondering why the kernel was killing our processes.<\/li>\n<li><strong>PyTorch 2.2.0+cu121<\/strong>: Our deep learning framework. We didn&#8217;t use <code>Lightning<\/code> or any other abstraction. 
We wrote raw training loops that were full of logic errors and didn&#8217;t handle device placement correctly.<\/li>\n<li><strong>TensorFlow 2.16.1<\/strong>: Used by the &#8220;legacy&#8221; team. We tried to connect the two frameworks with a custom bridge that was basically just <code>numpy<\/code> arrays passed through a socket. The latency was astronomical.<\/li>\n<li><strong>Docker<\/strong>: We used it, but we didn&#8217;t understand it. Our images were 12GB because we kept installing <code>nvidia-cuda-toolkit<\/code> inside the container instead of starting from a proper CUDA base image.<\/li>\n<li><strong>Kubernetes<\/strong>: The &#8220;orchestrator&#8221; that mostly just orchestrated our demise. We didn&#8217;t set resource limits or requests correctly, so one runaway model would starve the entire cluster of resources.<\/li>\n<\/ol>\n<p>This is not a &#8220;how-to&#8221; guide. This is a &#8220;how-not-to.&#8221; If you see your team doing any of this, run. Or at least, make sure your resume is updated and your LinkedIn is set to &#8220;Open to Work.&#8221; Because the meltdown is coming, and no amount of &#8220;vibrant&#8221; corporate culture can stop a <code>SIGKILL<\/code>.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Related_Articles\"><\/span>Related Articles<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Explore more insights and best practices:<\/p>\n<ul>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/fixed-nginx-showing-blank-php-pages-with-fastcgi-or-php-fpm\/\">Fixed Nginx Showing Blank Php Pages With Fastcgi Or Php Fpm<\/a><\/li>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/essential-cybersecurity-best-practices-to-protect-your-data\/\">Essential Cybersecurity Best Practices To Protect Your Data<\/a><\/li>\n<li><a href=\"https:\/\/itsupportwale.com\/blog\/what-is-kubernetes-a-simple-guide-to-k8s-orchestration\/\">What Is Kubernetes A Simple Guide To K8S 
Orchestration<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>text [2024-10-14 03:14:22,891] ERROR:worker:Process &#8216;Icarus-Inference-Engine-7&#8217; terminated with signal 9 (SIGKILL) [2024-10-14 03:14:22,892] CRITICAL:kernel:Out of Memory (OOM) killer invoked. Victim: python3.11 (pid 4402) [2024-10-14 03:14:23,004] TRACEBACK: File &#8220;\/app\/inference\/model_loader.py&#8221;, line 142, in predict features = feature_store.get_latest(user_id) File &#8220;\/app\/data\/feature_store.py&#8221;, line 89, in get_latest df = pd.read_sql(query, engine) File &#8220;\/usr\/local\/lib\/python3.11\/site-packages\/pandas\/io\/sql.py&#8221;, line 1561, in read_sql return pandas_sql.read_query(sql, index_col=index_col, params=params) MemoryError: &#8230; <a title=\"Top Machine Learning Best Practices for Better Models\" class=\"read-more\" href=\"https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/\" aria-label=\"Read more  on Top Machine Learning Best Practices for Better Models\">Read more<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-4796","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.0 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Top Machine Learning Best Practices for Better Models - ITSupportWale<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Top Machine 
Learning Best Practices for Better Models - ITSupportWale\" \/>\n<meta property=\"og:description\" content=\"text [2024-10-14 03:14:22,891] ERROR:worker:Process &#8216;Icarus-Inference-Engine-7&#8217; terminated with signal 9 (SIGKILL) [2024-10-14 03:14:22,892] CRITICAL:kernel:Out of Memory (OOM) killer invoked. Victim: python3.11 (pid 4402) [2024-10-14 03:14:23,004] TRACEBACK: File &#8220;\/app\/inference\/model_loader.py&#8221;, line 142, in predict features = feature_store.get_latest(user_id) File &#8220;\/app\/data\/feature_store.py&#8221;, line 89, in get_latest df = pd.read_sql(query, engine) File &#8220;\/usr\/local\/lib\/python3.11\/site-packages\/pandas\/io\/sql.py&#8221;, line 1561, in read_sql return pandas_sql.read_query(sql, index_col=index_col, params=params) MemoryError: ... Read more\" \/>\n<meta property=\"og:url\" content=\"https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/\" \/>\n<meta property=\"og:site_name\" content=\"ITSupportWale\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\" \/>\n<meta property=\"article:published_time\" content=\"2026-05-23T16:06:40+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2021\/05\/android-chrome-512x512-1.png\" \/>\n\t<meta property=\"og:image:width\" content=\"512\" \/>\n\t<meta property=\"og:image:height\" content=\"512\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Techie\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Techie\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"12 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/\"},\"author\":{\"name\":\"Techie\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\"},\"headline\":\"Top Machine Learning Best Practices for Better Models\",\"datePublished\":\"2026-05-23T16:06:40+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/\"},\"wordCount\":1991,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/\",\"name\":\"Top Machine Learning Best Practices for Better Models - 
ITSupportWale\",\"isPartOf\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\"},\"datePublished\":\"2026-05-23T16:06:40+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/itsupportwale.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Top Machine Learning Best Practices for Better Models\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#website\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"name\":\"ITSupportWale\",\"description\":\"Tips, Tricks, Fixed-Errors, Tutorials &amp; 
Guides\",\"publisher\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/itsupportwale.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#organization\",\"name\":\"itsupportwale\",\"url\":\"https:\/\/itsupportwale.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png\",\"contentUrl\":\"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png\",\"width\":1119,\"height\":144,\"caption\":\"itsupportwale\"},\"image\":{\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/Itsupportwale-298547177495978\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d\",\"name\":\"Techie\",\"sameAs\":[\"https:\/\/itsupportwale.com\",\"iswblogadmin\"],\"url\":\"https:\/\/itsupportwale.com\/blog\/author\/iswblogadmin\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Top Machine Learning Best Practices for Better Models - ITSupportWale","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/","og_locale":"en_US","og_type":"article","og_title":"Top Machine Learning Best Practices for Better Models - ITSupportWale","og_description":"text [2024-10-14 03:14:22,891] ERROR:worker:Process &#8216;Icarus-Inference-Engine-7&#8217; terminated with signal 9 (SIGKILL) [2024-10-14 03:14:22,892] CRITICAL:kernel:Out of Memory (OOM) killer invoked. Victim: python3.11 (pid 4402) [2024-10-14 03:14:23,004] TRACEBACK: File &#8220;\/app\/inference\/model_loader.py&#8221;, line 142, in predict features = feature_store.get_latest(user_id) File &#8220;\/app\/data\/feature_store.py&#8221;, line 89, in get_latest df = pd.read_sql(query, engine) File &#8220;\/usr\/local\/lib\/python3.11\/site-packages\/pandas\/io\/sql.py&#8221;, line 1561, in read_sql return pandas_sql.read_query(sql, index_col=index_col, params=params) MemoryError: ... Read more","og_url":"https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/","og_site_name":"ITSupportWale","article_publisher":"https:\/\/www.facebook.com\/Itsupportwale-298547177495978","article_published_time":"2026-05-23T16:06:40+00:00","og_image":[{"width":512,"height":512,"url":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2021\/05\/android-chrome-512x512-1.png","type":"image\/png"}],"author":"Techie","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Techie","Est. 
reading time":"12 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/#article","isPartOf":{"@id":"https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/"},"author":{"name":"Techie","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d"},"headline":"Top Machine Learning Best Practices for Better Models","datePublished":"2026-05-23T16:06:40+00:00","mainEntityOfPage":{"@id":"https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/"},"wordCount":1991,"commentCount":0,"publisher":{"@id":"https:\/\/itsupportwale.com\/blog\/#organization"},"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/","url":"https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/","name":"Top Machine Learning Best Practices for Better Models - ITSupportWale","isPartOf":{"@id":"https:\/\/itsupportwale.com\/blog\/#website"},"datePublished":"2026-05-23T16:06:40+00:00","breadcrumb":{"@id":"https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/itsupportwale.com\/blog\/top-machine-learning-best-practices-for-better-models\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/itsupportwale.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Top Machine Learning Best Practices 
for Better Models"}]},{"@type":"WebSite","@id":"https:\/\/itsupportwale.com\/blog\/#website","url":"https:\/\/itsupportwale.com\/blog\/","name":"ITSupportWale","description":"Tips, Tricks, Fixed-Errors, Tutorials &amp; Guides","publisher":{"@id":"https:\/\/itsupportwale.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/itsupportwale.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/itsupportwale.com\/blog\/#organization","name":"itsupportwale","url":"https:\/\/itsupportwale.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png","contentUrl":"https:\/\/itsupportwale.com\/blog\/wp-content\/uploads\/2023\/09\/cropped-Logo-trans-without-slogan.png","width":1119,"height":144,"caption":"itsupportwale"},"image":{"@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/Itsupportwale-298547177495978"]},{"@type":"Person","@id":"https:\/\/itsupportwale.com\/blog\/#\/schema\/person\/8c5a2b3d36396e0a8fd91ec8242fd46d","name":"Techie","sameAs":["https:\/\/itsupportwale.com","iswblogadmin"],"url":"https:\/\/itsupportwale.com\/blog\/author\/iswblogadmin\/"}]}},"_links":{"self":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts\/4796","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/itsupportwale.com\/bl
og\/wp-json\/wp\/v2\/comments?post=4796"}],"version-history":[{"count":0,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/posts\/4796\/revisions"}],"wp:attachment":[{"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/media?parent=4796"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/categories?post=4796"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/itsupportwale.com\/blog\/wp-json\/wp\/v2\/tags?post=4796"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}