Stop Calling It Magic: A Grumpy SRE’s Guide to What Is Machine Learning
It was 3 AM on a Tuesday in 2019. We had just deployed a “smart” auto-scaler for our edge nodes. The idea was simple: use a lightweight ML model to predict traffic spikes and spin up instances before the load hit. Instead, the model saw a routine cron job, interpreted the slight CPU blip as an exponential curve, and tried to provision 4,000 AWS c5.4xlarge instances in the us-east-1 region. We didn’t have the quota, but the attempt alone triggered a cascade of API rate limits that locked our entire control plane.
AWS throttled us. Our monitoring went dark. The “smart” scaler was also responsible for health checks, so once it couldn’t talk to the AWS API it started marking healthy nodes as “stale.” We spent six hours manually killing rogue processes while the CFO watched the billing dashboard climb like a SpaceX rocket. That was my introduction to “AI-driven operations.” It wasn’t intelligent; it was a feedback loop with a credit card attached. If you want to know what machine learning is, start there: it is a system that fails in ways your unit tests can’t catch.
The Documentation Is Lying To You
If you search for “what is machine learning,” you’ll find a thousand blogs talking about “mimicking human intelligence” or “neural pathways.” That’s marketing garbage. It’s designed to sell VC seats and SaaS subscriptions. In reality, machine learning is just high-dimensional curve fitting. It’s a way to generate a function $f(x) = y$ when the logic is too messy for a human to write in a switch statement.
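To make the curve-fitting claim concrete, here is a minimal sketch in plain Python. The data points (requests per second versus CPU percent) are invented for illustration, and `fit_line` is a hypothetical helper, not anything from a real library. It fits a straight line with ordinary least squares, which is about the smallest "model" you can build.

```python
# Hypothetical data: requests per second (x) vs. CPU percent (y).
xs = [100.0, 200.0, 300.0, 400.0, 500.0]
ys = [12.0, 21.0, 33.0, 41.0, 52.0]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(xs, ys)


def f(x):
    """The 'model': a function derived from data instead of written by hand."""
    return slope * x + intercept
```

With these made-up numbers, `f(600.0)` comes out at roughly 61.8 percent CPU. A neural network does the same thing with millions of parameters instead of two, which is where the "high-dimensional" part comes in.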
Most documentation ignores the infrastructure tax. They show you a Jupyter notebook where everything works on a .csv file stored on a laptop. They don’t tell you about the 15GB Docker images, the glibc version mismatches in your base image, or the fact that nvidia-smi will be the most important command in your troubleshooting toolkit. We’re moving away from “if-this-then-that” and moving toward “if-this-is-statistically-likely-then-maybe-that.” It’s a nightmare for anyone who cares about determinism.
Note to self: Never trust a model that hasn’t been tested against a dataset containing null bytes or 4-byte UTF-8 characters. It can segfault your inference engine.
The Meat: How It Actually Works (Without the Fluff)
At its core, machine learning is a three-part problem: Data, Model, and Compute. If you’re an SRE, you mostly care about the Compute and the Data pipeline, because that’s what breaks at 2:00 PM on a Friday.
To understand what machine learning is, you have to look at the training phase. You take a massive pile of data (say, 500GB of JSON logs from api.stripe.com) and you feed it into an algorithm. The algorithm tries to find patterns. It assigns “weights” (numbers) to different inputs. If it’s trying to predict fraud, it might give a high weight to “IP address from a known data center” and a low weight to “User has been active for 5 years.”
# This is what people think ML is
import sys

import torch
import torch.nn as nn

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(128, 64)
        self.layer2 = nn.Linear(64, 1)

    def forward(self, x):
        return torch.sigmoid(self.layer2(torch.relu(self.layer1(x))))

# This is what ML actually is in production
model = SimpleModel()
try:
    # load_state_dict raises RuntimeError on missing keys or shape mismatches
    model.load_state_dict(torch.load('/mnt/models/v2/weights.pth'))
except RuntimeError as e:
    print(f"Incompatible weights. Did someone change the hidden layer size without telling DevOps? {e}")
    sys.exit(1)
The “Learning” part is just an optimization loop. The model makes a guess, calculates how wrong it was (the “Loss”), and uses backpropagation to tweak the weights. It does this millions of times. Eventually, you get a binary file—a “model”—that you can push to production. This file is a black box. You can’t grep it. You can’t diff it in a meaningful way. You just have to trust the validation metrics.
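The loop described above can be sketched without any framework. This is a toy, assuming a two-parameter linear model and data invented from y = 3x + 1; the names `train` and `mse` are hypothetical. For a model this small, "backpropagation" collapses into two hand-written derivatives, but the guess, measure, nudge cycle is the same one a real training job runs.

```python
# Toy training data generated from y = 3x + 1 (a made-up "ground truth").
data = [(x, 3.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]


def mse(data, w, b):
    """Mean squared error: the 'Loss' for this toy model."""
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)


def train(data, learning_rate=0.05, steps=2000):
    """Plain gradient descent on the weights of y = w * x + b."""
    w, b = 0.0, 0.0  # the weights start out knowing nothing
    n = len(data)
    for _ in range(steps):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in data:
            error = (w * x + b) - y        # the guess minus the truth
            grad_w += 2.0 * error * x / n  # d(loss)/dw
            grad_b += 2.0 * error / n      # d(loss)/db
        w -= learning_rate * grad_w        # nudge the weights downhill
        b -= learning_rate * grad_b
    return w, b


w, b = train(data)
```

After training, `w` and `b` land very close to 3 and 1. Note what you are left with: two floats and no explanation. Scale that up and you have the binary blob you can't grep.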
- Supervised Learning: You give the model the answers. “Here is a picture of a cat, it is a cat.” This is expensive because humans have to label the data.
- Unsupervised Learning: You give the model data and say “find something interesting.” This usually results in the model finding that “users who buy shoes also buy socks,” which your marketing team already knew.
The Infrastructure Tax: Why Your Kubelet Is Crying
When you ask “what is machine learning” from an operational perspective, the answer is “a resource hog.” Traditional microservices are easy to scale. You look at CPU and memory. ML adds a third dimension: the GPU.
If you’re running inference on a CPU, your p99 latency will likely be garbage (300ms+). If you move to GPUs, you’re now dealing with nvidia-container-runtime, specific CUDA versions (e.g., 12.1 vs 11.8), and the fact that a single A100 instance costs more per hour than your entire staging environment.
I once saw a team try to deploy a BERT-based NLP model into a standard Kubernetes cluster without setting resource limits. The model tried to pre-allocate 12GB of VRAM. The node only had 8GB. The Kubelet didn’t just kill the pod; the entire NVIDIA driver hung, requiring a hard reboot of the bare-metal host. We lost three other production pods on that node.
# A snippet of the YAML-hell you'll encounter
resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    memory: "16Gi"
    cpu: "4"
# Pro-tip: Always set shm-size for PyTorch.
# Default Docker shm-size is 64MB, which is small enough to crash multi-worker dataloaders.
# In Kubernetes there is no shm-size flag; mount a memory-backed emptyDir at /dev/shm instead.
Feature Engineering: The Real Work
Data scientists spend 90% of their time on “Feature Engineering.” This is a fancy term for “cleaning up the mess in the database.” If your input data is null, or if a dev changed a column name in localhost:5432 without updating the ETL pipeline, the model will fail. But it won’t fail with a 500 Internal Server Error. It will fail silently by giving a wrong answer with 99% confidence.
Consider a recommendation engine. You need to feed it “user_age,” “last_purchase_timestamp,” and “browser_type.”
- What if “user_age” is missing? Do you use 0? The average? -1? Each choice changes the model’s output.
- What if the timestamp is in UTC in the database but the model was trained on PST?
- What if the “browser_type” is “Mozilla/5.0…” and the model only expects “Chrome” or “Safari”?
- What if the data is skewed because 80% of your traffic comes from bots?
- What if the upstream API at api.segment.io changes its payload format?
This is why ML is hard. It’s not the math. It’s the data contract. In a standard app, a broken contract triggers an exception. In ML, a broken contract triggers a bad business decision.
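One way to make a broken contract loud again is to validate every row before it reaches the model. The sketch below is an assumption-heavy illustration for the three recommendation-engine features mentioned earlier: `ContractViolation`, `validate_features`, the `KNOWN_BROWSERS` set, and the age bounds are all hypothetical choices, not part of any real pipeline.

```python
from datetime import datetime, timezone

# Hypothetical: the browser values the model saw during training.
KNOWN_BROWSERS = {"Chrome", "Safari", "Firefox"}


class ContractViolation(ValueError):
    """Raised when an input row does not match what the model was trained on."""


def validate_features(row):
    """Return a cleaned feature dict, or raise ContractViolation."""
    age = row.get("user_age")
    if not isinstance(age, (int, float)) or not 0 < age < 130:
        raise ContractViolation(f"user_age missing or implausible: {age!r}")

    ts = row.get("last_purchase_timestamp")
    if not isinstance(ts, datetime) or ts.tzinfo is None:
        # A naive timestamp is how UTC-vs-PST bugs sneak in.
        raise ContractViolation("last_purchase_timestamp must be timezone-aware")

    browser = row.get("browser_type")
    if browser not in KNOWN_BROWSERS:
        raise ContractViolation(f"unknown browser_type: {browser!r}")

    return {
        "user_age": age,
        "last_purchase_timestamp": ts.astimezone(timezone.utc),
        "browser_type": browser,
    }
```

Whether to reject a bad row or fall back to a default is a business decision. The point of the sketch is only that the decision gets made explicitly, in code you can review, instead of implicitly by the model.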
The “Gotcha”: Training-Serving Skew
This is the silent killer of ML projects. Training-serving skew happens when the data the model sees during training is different from the data it sees in production.
Imagine you train a model to predict server failures. You use historical data from your Prometheus archives. In the archives, the data is aggregated every 5 minutes. But in production, your real-time monitoring feeds the model data every 10 seconds. The model, expecting 5-minute averages, sees the 10-second spikes and panics. It starts flagging every server as “about to explode.”
You can’t fix this with a better algorithm. You fix it by ensuring your feature pipeline is identical in both environments. This usually means using a “Feature Store,” which is just a very expensive database that both your training scripts and your production API can query.
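For the Prometheus example, "identical in both environments" can be as simple as one shared function that both the training job and the serving path import. A minimal sketch, assuming samples arrive as `(unix_timestamp, value)` pairs; `five_minute_means` and `WINDOW_SECONDS` are hypothetical names.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # the 5-minute window the model was trained on


def five_minute_means(samples):
    """Collapse (unix_ts, value) samples into per-window means.

    Returns a list of (window_start_ts, mean) tuples sorted by window start.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        window_start = int(ts) - int(ts) % WINDOW_SECONDS
        buckets[window_start].append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]
```

Feed it 10-second samples in production and the model sees the same smoothed 5-minute averages it was trained on, so a single spike gets diluted instead of triggering a panic. The important design choice is that there is exactly one copy of this function, not a SQL version for training and a Go version for serving.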
Pro-tip: If someone suggests building a custom feature store in-house, quit your job. Use an off-the-shelf solution or just use a well-indexed Postgres table. Don’t reinvent the wheel with more YAML.
The Lifecycle of a Model (The SRE Version)
Most people think the lifecycle is: Research -> Train -> Deploy.
The reality is more like: Data Leakage -> OOM Kill -> Dependency Hell -> Silent Failure -> Rollback.
- Data Collection: You realize your S3 buckets are a mess and half the logs are missing.
- Training: You burn $5,000 in AWS credits to get a model that is 2% better than a random guess.
- Packaging: You try to wrap the model in a Flask API. You realize pandas and numpy add 800MB to your image size.
- Deployment: You push to production. The livenessProbe fails because the model takes 45 seconds to load into memory.
- Monitoring: You realize you have no idea if the model is working. You start logging every prediction to BigQuery for “later analysis.”
- Drift: Three weeks later, the model’s accuracy drops because the world changed (e.g., a holiday season started) and the model doesn’t know what a “Black Friday” is.
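The livenessProbe failure in the deployment step usually comes from treating "the process is up" and "the model is loaded" as the same question. Here is a rough sketch of separating the two, assuming you would wire `is_alive` and `is_ready` to separate probe endpoints yourself. `ModelServer` and its `loader` argument are hypothetical and not tied to Flask or any serving framework.

```python
import threading


class ModelServer:
    """Loads a model in the background and reports liveness and readiness separately."""

    def __init__(self, loader):
        self._loader = loader  # callable that returns a callable model
        self._model = None
        self._error = None
        self._ready = threading.Event()
        threading.Thread(target=self._load, daemon=True).start()

    def _load(self):
        try:
            self._model = self._loader()  # the slow part (e.g. 45 seconds)
            self._ready.set()
        except Exception as exc:
            self._error = exc  # surfaces as a liveness failure

    def is_alive(self):
        """Liveness: the process is up and loading has not failed."""
        return self._error is None

    def is_ready(self):
        """Readiness: the model is in memory and can take traffic."""
        return self._ready.is_set()

    def wait_until_ready(self, timeout=None):
        return self._ready.wait(timeout)

    def predict(self, x):
        if not self.is_ready():
            raise RuntimeError("model is still loading")
        return self._model(x)
```

Kubernetes has a readinessProbe and a startupProbe for exactly this split. With something like the above, a slow load keeps the pod out of rotation instead of getting it killed and restarted in a loop.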
What Is Machine Learning? It’s a Maintenance Burden.
If you can solve a problem with a regex, do it. If you can solve it with a JOIN, do it. If you can solve it with a set of hard-coded rules, do it. Machine learning should be your absolute last resort. Why? Because you can’t debug it.
When a customer asks “Why was my account flagged for fraud?”, and your answer is “The weights in the third hidden layer of our neural network were slightly higher for your specific latency profile,” you haven’t solved a problem. You’ve just automated an excuse.
We use ML at my current gig for image compression. It’s great. If it fails, the image just looks a bit pixelated. The stakes are low. But when people talk about using ML for “automated incident response” or “AI-driven security,” I reach for my pager. Those systems are brittle. They don’t handle “unknown unknowns.” They only know what they’ve seen before. And in the world of SRE, the thing that breaks your system is almost always something you’ve never seen before.
The Technical Debt of “Intelligence”
Google researchers published a paper in 2014 called “Machine Learning: The High-Interest Credit Card of Technical Debt.” Every SRE should read it. It points out that the actual ML code is a tiny fraction of the system. The rest is configuration, data collection, feature extraction, monitoring, and infrastructure management.
When you add a model to your stack, you aren’t just adding a library. You’re adding a dependency on the specific distribution of your input data. If your marketing team runs a campaign in a new country, your model might break. If your frontend team changes the UI and users start clicking differently, your model might break. It is the only type of code that “rots” even if you don’t change a single line of it.
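# Monitoring for drift isn't just checking CPU.
# It's checking the distribution of your outputs.
from scipy.stats import ks_2samp

def check_model_drift(current_predictions, baseline_distribution, alpha=0.05):
    # Two-sample Kolmogorov-Smirnov test (scipy's ks_2samp, swapped in for the
    # original undefined ks_test). baseline_distribution is a sample of past
    # predictions. A small p-value means the two samples are unlikely to come
    # from the same distribution.
    result = ks_2samp(current_predictions, baseline_distribution)
    if result.pvalue < alpha:
        # trigger_alert is a placeholder for your own paging hook
        trigger_alert("Model is hallucinating or the world changed.")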
The Wrap-up
Machine learning is just a way to trade code complexity for data complexity. You stop writing logic and start managing pipelines. It’s not a silver bullet; it’s a heavy, expensive, non-deterministic cannon that requires a team of engineers to keep it pointed in the right direction. If you’re going to use it, make sure the problem you’re solving is worth the 3 AM pages and the GPU bill. Most of the time, it isn’t.
Don’t build a “smart” system until you have a “reliable” one. A simple script that works 100% of the time is infinitely better than a “neural network” that works 95% of the time and fails spectacularly the other 5%.