Top Artificial Intelligence Best Practices for Success

[2023-10-27T14:22:01.442Z] kernel: [12409.552101] python3[14201]: segfault at 0 ip 00007f8e12a34b12 sp 00007ffc8e12a340 error 4 in libtorch_cuda.so[7f8e10000000+12a34000]
[2023-10-27T14:22:01.443Z] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.50 GiB (GPU 0; 23.65 GiB total capacity; 18.21 GiB already allocated; 4.12 GiB free; 19.00 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
[2023-10-27T14:22:01.445Z] Terminated: 15 (SIGTERM)
[2023-10-27T14:22:01.446Z] Environment: Ubuntu 22.04.3 LTS, Python 3.10.12, PyTorch 2.0.1+cu118, NVIDIA-SMI 535.104.05, Driver 535.104.05, CUDA 12.1, Hardware: 1x RTX 3090 (24GB), 64GB DDR4 RAM, i9-12900K.


Listen close, kid. I saw you staring at that stack trace like it was written in Linear A. You think you’re doing “artificial intelligence” because you imported a library that’s larger than the entire operating system I used to run a bank on in 1992. You’re not. You’re just piling abstractions on top of a leaking basement. You sent that job to the GPU without checking the memory map, didn’t you? You trusted the “magic” of the caching allocator. Now the kernel is screaming, the OOM killer is sharpening its knife, and you’re wondering why your “best practices” didn’t save you.

Sit down. I’m going to walk you through the wreckage of this migration. Maybe if I document these grievances, you’ll stop treating the hardware like an infinite resource and start treating it like the finicky, silicon-etched beast it actually is.

Log Entry 1: The Dependency Hellscape and the Myth of Reproducibility

[2023-10-27T15:04:12+00:00]
State: pip freeze > requirements.txt (A document of lies)
Environment: venv isolated, yet bleeding.

The first thing you did was run pip install torch torchvision torchaudio. You thought that was enough. You didn’t check the glibc version on the host. You didn’t check if libstdc++.so.6 was pointing to a version that actually supports the symbols required by the pre-compiled binaries you just shoved into your /site-packages.

In 1984, we wrote Makefiles. We knew where every header lived. Today, you have “artificial intelligence” frameworks that pull in 4GB of dependencies just to multiply two matrices. Your first “best practice” is this: If you cannot reproduce the environment down to the specific shared object hash, you do not have a model; you have a coincidence.

Look at your pip list. You have numpy==1.24.3 and pandas==2.0.3. But wait, another sub-dependency pulled in a different version of six or requests, and now your runtime is a minefield of ImportError: cannot import name '...' from '...'.

You need to use a lockfile. Not a suggestion, a lockfile. poetry.lock or conda-lock. And even then, you’re at the mercy of the Python package index. I’ve seen “artificial intelligence” projects die because a developer deleted a repository on GitHub that a setup script was curling in the background.
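If you want the lockfile to actually bite, verify it at process start. Here is a pre-flight sketch using only the standard library — the pin set is hypothetical; in practice you parse it out of your lockfile:

```python
from importlib.metadata import PackageNotFoundError, version

# Hypothetical pins -- in real code, parse these out of poetry.lock or conda-lock.
PINNED = {"numpy": "1.24.3", "pandas": "2.0.3"}

def check_pins(pins):
    """Return human-readable drift reports; an empty list means clean."""
    problems = []
    for name, want in pins.items():
        try:
            got = version(name)
        except PackageNotFoundError:
            problems.append(f"{name}: not installed (want {want})")
            continue
        if got != want:
            problems.append(f"{name}: have {got}, want {want}")
    return problems
```

Call it before the first import of anything heavy and refuse to train if it returns a single line. A coincidence you catch at startup costs seconds; one you catch at epoch 40 costs the weekend.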

Stop using pip install. Use a container, but don’t use a “vibrant” base image that updates every night. Use a specific SHA256 hash of a Debian Slim or Alpine image. If you don’t control the bytes, the bytes will eventually control you.

# Example of what you SHOULD have run to see the rot:
ldd /home/user/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
# Check for "not found" or version mismatches in the output.
# If you see a mismatch in GLIBC_2.34, your "modern" OS is too old for your "modern" AI.

Log Entry 2: Data Ingestion and the Silent Failure of Sanitization

[2023-10-27T16:45:33+00:00]
State: DataLoader hung at __next__()
Environment: num_workers=16, pin_memory=True

You tried to feed the beast. You pointed your DataLoader at a directory of 10 million JPEGs and wondered why the CPU usage hit 100% while the GPU sat at 0% utilization. You’re starving the silicon, kid.

“Artificial intelligence” is 90% IO and 10% math, but you spent all your time on the math. You didn’t check for corrupted headers. You didn’t check for zero-byte files. You didn’t check for NaNs in your CSVs. One single NaN in a weight initialization or a training sample, and your loss function becomes inf. You’ve spent $400 of the company’s cloud credits training a model to output “Nothing.”

The “best practice” here is defensive data engineering. You don’t trust the data. You grep it, you awk it, you validate the checksums.

# Your junior-level mistake:
from torch.utils.data import DataLoader

dataset = MyDataset(root_dir="./data")  # MyDataset: your map-style Dataset subclass
loader = DataLoader(dataset, batch_size=64, num_workers=16)

# What you should have done:
# 1. Check file integrity before the loop.
# 2. Use mmap (memory mapping) for large datasets to avoid copying buffers.
# 3. Profile the bottleneck with 'iostat -x 1'.
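What does "check file integrity" mean in practice? Here is a minimal pre-flight scan, assuming a directory tree of .jpg files — the helper names are mine, not a library API:

```python
import math
from pathlib import Path

def preflight(root):
    """Flag zero-byte files and files missing the JPEG SOI marker (FF D8)."""
    bad = []
    for p in sorted(Path(root).rglob("*.jpg")):
        head = p.read_bytes()[:2]
        if not head:
            bad.append((str(p), "zero-byte"))
        elif head != b"\xff\xd8":
            bad.append((str(p), "bad JPEG header"))
    return bad

def has_nan(row):
    """True if any float in a parsed CSV row is NaN -- run this BEFORE training."""
    return any(isinstance(v, float) and math.isnan(v) for v in row)
```

Run the scan once, offline, and quarantine everything it flags. It is cheaper than discovering the one poisoned sample three hours into a run.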

If your disk latency is high, your “artificial intelligence” is just a very expensive way to wait for a spinning platter or a saturated NVMe bus. And for the love of Ken Thompson, stop using pandas for datasets larger than your RAM. Use vaex or polars, or better yet, write a binary format that matches the memory layout of your input tensors. Every time you convert a string to a float in a training loop, a kernel architect loses their wings.
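The binary-layout idea, sketched with numpy — the file path and shape here are invented for the example:

```python
import tempfile

import numpy as np

# Offline step: pack samples as contiguous float32, matching the
# memory layout your input tensors expect.
samples = np.arange(12, dtype=np.float32).reshape(3, 4)
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as f:
    samples.tofile(f)
    path = f.name  # stand-in for your real dataset file

# Training step: memory-map instead of loading. The OS pages rows in on
# demand, nothing is copied up front, and no string-to-float parsing runs.
rows = np.memmap(path, dtype=np.float32, mode="r", shape=(3, 4))
```

One `tofile` offline buys you zero parsing in the hot loop, forever after.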

Log Entry 3: The VRAM Mirage and the Caching Allocator’s Lies

[2023-10-27T18:12:09+00:00]
State: nvidia-smi showing 23.5GB/24GB used.
Environment: PyTorch 2.0.1, max_split_size_mb unset.

This is where you hit the wall today. CUDA_ERROR_OUT_OF_MEMORY. You looked at nvidia-smi and saw you had 4GB free, so you tried to allocate a 2GB tensor. It failed. Why? Fragmentation.

The PyTorch caching allocator is a black box that tries to be smarter than the driver. It holds onto memory blocks because cudaMalloc is an expensive call into the driver. But if your model has varying sequence lengths or you’re doing dynamic graph construction, you end up with a “Swiss cheese” memory map. You have 4GB free in aggregate, but the largest contiguous block is 512MB.

You didn’t profile your memory. You didn’t use torch.cuda.memory_summary(). You just kept bumping the batch size because some blog post told you it would “improve convergence.”

Best Practice: Deterministic Memory Budgeting.
You calculate the memory footprint of your weights, your gradients, and your optimizer states (Adam takes 2x the weight memory, kid, learn it). Then you leave a 15% buffer for the kernel and the context.
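The back-of-envelope arithmetic, for fp32 weights with plain Adam — note this deliberately excludes activations, which you must get from a profiler run:

```python
def adam_training_bytes(n_params, bytes_per_param=4, headroom=0.15):
    """Weights + gradients + Adam's two moment buffers = 4x the weights,
    plus a cushion for the CUDA context and cuDNN workspaces.
    Activations are NOT included -- measure those separately."""
    raw = 4 * n_params * bytes_per_param
    return int(raw * (1 + headroom))

# A 1.3B-parameter model in fp32 already wants ~22 GiB before a single
# activation is stored. Your 24 GB card was never going to train that
# at batch size 64, whatever the blog post said.
```

Do this arithmetic on paper before the first run, not in the post-mortem.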

# Run this while your script is dying:
nvidia-smi --query-gpu=memory.used,memory.free,utilization.gpu --format=csv -l 1

If you see utilization.gpu dropping to 0% while memory.used stays high, you’re either in a deadlock or you’re thrashing the swap because you forgot that pin_memory=True consumes host RAM. You’re trying to run a marathon while breathing through a straw.

Log Entry 4: The Latency Lie and the Python Tax

[2023-10-27T20:30:00+00:00]
State: strace -c output showing excessive futex calls.
Environment: FastAPI wrapper around a Transformer model.

You finally got the model trained. Now you want to “deploy” it. You wrapped it in a FastAPI web server because that’s what the “artificial intelligence” tutorials told you to do. Now your p99 latency is 450ms for a task that takes the GPU 15ms to compute.

Where is the time going? It’s going to the Global Interpreter Lock (GIL). It’s going to JSON serialization. It’s going to the context switch between the user-space Python process and the kernel-space network stack.

You’re using uvicorn with 4 workers, each loading a 10GB model into VRAM. You just ran out of memory again, didn’t you? Because you didn’t realize that each worker is a separate process with its own copy of the weights unless you’re using a shared memory segment or a model server that actually understands how to manage resources.

Best Practice: Move the Inference out of the Interpreter.
If you care about performance, you export to ONNX or TensorRT. You write a C++ or Rust wrapper. You use gRPC with protobuf instead of bloated JSON.
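You do not need gRPC installed to see the shape of the problem. A toy measurement of the JSON tax, assuming a 1,024-float embedding payload:

```python
import json
import struct

vec = [0.123456789] * 1024  # an embedding-sized payload

as_json = json.dumps(vec).encode()             # text: ~13 bytes per float
as_binary = struct.pack(f"{len(vec)}f", *vec)  # fixed 4 bytes per float

print(f"JSON: {len(as_json)} B, binary: {len(as_binary)} B")
```

And that is only the wire size; the receiver still has to parse every one of those decimal strings back into floats, in Python, under the GIL. Protobuf skips all of it.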

# See how much time you're wasting in syscalls:
strace -p <python_pid> -c

If I see select() or poll() taking up 40% of your execution time, I’m pulling the plug. “Artificial intelligence” doesn’t excuse sloppy systems engineering. You’re building a bridge out of toothpicks and wondering why it sways in the wind.

Log Entry 5: Determinism, Seeds, and the Ghost in the Machine

[2023-10-27T22:15:45+00:00]
State: seed=42 set, yet results differ across runs.
Environment: Multi-GPU training via DistributedDataParallel.

You came to me crying that your model isn’t reproducible. “But Jenkins, I set random.seed(42) and np.random.seed(42)!”

Did you set torch.backends.cudnn.deterministic = True? Did you set torch.backends.cudnn.benchmark = False? No, because you wanted that extra 5% throughput. Well, the cuDNN autotuner picked a different convolution algorithm on run #2 because the GPU temperature was 5 degrees higher and the clock speed throttled.

In “artificial intelligence,” non-determinism is a cancer. If you can’t reproduce a bug, you can’t fix the bug. If your weights drift because of floating-point accumulation errors in a non-deterministic atomic addition on the GPU, you’re not doing science; you’re doing alchemy.

Best Practice: Lock the State.
You lock the seeds, you lock the algorithms, and you document the hardware. If you move from an A100 to an H100, your results will change. If you change your version of CUDA from 11.8 to 12.1, the underlying PTX instructions change.

# The bare minimum for sanity:
import os
import random

import numpy as np
import torch

def seed_everything(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # PYTHONHASHSEED only controls hash randomization if it is set BEFORE
    # the interpreter starts; setting it here covers spawned workers only.
    os.environ['PYTHONHASHSEED'] = str(seed)

And even then, you’re still at the mercy of the DataLoader’s multi-processing. If the OS schedules worker #3 before worker #1, your data order changes, and your gradient descent takes a different path down the loss surface. You need a sampler that is tied to the global seed.
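A sketch of tying the epoch shuffle to the global seed — pure stdlib here to keep it self-contained; in real PyTorch code you would do the same with a `torch.Generator` handed to the sampler:

```python
import random

def epoch_order(n_samples, global_seed, epoch):
    """Sample order as a pure function of (seed, epoch). No OS scheduling
    is involved, so every worker and every rank agrees on the permutation."""
    rng = random.Random(global_seed * 1_000_003 + epoch)
    order = list(range(n_samples))
    rng.shuffle(order)
    return order
```

Call it once per epoch and index your dataset through the result. The order now lives in your config, not in the scheduler's mood.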

Log Entry 6: The Cost of Abstraction and the Final Grievance

[2023-10-28T01:00:00+00:00]
State: Cloud bill exceeded monthly budget in 4 days.
Environment: Kubernetes cluster with “Auto-scaling” enabled.

The final insult. You put your “artificial intelligence” pipeline on a Kubernetes cluster with auto-scaling. You thought the “cloud” would handle the load. But your liveness probes were failing because the model took 60 seconds to load into memory, so Kubernetes kept killing the pod and restarting it in a death loop. Each restart pulled 10GB of container images over the network. Your egress costs are now higher than your rent.

You’ve ignored the “unsexy” parts: the cold-start latency, the health-check timeouts, the resource limits in your YAML files.

# Your broken deployment.yaml
resources:
  limits:
    nvidia.com/gpu: 1
    memory: "16Gi" # Model is 15.5Gi. You forgot the overhead.
  requests:
    nvidia.com/gpu: 1
    memory: "16Gi"

When the model tries to allocate a scratch buffer for the attention mechanism, the pod hits the 16Gi limit and the kernel sends a SIGKILL. You don’t even get a stack trace. Just an OOMKilled status and a confused junior developer.
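The probe death loop has a boring fix: tell Kubernetes the pod is slow to start. A hedged sketch — the endpoint path and port here are hypothetical, fill in your own:

```yaml
# startupProbe buys the model its 60s load time; liveness takes over after.
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 12   # up to 120s before Kubernetes gives up on startup
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
```

No startupProbe means your liveness timeout has to cover the worst cold start, which defeats the point of having a liveness probe at all.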

Best Practice: Profile the Baseline.
Before you ever touch a cloud provider, you run your workload on a local machine with a profiler. You find the “steady state” memory usage. You find the “peak” memory usage. You set your limits at 1.2x the peak.
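The arithmetic, plus a stdlib way to read your own peak RSS on Linux — the helper names are mine, and mind the platform caveat in the comment:

```python
import resource

def peak_rss_mib():
    """Peak resident set size of this process. ru_maxrss is KiB on Linux
    but BYTES on macOS -- check your platform before trusting the unit."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

def k8s_memory_limit(peak_mib, headroom=0.2):
    """The 1.2x-peak rule, rendered as a Kubernetes quantity string."""
    return f"{int(peak_mib * (1 + headroom))}Mi"
```

Run the workload end to end locally, read the peak at exit, and only then write the YAML. A limit copied from a blog post is just the OOM killer on a timer.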

“Artificial intelligence” is not a get-out-of-jail-free card for basic systems architecture. It is a high-performance computing (HPC) workload. If you treat it like a CRUD app, it will break your heart and your bank account.

Stop looking for “magic” solutions. Stop reading hype-filled blogs about “shaping the future.” Go back to the basics. Check your pointers. Watch your memory alignment. Profile your IO. If you can’t explain what every byte in your VRAM is doing, you haven’t finished your job.

Now, clear that /tmp directory, reset the XID error on the GPU with nvidia-smi -r, and start over. And this time, use a debugger, not a “vibe.”


Grievance Log Closed.
Status: Kernel Tainted.
Author: Jenkins, Senior Kernel Architect (Ret.)
