What is Artificial Intelligence? Definition, Types & Examples

AI is Just a Very Expensive Way to Guess the Next Word

In 2019, I was working for a fintech startup that decided we needed “predictive scaling.” The CTO had read a whitepaper about using a Recurrent Neural Network (RNN) to forecast traffic spikes based on historical patterns. We hooked it up to our Kubernetes cluster. One Tuesday, at 3:14 AM, the model decided that a minor blip in API latency from api.stripe.com was actually the start of a massive traffic surge. It triggered a scaling event that tried to provision 800 m5.4xlarge instances in us-east-1.

By the time I woke up to the PagerDuty alert, the AWS bill had climbed by $14,000. The “AI” hadn’t saved us from a spike; it had created a self-inflicted DDoS. The API server on our control plane nodes was screaming under the pressure of tracking that many pending pods. The scheduler was stuck in a loop. We weren’t “innovating”; we were just paying Jeff Bezos for the privilege of watching our control plane melt. That was the day I realized that most people talking about “what is” artificial intelligence have never actually had to clean up the mess it makes when it hits a production environment.

The Marketing Lie vs. The Mathematical Reality

If you listen to a VC, AI is a digital brain. If you listen to a mathematician, it’s a high-dimensional curve-fitting exercise. As an SRE, I see it as a non-deterministic black box that consumes an ungodly amount of VRAM and returns a probabilistic guess. When people ask what is AI, they usually want a philosophical answer. I’m going to give you the one that matters when you’re on call.

At its core, AI—specifically Machine Learning (ML)—is the shift from “explicit logic” to “statistical inference.” In the old days (five years ago), if we wanted to detect a fraudulent transaction, we wrote code like this:


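def is_fraud(transaction, user):
    # Rule 1: a large amount spent away from the user's home city
    if transaction.amount > 10000 and transaction.location != user.home_city:
        return True
    # Rule 2: too many transactions in the last hour
    if transaction.velocity_per_hour > 5:
        return True
    return False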

This is deterministic. It’s easy to test. It’s easy to debug. You can look at the logs and know exactly why a transaction was flagged. AI replaces this with a weight matrix. Instead of if/else, you have y = mx + b, but m is a matrix of learned weights, and in a model the size of GPT-3 that means 175 billion parameters. You don’t write the rules; you show a “model” a million examples of fraud and let it calculate the weights that minimize a “loss function.”
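To make “let it calculate the weights” concrete, here is a minimal sketch of the statistical version of is_fraud: a two-feature logistic regression trained with plain gradient descent. The toy examples, the feature scaling, the learning rate, and the step count are all assumptions made up for illustration; a real fraud model would have far more features and data.

```python
import math

# Hypothetical toy data: (amount in units of $10k, 1.0 if away from home city) -> fraud label
EXAMPLES = [
    ((0.1, 0.0), 0),
    ((0.3, 0.0), 0),
    ((0.2, 1.0), 0),
    ((1.5, 1.0), 1),
    ((2.0, 1.0), 1),
    ((1.8, 0.0), 0),
    ((2.5, 1.0), 1),
    ((0.4, 1.0), 0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Probability of fraud: sigmoid of (weights . features + bias)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


def log_loss(weights, bias, examples):
    """The "loss function": average negative log-likelihood of the labels."""
    eps = 1e-12  # avoids log(0)
    total = 0.0
    for features, label in examples:
        p = predict(weights, bias, features)
        total -= label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)
    return total / len(examples)


def train(examples, lr=0.5, steps=2000):
    """Gradient descent: repeatedly move the weights against the loss gradient."""
    weights = [0.0, 0.0]
    bias = 0.0
    n = len(examples)
    for _ in range(steps):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for features, label in examples:
            error = predict(weights, bias, features) - label
            for i, x in enumerate(features):
                grad_w[i] += error * x / n
            grad_b += error / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


weights, bias = train(EXAMPLES)
print("learned weights:", weights, "bias:", bias)
print("loss:", log_loss(weights, bias, EXAMPLES))
```

Nobody typed “amount > 10000” anywhere in that code. The threshold is now smeared across two floats and a bias, which is exactly why it is harder to debug.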

Pro-tip: If your “AI” can be replaced by a CASE statement in SQL or a simple scikit-learn Random Forest, do it. You will save yourself months of “YAML-hell” trying to manage GPU drivers in a container.

The Architecture of a Guess

To understand what is happening inside these models, you have to stop thinking about “thinking.” A Neural Network is just a series of layers. Each layer is a bunch of numbers. When you pass an input (like a sentence or an image) into the network, it gets converted into a vector—a list of numbers.

Let’s look at a basic linear layer in PyTorch. This is the “hello world” of what people call AI:


import torch
import torch.nn as nn

# A simple linear layer: 10 inputs, 5 outputs
layer = nn.Linear(10, 5)

# Input data (a tensor of 10 numbers)
input_data = torch.randn(1, 10)

# The "Inference"
output = layer(input_data)
print(output)

That’s it. That is the “intelligence.” It’s a matrix multiplication (a batch of dot products) and an addition. The “learning” part happens when we compare the output to the ground_truth and use an algorithm called Backpropagation to work out how the weights inside nn.Linear should change. Backpropagation applies the Chain Rule from calculus to figure out how much each weight contributed to the error, and an optimizer such as gradient descent nudges each weight in the direction that shrinks it. We do this millions of times until the error gets smaller.
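If you want to see the chain rule without the framework hiding it, here is a rough, dependency-free sketch of one linear layer trained on a single example. The layer size, the input, the target, and the learning rate are hypothetical values picked for illustration; PyTorch does the same bookkeeping for you via autograd.

```python
def linear_forward(weights, bias, x):
    """y[j] = sum_i(weights[j][i] * x[i]) + bias[j], the same math as nn.Linear."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def mse(output, target):
    """Mean squared error between the layer output and the ground truth."""
    return sum((o - t) ** 2 for o, t in zip(output, target)) / len(output)


def sgd_step(weights, bias, x, target, lr=0.1):
    """One forward pass, one backward pass, one weight update."""
    output = linear_forward(weights, bias, x)
    n = len(output)
    # dLoss/dy[j] for mean squared error
    dloss_dy = [2 * (o - t) / n for o, t in zip(output, target)]
    # Chain rule: dLoss/dW[j][i] = dLoss/dy[j] * dy[j]/dW[j][i] = dLoss/dy[j] * x[i]
    new_weights = [
        [w - lr * g * xi for w, xi in zip(row, x)]
        for row, g in zip(weights, dloss_dy)
    ]
    # dy[j]/dbias[j] = 1, so the bias gradient is just dLoss/dy[j]
    new_bias = [b - lr * g for b, g in zip(bias, dloss_dy)]
    return new_weights, new_bias


# Hypothetical 3-input, 2-output layer and a single training example
weights = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]]
bias = [0.0, 0.0]
x = [1.0, 2.0, 3.0]
ground_truth = [1.0, -1.0]

loss_before = mse(linear_forward(weights, bias, x), ground_truth)
for _ in range(50):
    weights, bias = sgd_step(weights, bias, x, ground_truth)
loss_after = mse(linear_forward(weights, bias, x), ground_truth)

print("loss before:", loss_before)
print("loss after: ", loss_after)
```

Driving the loss toward zero on one example is trivial. Doing it on a million examples without memorizing them is the actual job.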

We aren’t building a mind. We are building a very complex calculator that is really good at finding patterns in noisy data. The problem is that calculators don’t have “common sense.” If you train a model on data where every fraudulent transaction happens on a Friday, the model will learn that “Friday = Fraud.” This is a spurious correlation, a close relative of overfitting: the model has memorized a quirk of the training set instead of the real signal, and it’s why your “smart” features will fail the moment the real world changes.

Why LLMs are Different (And Why They Aren’t)

The current hype is centered on Large Language Models (LLMs) like GPT-4 or Llama 3. People think these are different because they can “reason.” They can’t. They are “Transformers.” The “Transformer” architecture, introduced by Google in the 2017 paper “Attention Is All You Need,” changed everything because it allowed for parallelization.

Before Transformers, we used RNNs. RNNs processed text word-by-word. It was slow. It was like reading a book through a straw. Transformers use a mechanism called “Self-Attention.” This allows the model to look at every word in a sentence simultaneously and decide which words are most relevant to each other.

  • The Query (Q): What am I looking for?
  • The Key (K): What information do I have?
  • The Value (V): What is the actual content?
  • The Softmax: A way to turn these relationships into probabilities that add up to 1.
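The four pieces above fit together as scaled dot-product attention. The following is a minimal sketch in plain Python, not how any production framework implements it. One loud assumption: in a real Transformer, Q, K, and V come from learned linear projections of the token embeddings, while here the same made-up token vectors are reused for all three to keep it short.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that add up to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention(queries, keys, values):
    """For each query: score every key, softmax the scores, blend the values."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Query vs. every key, scaled by sqrt(d_k) as in the original paper
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors
        blended = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical 2-dimensional vectors for a 3-token sentence
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, attention_weights = attention(tokens, tokens, tokens)

for row in attention_weights:
    print([round(w, 3) for w in row])
```

Each printed row is one token’s view of the whole sentence. Every row is computed independently, which is where the parallelization comes from.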

When you ask an LLM “what is the capital of France?”, it isn’t “knowing” the answer. It is calculating that, given the tokens “What”, “is”, “the”, “capital”, “of”, “France”, the most statistically probable next token is “Paris.” It is a stochastic parrot. A very, very impressive one, but a parrot nonetheless.
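Here is roughly what “most statistically probable next token” means in code. The logits below are invented numbers for four candidate tokens (a real model scores its entire vocabulary), and the function names are mine, not any library’s API.

```python
import math
import random

# Hypothetical raw scores (logits) after "What is the capital of France?"
LOGITS = {"Paris": 9.1, "Lyon": 5.3, "France": 4.0, "the": 2.2}


def next_token_probs(logits, temperature=1.0):
    """Softmax over the logits; a higher temperature flattens the distribution."""
    scaled = {token: score / temperature for token, score in logits.items()}
    peak = max(scaled.values())
    exps = {token: math.exp(s - peak) for token, s in scaled.items()}
    total = sum(exps.values())
    return {token: e / total for token, e in exps.items()}


def sample(logits, temperature=1.0, rng=random):
    """Pick the next token: greedy at temperature 0, weighted dice roll otherwise."""
    if temperature == 0:
        return max(logits, key=logits.get)
    probs = next_token_probs(logits, temperature)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


probs = next_token_probs(LOGITS)
print({token: round(p, 4) for token, p in probs.items()})
print("greedy pick:", sample(LOGITS, temperature=0))
```

The model never “decides” on Paris. It assigns Paris most of the probability mass, and the sampler does the rest.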

The SRE Perspective: AI is an Infrastructure Nightmare

Most blog posts about AI focus on the models. I want to talk about the nvidia-smi output at 2:00 AM. Productionizing AI is significantly harder than productionizing a CRUD app. In a standard Go or Node.js app, your primary constraints are CPU and I/O. In AI, your constraints are VRAM and memory bandwidth.

If you want to run a Llama-3-70B model in production, you aren’t just deploying a container. You are dealing with:

  • Quantization: You can’t fit a 70B parameter model in 16-bit precision on a single A100 (80GB): at 2 bytes per parameter, the weights alone come to roughly 140GB. You have to “quantize” it to 4-bit or 8-bit. This is basically compressing the weights. It makes the model dumber but allows it to fit in memory.
  • CUDA Versions: Welcome to dependency hell. Your PyTorch version must match your CUDA toolkit version, which must match your NVIDIA driver version. If you update the host kernel, your whole inference stack might break.
  • Cold Starts: A 40GB model takes a long time to pull over the network and load into GPU memory. You can’t just “scale to zero” like you can with a Lambda function unless you want your users to wait 90 seconds for a response.
  • KV Caching: To make LLMs fast, we cache the “Keys” and “Values” of previous tokens. This eats VRAM like crazy. If your cache grows too large, you get an Out of Memory (OOM) error on the GPU, which is much harder to recover from than a standard Linux OOM.
  • Triton/vLLM: You need a specialized inference server to handle batching. If you send requests one by one, your GPU utilization will be 5%, and you’ll be burning money. You need “Continuous Batching” to keep the tensor cores busy.
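The quantization and KV cache bullets are just arithmetic, so here is a back-of-the-envelope sketch. The model shape (80 layers, 8 KV heads, head dimension 128) is my assumption about the published Llama-3-70B configuration; check it against the config of whatever model you actually deploy. The numbers ignore activations and framework overhead.

```python
GIB = 1024 ** 3


def weight_bytes(n_params, bits_per_param):
    """Memory needed for the weights alone at a given precision."""
    return n_params * bits_per_param / 8


def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """One key and one value vector per KV head per layer, for a single token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


N_PARAMS = 70e9  # 70B parameters

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_bytes(N_PARAMS, bits) / 1e9:.0f} GB")

# Assumed Llama-3-70B shape, with the cache stored in 16-bit precision
per_token = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"KV cache per 8,192-token sequence: {per_token * 8192 / GIB:.1f} GiB")
```

Under these assumptions, every concurrent long conversation costs you a couple of gigabytes on top of the weights. That is why batch size and context length, not just model size, decide how many GPUs you need.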

# Example of checking GPU health on a production node
$ nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.total,memory.used --format=csv
utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.used [MiB]
85 %, 64 %, 81920 MiB, 52428 MiB

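If you see memory.used close to memory.total but utilization.gpu at 10%, the model is loaded but the GPU is mostly idle, and you probably have a bottleneck in your data pipeline. Your CPU can’t feed the GPU fast enough. (Note that utilization.memory is not how full VRAM is; it reports the share of time the memory controller was busy reading or writing.) This is the kind of “what is AI” reality that doesn’t make it into the marketing slide decks.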
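You don’t want to eyeball this at 2:00 AM, so a check like it usually ends up in a script. The sketch below parses the CSV format shown above and flags a GPU whose memory is mostly occupied while compute sits idle. The 20% and 50% thresholds and the function names are placeholders I made up; tune them against your own workload.

```python
import csv
import io

# Sample text in the same format as the nvidia-smi query output above
SAMPLE = """utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.used [MiB]
85 %, 64 %, 81920 MiB, 52428 MiB
"""


def parse_gpu_csv(text):
    """Turn nvidia-smi CSV output into a list of {field_name: float} dicts."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    # "utilization.gpu [%]" -> "utilization.gpu"
    header = [column.split(" [")[0] for column in next(reader)]
    rows = []
    for raw in reader:
        if not raw:
            continue  # skip blank lines
        # "52428 MiB" -> 52428.0, "85 %" -> 85.0
        values = [float(cell.split()[0]) for cell in raw]
        rows.append(dict(zip(header, values)))
    return rows


def looks_starved(row, max_gpu_util=20.0, min_mem_fraction=0.5):
    """True when VRAM is mostly occupied but the compute units are nearly idle."""
    mem_fraction = row["memory.used"] / row["memory.total"]
    return row["utilization.gpu"] < max_gpu_util and mem_fraction >= min_mem_fraction


for gpu in parse_gpu_csv(SAMPLE):
    print(gpu, "starved:", looks_starved(gpu))
```

In practice you would feed this from a subprocess call or an exporter and alert on it, rather than from a hardcoded string.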

The “Vector Database” Fad

You can’t talk about “what is” AI today without mentioning Vector Databases (Pinecone, Milvus, Weaviate). The idea is that you turn your data into “embeddings” (vectors) and store them so the LLM can “search” them. This is called Retrieval-Augmented Generation (RAG).

Here is my hot take: Most of you don’t need a specialized vector database. You need pgvector.

I’ve seen teams spend three months setting up a dedicated vector DB cluster, dealing with new consistency models and backup strategies, when they could have just added an extension to their existing Postgres RDS instance.


-- The "AI" way to search in Postgres
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id serial PRIMARY KEY,
    content text,
    embedding vector(1536) -- 1536 dimensions, e.g. OpenAI's text-embedding-3-small
);

-- Find the most similar document using cosine distance
SELECT content FROM documents 
ORDER BY embedding <=> '[0.123, 0.456, ...]' 
LIMIT 5;

Is it as fast as a specialized C++ engine optimized for HNSW (Hierarchical Navigable Small World) graphs? Usually not, even with pgvector’s own HNSW index type. But it’s “production-ready” on day one. It’s ACID-compliant. It’s in your existing backup routine. Don’t add architectural complexity until your p99s demand it.
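There is nothing mystical behind that ORDER BY either. Without an index, the query amounts to a brute-force scan like the sketch below: compute the cosine distance from the query vector to every row, sort, and take the top few. The three-dimensional embeddings and document contents are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity, the quantity pgvector's <=> operator orders by."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query, documents, k=5):
    """Brute-force nearest neighbours: score every document, keep the k closest."""
    ranked = sorted(documents, key=lambda doc: cosine_distance(query, doc["embedding"]))
    return [doc["content"] for doc in ranked[:k]]


# Hypothetical documents with tiny 3-dimensional embeddings
DOCS = [
    {"content": "refund policy", "embedding": [0.9, 0.1, 0.0]},
    {"content": "gpu sizing guide", "embedding": [0.0, 0.2, 0.9]},
    {"content": "chargeback process", "embedding": [0.8, 0.3, 0.1]},
]

print(top_k([1.0, 0.0, 0.0], DOCS, k=2))
```

A linear scan like this is fine for tens of thousands of rows. An approximate index only starts to earn its complexity well beyond that.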

The Hidden Cost: Data Drift and the “Silent Failure”

When a standard microservice fails, it returns a 500 error. You get an alert. You fix it. When an AI model fails, it returns a 200 OK with a perfectly formatted, confident, but entirely wrong answer. This is “hallucination,” but the more dangerous version is “Data Drift.”

Data drift happens when the distribution of the data you are seeing in production changes from the data you used to train the model. Imagine you trained a model to predict house prices in 2021. In 2023, interest rates spiked. The model still thinks it’s 2021 because its weights are frozen. It’s still giving you “accurate” predictions based on its training, but its predictions are now useless in the real world.

Monitoring this is a nightmare. You have to monitor the “distribution” of your inputs. You need to use tools like Great Expectations or WhyLabs to see if the mean and variance of your features are shifting.
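As a rough idea of what those tools check under the hood, here is a dependency-free sketch that compares the mean and variance of one feature between training data and a production window. The thresholds and the sample mortgage-rate numbers are invented for illustration, and real drift monitoring typically uses sturdier statistics across many features.

```python
import statistics


def drift_report(training_values, production_values, max_z=3.0, max_var_ratio=2.0):
    """Flag a feature whose production mean or variance has moved away from training."""
    train_mean = statistics.mean(training_values)
    train_std = statistics.stdev(training_values)
    prod_mean = statistics.mean(production_values)
    prod_std = statistics.stdev(production_values)

    # How many standard errors the production mean sits from the training mean
    std_error = train_std / len(production_values) ** 0.5
    z_score = (prod_mean - train_mean) / std_error
    # Ratio of variances; 1.0 means the spread is unchanged
    var_ratio = (prod_std / train_std) ** 2

    drifted = (
        abs(z_score) > max_z
        or var_ratio > max_var_ratio
        or var_ratio < 1 / max_var_ratio
    )
    return {"z_score": z_score, "var_ratio": var_ratio, "drifted": drifted}


# Hypothetical feature: mortgage interest rate (%) seen at training time vs. today
TRAINING_RATES = [2.9, 3.0, 3.1, 3.0, 2.8, 3.2, 3.1, 2.9]
PRODUCTION_RATES = [6.5, 6.8, 7.0, 6.9, 6.7, 7.1]

print(drift_report(TRAINING_RATES, PRODUCTION_RATES))
```

The model will happily keep returning 200 OK through all of this. The alert has to come from a check like this one, running on the inputs, not from the model.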

Note to self: Always log the model version and the raw prompt in the metadata of your inference logs. If a user reports a “hallucination,” you need to be able to reproduce it exactly, which is nearly impossible if you’re using a non-deterministic temperature setting (> 0).
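A minimal sketch of what that log record might look like. The field names and the version tag are my own convention, not a standard; the point is that everything needed to replay the request travels with the response.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_inference_log(model_version, prompt, response_text, temperature, seed=None):
    """Bundle everything needed to replay an inference request into one record."""
    return {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "temperature": temperature,
        "seed": seed,
        "prompt": prompt,
        # A hash makes it cheap to group identical prompts in log queries
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "response": response_text,
    }


record = build_inference_log(
    model_version="spam-classifier-v7",  # hypothetical version tag
    prompt="Is this comment spam? Buy cheap watches now",
    response_text="yes",
    temperature=0.0,
)
print(json.dumps(record))
```

If your prompts can contain personal data, treat these logs with the same care as the rest of your PII.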

The Reality of “Prompt Engineering”

There is a whole cottage industry of “Prompt Engineers.” Let’s be clear: Prompt engineering is just “voodoo programming” for the 2020s. It’s the equivalent of hitting the side of a CRT monitor to get the picture to stop flickering.

If your system’s reliability depends on whether you told the model to “take a deep breath” or “I will tip you $200 for a correct answer,” you don’t have an engineering system. You have a fragile heuristic. Real AI engineering is about:

  1. Evaluation Frameworks: You need a suite of 1,000+ test cases (inputs and expected outputs) that you run every time you change a prompt or a model version.
  2. Few-Shot Prompting: Providing 5-10 examples of the task within the prompt to “prime” the model.
  3. Fine-Tuning: Actually updating the weights of a smaller model (like Mistral 7B) on your specific domain data instead of trying to coax a general-purpose model into behaving.
  4. Output Parsing: Using libraries like Instructor or Outlines to force the LLM to return valid JSON that matches a Pydantic schema. Never, ever use response.split(",") on an LLM output. It will break.

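import instructor
import openai
from pydantic import BaseModel

# Don't do this: a free-text answer you then have to parse by hand
prompt = f"Is this comment spam? {comment_text}. Answer yes or no."

# Do this (using a structured output library)
class SpamDetection(BaseModel):
    is_spam: bool
    confidence_score: float
    reasoning: str

# instructor.patch wraps the client so create() accepts response_model
client = instructor.patch(openai.OpenAI())
response = client.chat.completions.create(
    model="gpt-4",
    response_model=SpamDetection,
    messages=[
        # State the task explicitly; the raw comment alone is not an instruction
        {"role": "system", "content": "Classify whether the user's comment is spam."},
        {"role": "user", "content": comment_text},
    ],
)
# response is a validated SpamDetection instance, e.g. response.is_spam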
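The evaluation framework from item 1 doesn’t need to be fancy to be useful. Here is a minimal sketch of a regression suite for a spam classifier. The keyword classifier is a stand-in so the example runs offline; in practice you would pass in a function that calls your model. The test cases and the 95% bar are invented, and a real suite would have the 1,000+ cases mentioned above.

```python
def run_eval(classify, test_cases, min_accuracy=0.95):
    """Run every test case through the classifier and collect the failures."""
    failures = []
    for case in test_cases:
        got = classify(case["input"])
        if got != case["expected"]:
            failures.append({"input": case["input"], "expected": case["expected"], "got": got})
    accuracy = 1 - len(failures) / len(test_cases)
    return {"accuracy": accuracy, "passed": accuracy >= min_accuracy, "failures": failures}


def keyword_classifier(text):
    """Stand-in for the real model call, so the sketch runs without an API key."""
    return "free money" in text.lower()


# Hypothetical test cases: inputs paired with the expected is_spam value
TEST_CASES = [
    {"input": "Claim your FREE MONEY today", "expected": True},
    {"input": "Great write-up, thanks for sharing", "expected": False},
    {"input": "Does this work with Postgres 15?", "expected": False},
    {"input": "Cheap watches, click my profile", "expected": True},
]

report = run_eval(keyword_classifier, TEST_CASES)
print(f"accuracy: {report['accuracy']:.2f}, passed: {report['passed']}")
for failure in report["failures"]:
    print("FAILED:", failure)
```

Wire this into CI so a prompt tweak or a model upgrade that drops accuracy fails the build instead of reaching production.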

The “What Is” of the Future

We are currently in the “throw more GPUs at it” phase of AI. It reminds me of the early days of NoSQL, where everyone was dumping their relational data into MongoDB because “schemas are for losers,” only to realize three years later that data integrity actually matters.

Eventually, the hype will die down. We will stop calling it “AI” and start calling it “the probabilistic layer of the stack.” We will use it for things it’s good at—summarization, translation, fuzzy matching—and we will stop trying to use it for things it’s bad at—math, logic, and being a source of truth.

The real “intelligence” isn’t in the model. It’s in the engineering around the model. It’s in the rate limiting, the caching, the evaluation loops, and the fallback mechanisms. If the LLM fails, does your app crash? Or do you have a rule-based fallback, a keyword trie or a handful of regexes, that can handle the basic cases?
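A fallback can be as boring as a try/except around the model call. In this sketch, llm_classify is any callable you supply (hypothetical here), and the regex patterns are placeholder rules, not a real spam list.

```python
import re

# Placeholder rules for the "basic cases"; a real list would be curated
SPAM_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (r"free money", r"click here", r"work from home")
]


def rule_based_is_spam(text):
    """Dumb, deterministic, and always available."""
    return any(pattern.search(text) for pattern in SPAM_PATTERNS)


def classify_with_fallback(text, llm_classify):
    """Try the model first; if it raises (timeout, 5xx, bad JSON), use the rules."""
    try:
        return {"is_spam": bool(llm_classify(text)), "source": "llm"}
    except Exception:
        return {"is_spam": rule_based_is_spam(text), "source": "rules"}


def broken_llm(text):
    """Simulates an outage of the inference service."""
    raise TimeoutError("inference service unavailable")


print(classify_with_fallback("Click HERE for free money", broken_llm))
print(classify_with_fallback("Nice article", lambda text: False))
```

Recording which path answered (the source field) matters: a sudden jump in "rules" responses is your earliest signal that the model is down.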

Most companies don’t need an “AI Strategy.” They need a “Data Strategy.” You can’t build a 10th-floor penthouse (AI) if your foundation (data quality) is made of wet sand. I’ve seen teams spend millions on LLMs while their core database didn’t even have proper foreign key constraints. Fix your data first. The “intelligence” part is easy once the data is clean.

Stop treating AI like a magic wand and start treating it like a very temperamental, very expensive legacy service that you didn’t write but have to support. That is the only way to survive the hype cycle without losing your mind or your budget.

If you can’t explain why the model made a specific decision, don’t give it the keys to your production environment; keep it in a sandbox until you’ve built enough observability to catch it when it inevitably lies to you.
