AWS is Just Someone Else’s Data Center (And Why You’re Paying Too Much for It)
It was 3:14 AM on a Tuesday when my PagerDuty went off. The alert was a generic TargetResponseTime high on our main production ALB. I logged in, eyes blurry, and saw that our NAT Gateway was processing 4.5 Gbps of traffic, which was odd because our baseline was 200 Mbps. A junior dev had pushed a “minor” change to a data processing worker that, instead of hitting an internal S3 endpoint, was routing all traffic through the public internet. By the time I killed the process and rolled back the deployment, we had racked up $12,000 in data transfer fees in under four hours. AWS didn’t care that it was a mistake; the API calls were valid, the packets moved, and the billing meter kept spinning.
That is the reality of the cloud. It isn’t a magical “elastic” solution to all your problems. It is a hyper-optimized billing engine that happens to provide virtualized compute and storage. When people ask “what is aws,” they usually want a definition of cloud computing. I’m not going to give you the marketing fluff about “digital transformation.” I’m going to tell you what it actually is: a massive collection of APIs that abstract away the pain of racking servers, but replace that pain with the complexity of distributed systems and a bill that looks like a phone book.
The API Abstraction: What is AWS, Really?
At its core, AWS (Amazon Web Services) is a set of remote APIs. That’s it. When you click a button in the console to launch an EC2 instance, you aren’t “creating” a server. You are sending a POST request to an endpoint like ec2.us-east-1.amazonaws.com. Amazon’s backend orchestration layer—which is likely a mix of Java, Go, and legacy systems we’ll never see—receives that request, checks your IAM permissions, verifies your credit card isn’t expired, and then finds a slice of a physical host in a data center in Northern Virginia to carve out for you.
The documentation makes this sound effortless. It isn’t. You are dealing with shared responsibility. Amazon manages the physical security, the hypervisor, and the cooling fans. You manage the OS patches, the firewall rules (Security Groups), and the fact that your application will eventually crash because of a java.lang.OutOfMemoryError. If you think moving to AWS means you don’t need an Ops team, you’ve already lost. You just need an Ops team that knows how to read JSON instead of how to crimp CAT6 cables.
Pro-tip: Never trust the AWS Web Console for production changes. It’s a great way to accidentally leave a db.m5.4xlarge running over the weekend. Use Terraform or Pulumi. If it isn’t in code, it doesn’t exist.
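If you want to catch that forgotten db.m5.4xlarge before the invoice does, a periodic audit is one CLI call. This is a sketch that assumes the AWS CLI is installed and configured with read permissions:

```shell
# List every RDS instance, its class, and its status.
# Anything big and "available" that nobody recognizes is a conversation.
aws rds describe-db-instances \
  --query 'DBInstances[].[DBInstanceIdentifier,DBInstanceClass,DBInstanceStatus]' \
  --output table
```

Run it from a cron job or a CI pipeline and pipe the output to Slack; the point is that someone sees it every day.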
Compute: Renting Someone Else’s CPU
When we talk about compute in AWS, we usually mean EC2 (Elastic Compute Cloud). But the “what is” of compute has shifted. It’s no longer just about VMs. It’s about how much of the stack you’re willing to manage.
- EC2: You choose the AMI (Amazon Machine Image), you manage the kernel, you handle the SSH keys. You get a Connection Refused error at 2 AM because your disk filled up with logs.
- Lambda: You provide a ZIP file of code. AWS runs it. You pay for the execution time. Sounds great until you hit a “Cold Start” and your API latency spikes to 3 seconds because your Java runtime takes forever to initialize.
- Fargate: You provide a Docker image. AWS runs the container. You don’t see the underlying host. It’s “serverless” containers, but you’ll still spend half your life debugging why your ENTRYPOINT script is failing with exec format error.
- EKS: Managed Kubernetes. It’s AWS’s way of saying “We know Kubernetes is hard, so we’ll charge you $0.10 an hour just to keep the Control Plane alive.”
Most people start with EC2. They pick a t3.micro because it’s cheap. Then they realize that “burstable” CPU means that once you run out of CPU credits, your server performs like a calculator from 1995. For production, you use the m or c families. We use m6i.large for most of our general-purpose workloads. It’s a solid balance of memory and compute, and the Nitro system means the I/O overhead is negligible.
```shell
# Example: Checking instance metadata from within an EC2 node.
# This is how your app knows "where" it is.
TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id
# Output: i-04983274982374abc
```
If you’re still using IMDSv1 (the one without the token), stop. It’s a massive security hole. SSRF vulnerabilities in your web app can leak your IAM role credentials if you don’t enforce IMDSv2. I’ve seen entire AWS accounts compromised because a dev left a debugging endpoint open that could proxy requests to the metadata service.
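Enforcing IMDSv2 on an existing instance is a single API call. A sketch, with the instance ID as a placeholder:

```shell
# Require session tokens for all metadata requests on this instance.
# An SSRF that can only make a GET can no longer reach the credentials path.
aws ec2 modify-instance-metadata-options \
  --instance-id i-04983274982374abc \
  --http-tokens required \
  --http-endpoint enabled
```

For new launches, bake this into your launch template so nobody has to remember it.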
Storage: The S3 Rabbit Hole
What is AWS without S3? S3 (Simple Storage Service) is arguably the most successful API in the history of the internet. It’s an object store, not a file system. This is a distinction that trips up everyone. You don’t “edit” a file in S3. You overwrite it.
S3 is “highly durable,” meaning they won’t lose your data (99.999999999% durability). But it is only “highly available” if you design it that way. I’ve seen us-east-1 go down and take half the internet with it because everyone put their buckets in the same region. If your business depends on S3, you need Cross-Region Replication (CRR).
And then there’s the cost. S3 storage is cheap. S3 requests and data transfer are where they get you. If you have a script that calls LIST on a bucket with 10 million objects every minute, your bill will explode. We once had a legacy PHP app that checked for the existence of a config.json in S3 on every single page load. 1,000 requests per second. The HEAD request costs were higher than our EC2 costs.
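The arithmetic on that PHP app is worth doing explicitly. A back-of-envelope sketch, assuming roughly $0.0004 per 1,000 GET/HEAD-tier requests (the us-east-1 Standard-tier ballpark; check current pricing for your region):

```shell
# 1,000 HEAD requests per second, every second, for a 30-day month.
RPS=1000
MONTHLY=$(awk -v rps="$RPS" 'BEGIN { printf "%.2f", rps * 2592000 / 1000 * 0.0004 }')
echo "S3 request cost: \$${MONTHLY}/month"
# → S3 request cost: $1036.80/month
```

Over a thousand dollars a month to repeatedly ask whether a file that never changes still exists. Cache the config in memory and the number drops to approximately zero.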
- Standard: For data you touch often.
- Intelligent Tiering: Use this if you’re lazy. It moves stuff to cheaper tiers based on access patterns. It costs a small monitoring fee per object, but it usually saves money.
- Glacier Deep Archive: For the stuff you need to keep for legal reasons but hope to never see again. Recovery takes 12 hours. It’s basically digital cold storage.
- EBS (Elastic Block Store): This is the “hard drive” for your EC2. Pro-tip: Use gp3 instead of gp2. It’s cheaper and you can scale IOPS independently of storage size. There is zero reason to use gp2 in 2024.
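Finding and migrating lingering gp2 volumes is straightforward; the modification happens online, with no detach and no downtime. A sketch (the volume ID is a placeholder):

```shell
# Find every gp2 volume still in the account.
aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query 'Volumes[].[VolumeId,Size,State]' --output table

# Convert one in place. The volume stays attached and usable throughout.
aws ec2 modify-volume --volume-id vol-0abc1234 --volume-type gp3
```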
Networking: The Hidden Tax
This is where the “what is” of AWS gets complicated. Networking in AWS is handled by the VPC (Virtual Private Cloud). Think of it as your own private slice of the Amazon network. You define your IP ranges (CIDR blocks), your subnets, and your routing tables.
The biggest mistake I see? Putting everything in a public subnet. If your database has a public IP, you’ve already failed. Your database should be in a private subnet, accessible only via a VPN or a Bastion host (or better yet, SSM Session Manager).
But the real killer is the NAT Gateway. To let your private servers talk to the internet (to run apt-get update or hit an external API), you need a NAT Gateway. AWS charges $0.045 per hour for the gateway itself, plus $0.045 per GB of data processed. That “processed” fee is a double-dip. You pay for the data leaving your EC2 instance, and then you pay for the NAT Gateway to “process” it. If you’re moving terabytes of data to an external S3 bucket or a third-party API like api.stripe.com, you are getting hosed.
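For S3 traffic specifically, there is a way out: a Gateway VPC Endpoint routes S3 requests over Amazon’s network instead of through your NAT Gateway, and the gateway-type endpoint itself is free. It would not have helped with a third party like api.stripe.com, but it is exactly the fix for the kind of S3 misrouting in the 3 AM story. A sketch, with placeholder IDs:

```shell
# Add an S3 gateway endpoint to the VPC's private route table.
# S3-bound traffic then bypasses the NAT Gateway (and its per-GB fee).
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc1234 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0abc1234
```

Note the region baked into the service name: a gateway endpoint only covers buckets in its own region.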
```hcl
# A typical Terraform VPC snippet that people copy-paste without thinking
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
}

resource "aws_subnet" "private_1" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.1.0/24"
  availability_zone = "us-east-1a"
}
```
Note that 10.0.0.0/16 gives you 65,536 IP addresses. You will never need that many. But if you pick a range that overlaps with your corporate VPN or another VPC you need to peer with later, you are in for a world of hurt. Re-IPing a VPC is a nightmare that involves rebuilding everything from scratch.
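Checking for overlap before you commit to a CIDR takes thirty seconds and saves the rebuild. A minimal bash sketch (no ipcalc required; uses bash arithmetic, so run it with bash):

```shell
# Does a proposed CIDR overlap an existing one?
ip_to_int() {
  local IFS=.
  read -r a b c d <<< "$1"
  echo $(( (a << 24) + (b << 16) + (c << 8) + d ))
}

cidrs_overlap() {
  local net1=${1%/*} len1=${1#*/} net2=${2%/*} len2=${2#*/}
  local start1 end1 start2 end2
  start1=$(( $(ip_to_int "$net1") & (0xFFFFFFFF << (32 - len1)) & 0xFFFFFFFF ))
  end1=$(( start1 + (1 << (32 - len1)) - 1 ))
  start2=$(( $(ip_to_int "$net2") & (0xFFFFFFFF << (32 - len2)) & 0xFFFFFFFF ))
  end2=$(( start2 + (1 << (32 - len2)) - 1 ))
  # Two ranges overlap iff each one starts before the other ends.
  (( start1 <= end2 && start2 <= end1 ))
}

cidrs_overlap 10.0.0.0/16 10.0.5.0/24 && echo "overlap"     # → overlap
cidrs_overlap 10.0.0.0/16 10.1.0.0/16 || echo "no overlap"  # → no overlap
```

Run your proposed block against the corporate VPN range and every VPC you might ever need to peer with. "Might ever" means all of them.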
IAM: The “Who Can Do What” Nightmare
IAM (Identity and Access Management) is the most powerful and most frustrating part of AWS. It defines who (a user or a service) can do what (an action) on which resource. Most people start by giving themselves AdministratorAccess and calling it a day. Then they create an “app-user” and give it AmazonS3FullAccess.
This is how you get pwned. If that app-user’s access keys are leaked (e.g., pushed to a public GitHub repo), the attacker can delete every bucket in your account. The principle of least privilege isn’t just a buzzword; it’s a survival strategy. Your IAM policy should look like this:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::production-assets-bucket-992837/*"
    }
  ]
}
```
Notice the specific bucket ARN. Notice the specific actions. No s3:*. No Resource: "*". If I see a wildcard in a production IAM policy during a PR review, I reject it immediately.
Also, stop using IAM Users for applications. Use IAM Roles. If your code is running on EC2, give the EC2 instance an Instance Profile. The AWS SDK will automatically pick up the temporary credentials. No keys to leak. No secrets to rotate. It’s one of the few things AWS actually got right.
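Verifying which role an instance is actually running as is the same metadata dance from earlier. A sketch, run on the instance itself:

```shell
# Fetch an IMDSv2 token, then list the role attached via the instance profile.
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/iam/security-credentials/

# Or, from anywhere the SDK/CLI has credentials:
aws sts get-caller-identity
```

If get-caller-identity comes back with an assumed-role ARN, you’re doing it right. If it comes back with an IAM user, go find the access key and kill it.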
The Managed Service Trap
AWS loves to tell you to use their managed services. “Don’t run Postgres on EC2, use RDS!” “Don’t run Kafka, use MSK!” “Don’t run Redis, use ElastiCache!”
The pitch is that they handle the backups, the patching, and the high availability. And they do. But they also lock you in. RDS Postgres is mostly Postgres, but you don’t have superuser access. You can’t install certain extensions. You can’t tweak the underlying OS parameters. And you pay a 30-50% premium for the privilege.
I’ve seen companies spend $50,000 a month on RDS because they were too scared to manage their own DB nodes. That’s fine if you have the margin. But if you’re a high-traffic, low-margin business, that “convenience tax” is the difference between profit and loss.
The “what is” of managed services is often just a wrapper around EC2. When you launch an RDS instance, AWS is just spinning up an EC2 instance in a hidden VPC that they manage, and then charging you extra for the automation scripts they wrote to manage it. Sometimes those scripts fail. I’ve had RDS instances get stuck in modifying state for 8 hours during a simple storage upgrade. When that happens, you’re at the mercy of AWS Support. And unless you’re paying for the $15,000/month Enterprise Support plan, you’re going to be waiting a long time.
The “Real World” Gotchas
If you’re going to survive in AWS, you need to know the things that aren’t in the “Getting Started” guides. Here are a few that have cost me sleep:
- The us-east-1 Curse: This is the oldest region. It’s where all the new features land first. It’s also the most unstable. When AWS has a global outage, it almost always starts in us-east-1. If you can, use us-west-2 (Oregon) or eu-central-1 (Frankfurt). They are generally more stable.
- CloudWatch Logs Cost: Ingesting logs into CloudWatch is expensive ($0.50 per GB in some regions). If your app is chatty and logging every SQL query, your logging bill might end up higher than your compute bill. Set retention policies! By default, CloudWatch logs are kept “Forever.” That’s a lot of dead data you’re paying for.
- EBS Snapshots: They are incremental, which is cool. But if you delete the original volume, the snapshots remain. Over years, these “orphaned” snapshots can accumulate and cost thousands. I once found 40TB of snapshots from 2018 in a client’s account.
- The “Default” VPC: Every account comes with one. Delete it. Or at least don’t use it. It’s a security risk because it’s designed for ease of use, not security. It has public subnets and open routing by default.
- Soft Limits: AWS has limits on everything. How many instances you can run, how many VPCs you can have, how many API calls you can make per second. These limits are “soft,” meaning you can ask to increase them, but it takes time. If you plan to scale from 10 to 1,000 nodes for a Black Friday sale, you better ask for that limit increase three weeks in advance.
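The CloudWatch retention fix from the list above is two CLI calls: find the log groups with no expiry, then cap them. A sketch (the log group name is a placeholder):

```shell
# Log groups with retentionInDays unset are kept forever -- list them.
aws logs describe-log-groups \
  --query 'logGroups[?retentionInDays==`null`].logGroupName' --output text

# Cap one at 30 days. Existing data past the window is purged for free.
aws logs put-retention-policy \
  --log-group-name /aws/lambda/my-function \
  --retention-in-days 30
```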
The Billing Dashboard is Your Best Friend (and Worst Enemy)
You cannot understand “what is aws” without understanding the Cost Explorer. It is the only source of truth. You should be looking at it daily.
One of the most effective things I ever did as an SRE was setting up a Lambda function that posted our daily spend to a Slack channel. When the team saw that a “quick test” of a SageMaker notebook cost us $400 in a day, they started being a lot more careful. AWS makes it too easy to spend money. There is no “Are you sure? This will cost $5/hour” confirmation box. There is just an aws ec2 run-instances command and a bill at the end of the month.
Note to self: Check for unattached Elastic IPs. AWS charges you when they are NOT attached to a running instance. It’s their way of discouraging IP squatting, but it’s an easy way to leak $50/month for nothing.
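Finding those leaking Elastic IPs is one query. A sketch, assuming CLI read access:

```shell
# Elastic IPs with no association are pure waste -- list them for release.
aws ec2 describe-addresses \
  --query 'Addresses[?AssociationId==`null`].[PublicIp,AllocationId]' \
  --output table
```

Anything that shows up here and isn’t deliberately reserved should be released with `aws ec2 release-address`.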
Troubleshooting the “Black Box”
When things go wrong in AWS, they go wrong in ways that are hard to debug. You don’t have access to the physical hardware, so you have to rely on the metrics AWS gives you.
If an EC2 instance is “Status Check Failed,” it usually means the underlying hardware died. You can’t “fix” it. You just stop and start the instance, which forces it to move to a new physical host. This is why your architecture must be “disposable.” If you are afraid to terminate an instance, you aren’t doing cloud right. You’re just doing “someone else’s data center” wrong.
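The stop-then-start sequence matters here: a reboot keeps the instance on the same (possibly dying) physical host, while a full stop and start lets the scheduler place it on new hardware. A sketch, with a placeholder instance ID:

```shell
# Migrate an instance off failed hardware. Note: NOT a reboot.
aws ec2 stop-instances --instance-ids i-04983274982374abc
aws ec2 wait instance-stopped --instance-ids i-04983274982374abc
aws ec2 start-instances --instance-ids i-04983274982374abc
```

Be aware that instance-store (ephemeral) volumes are wiped by a stop/start, and the public IP changes unless you’re using an Elastic IP.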
If you’re seeing packet loss, check your ethtool stats. You might be hitting the network limits of your instance size. AWS throttles bandwidth and PPS (packets per second) on smaller instances. You won’t see this in CloudWatch. You have to see it in the OS.
```shell
# Checking for network drops and throttling on an EC2 instance (ENA driver).
# Grep for both raw drops and the allowance counters -- plain 'drop' alone
# won't surface the throttling stats.
ethtool -S eth0 | grep -E 'exceeded|drop'
# If 'bw_in_allowance_exceeded' is climbing, you need a bigger instance.
```
The Verdict
So, what is AWS? It’s a tool. It’s a very sharp, very expensive tool that can either help you build a global-scale application in weeks or bankrupt your startup in months. It is not a silver bullet. It is not a “set it and forget it” platform. It requires constant monitoring, aggressive cost management, and a healthy dose of skepticism toward their marketing materials.
Most people use AWS because “nobody ever got fired for buying IBM,” and AWS is the IBM of the 21st century. It’s the safe choice. But the safe choice comes with a complexity tax that will haunt your engineering team forever. If you’re going to use it, learn the primitives. Understand the networking. Master IAM. And for the love of everything holy, set up a billing alert before you do anything else.
Stop treating the cloud like a playground and start treating it like the high-stakes resource allocation problem it actually is.