POST-MORTEM REPORT: INCIDENT #4092 – GATEWAY OOM COLLAPSE
DATE: 2024-10-24
PREPARED BY: Senior Systems Engineer (Embedded/Firmware Dept.)
SUBJECT: THE CATASTROPHIC FAILURE OF ABSTRACTION: WHY THE PYTHON LIST IS NOT AN ARRAY
0. THE SMOKING GUN: TERMINAL LOGS
I was woken up at 03:15 by the pager because the production gateway—a machine with 16GB of ECC RAM that should be handling telemetry for a decade without a reboot—decided to commit ritual suicide. Here is what the logs show right before the kernel OOM killer stepped in to put your “modern” Python service out of its misery.
[2024-10-24 03:14:02] DEBUG: Ingesting sensor_batch_id=992834
[2024-10-24 03:14:05] TRACE: list.append(sensor_data)
[2024-10-24 03:14:08] ERROR: Internal Python Memory Allocation Failure
python3.12(12345,0x7fff) malloc: *** error for object 0x105604000: pointer being freed was not allocated
Fatal Python error: _PyObject_GC_Alloc: Could not allocate memory
Python runtime state: initialized
Current thread 0x00007fff (most recent call first):
File "/opt/gateway/ingest.py", line 42 in process_telemetry
File "/opt/gateway/main.py", line 12 in <module>
Memory Investigation:
>>> import sys
>>> data_points = list(range(1000000))
>>> print(f"List object size: {sys.getsizeof(data_points)} bytes")
List object size: 8000056 bytes
>>> print(f"Actual memory usage: {sum(sys.getsizeof(i) for i in data_points) + sys.getsizeof(data_points)} bytes")
Actual memory usage: 36000056 bytes
>>> print(f"Address of first element: {id(data_points[0])}")
Address of first element: 4302142128
>>> print(f"Address of second element: {id(data_points[1])}")
Address of second element: 4302142160
>>> # Difference: 32 bytes. Not 4. Not 8. 32.
You thought you were just “adding items to a list.” In reality, you were building a fragmented skyscraper of pointers on a foundation of sand.
1. THE FRAUD OF THE CONTIGUOUS BUFFER
In C, when I declare uint32_t buffer[1024];, I know exactly where my data is. I have 4096 bytes of contiguous physical memory. I can calculate the offset of any element with a single addition to the base pointer. The CPU’s prefetcher loves me. The L1 cache is my friend.
In Python 3.12.5, a list is not an array. It is a PyListObject, a heap-allocated structure that points to another heap-allocated array of pointers, each of which points to yet another heap-allocated object.
Look at the PyListObject definition in Include/cpython/listobject.h:
typedef struct {
    PyObject_VAR_HEAD
    PyObject **ob_item;    // A pointer to an array of pointers.
    Py_ssize_t allocated;  // Slots allocated; always >= ob_size.
} PyListObject;
When you “append” to a list, you aren’t moving data. You are manipulating a pointer to a pointer. This is a double-indirection nightmare. To access my_list[5], the CPU must:
1. Load the address of the PyListObject.
2. Dereference ob_item to find the array of pointers.
3. Calculate the offset (index * 8 bytes on our 64-bit gateway).
4. Load the address stored at that offset.
5. Dereference that address to finally reach the PyObject (like a PyLongObject).
Compare this to C:
// C Code: Direct access. 1 Load instruction.
uint32_t val = buffer[5];
// Python "List" Access:
// 1. Load list object
// 2. Load ob_item pointer
// 3. Load pointer at ob_item[5]
// 4. Load value from PyObject->ob_digit
You are forcing the CPU to stall on every single access while it waits for the memory controller to fetch non-contiguous addresses. You aren’t writing software; you’re writing a manual for cache misses.
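The gap is visible from pure Python, no disassembler required: a real buffer speaks the buffer protocol, a list cannot, because a list has no contiguous payload to export. A minimal sketch, standard library only:

```python
from array import array

# A contiguous C buffer of unsigned 32-bit ints.
buf = array('I', [1, 2, 3, 4])
mv = memoryview(buf)
print(mv.contiguous, mv.itemsize, mv.nbytes)  # True 4 16

# A list cannot even participate in the buffer protocol:
# there is no flat payload behind it to hand out.
try:
    memoryview([1, 2, 3, 4])
except TypeError:
    print("list: no buffer to export")
```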
2. THE PYOBJECT TAX: PAYING FOR AIR
You juniors love to talk about “clean code” while ignoring the fact that your “clean” Python integers are 28 bytes long.
In C, a 32-bit integer is 4 bytes. Period.
In Python 3.12, every single integer in that list is a full-blown PyObject.
import sys
x = 42
print(sys.getsizeof(x)) # Result: 28
Why 28 bytes? Because of the PyObject_HEAD. Every object needs a reference count (ob_refcnt, 8 bytes) and a pointer to its type object (ob_type, 8 bytes). Then, because Python integers are arbitrary precision, it needs a size field and then the actual digits.
When you created that list of 1,000,000 sensor readings, you didn’t use 4MB of RAM (like a C uint32_t array would). You used:
– 8MB for the ob_item pointer array.
– 28MB for the 1,000,000 PyLongObject instances.
– Total: 36MB.
You inflated the memory footprint ninefold (36MB where 4MB would do) just by choosing a list over a proper buffer. On a gateway handling 100 concurrent streams, that’s the difference between “running smoothly” and “kernel panic at 3 AM.”
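The fix is already in the standard library. A sketch of the same million readings stored both ways; exact byte counts vary slightly by CPython build, but the order of magnitude does not:

```python
import sys
from array import array

n = 1_000_000
as_list = list(range(n))
as_array = array('I', range(n))  # raw C uint32 values, one contiguous buffer

# A list's true cost is the pointer table plus every boxed PyLongObject.
list_total = sys.getsizeof(as_list) + sum(sys.getsizeof(v) for v in as_list)
# An array's cost is just the object header plus the raw 4-byte elements.
array_total = sys.getsizeof(as_array)

print(f"list : {list_total:,} bytes")   # ~36 MB
print(f"array: {array_total:,} bytes")  # ~4 MB
```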
3. LIST_RESIZE AND THE ARITHMETIC OF FAILURE
The reason the gateway crashed during the “ingestion” phase is the list_resize function in Objects/listobject.c. Python lists are over-allocated to provide $O(1)$ amortized append time. But “amortized” is a word used by people who don’t care about deterministic latency.
Let’s look at the growth algorithm in the CPython source:
// Simplified logic from Objects/listobject.c: list_resize (CPython 3.12)
new_allocated = ((size_t)newsize + (newsize >> 3) + 6) & ~(size_t)3;
When the list exceeds its current capacity, Python calls realloc(). The growth pattern for a list starting at 0 elements looks like this:
0, 4, 8, 16, 24, 32, 40, 52, 64, 76...
Every time that resize happens, the entire array of pointers (ob_item) might be moved in memory. If your heap is fragmented—which it is, because Python’s garbage collector is constantly churning through small objects—realloc can’t just extend the block. It has to find a new hole, copy all the pointers, and then free the old block.
During your “ingest” loop, the gateway was spending 40% of its CPU cycles just moving pointers around because you didn’t pre-allocate the buffer. In C, I would have used malloc(count * sizeof(uint32_t)) once. You used realloc dozens of times, fragmenting the heap until there wasn’t a single contiguous block of 8MB left, even though 12GB of RAM was “free.”
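You can watch the over-allocation happen from the REPL. A sketch that records every capacity jump across 100 appends (the exact sizes depend on the CPython version, but the shape of the curve does not):

```python
import sys

lst = []
jumps = []
last = sys.getsizeof(lst)
for i in range(100):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last:                 # a list_resize + realloc just happened
        jumps.append((len(lst), size))
        last = size
print(jumps)   # far fewer resize events than appends: that is "amortized"
```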
4. POINTER INDIRECTION HELL AND CACHE LOCALITY
Modern CPUs are designed to stream data from memory. They use L1, L2, and L3 caches to hide the massive latency of DRAM. A C array of integers is a beautiful, contiguous stream of bits. The CPU sees you reading index 0, 1, 2 and it pre-fetches 3, 4, 5, 6 into the L1 cache before you even ask for them.
A Python list is a map of a minefield.
Because each PyObject is allocated separately on the heap, my_list[0] might be at memory address 0x1000, and my_list[1] might be at 0x5000.
# Real-world addresses from the crash dump
# id(list[n])
4302142128 # n=0
4302142160 # n=1
4302142192 # n=2
...
4305821040 # n=1000 (Who knows where this is in physical RAM?)
When you iterate over that list to calculate a rolling average, the CPU’s branch predictor and prefetcher are useless. Every single element access is a potential cache miss. You are forcing the CPU to wait hundreds of nanoseconds for the RAM to respond, over and over again. This is why the gateway’s CPU usage was at 100% while it was doing “simple” math. It wasn’t calculating; it was waiting for the bus.
5. THE GARBAGE COLLECTOR’S FALSE PROMISES
You thought the Garbage Collector (GC) would save you. You thought that when the process_telemetry function finished, the memory would be returned to the system.
Wrong.
Python 3.12 uses reference counting as its primary mechanism, supplemented by a generational cyclic garbage collector. When you have a list of a million objects, and you delete that list, Python has to decrement the reference count of every single one of those million objects.
If those objects have a reference count of zero, Python then has to call the deallocator for each one. This isn’t a single free() call. This is a million calls to PyObject_Free.
Furthermore, because of the way Python’s small-object allocator (pymalloc) sits on top of the C malloc implementation (glibc’s ptmalloc on our gateway), freeing memory doesn’t necessarily return it to the OS. Freed blocks go back to Python’s internal “arenas,” and an arena is only released once every object in it is dead. Your process RSS (Resident Set Size) stays high, the OOM killer sees a bloated process, and it kills the gateway.
In C, I manage the lifetime. I know when the buffer is dead. I free() the buffer. One call. Done. The memory is available for the network stack immediately.
6. WHY YOUR “CLEAN CODE” IS HARDWARE SABOTAGE
The script that crashed the gateway used a list of dictionaries. A list… of dictionaries.
# The "Naive" Script that killed us
telemetry_data = []
for i in range(MAX_SENSORS):
telemetry_data.append({
'id': i,
'val': get_reading()
})
Let’s do the math on the overhead for a single entry in that list:
1. Pointer in the list: 8 bytes.
2. PyDictObject: ~184 bytes for a small dict in CPython 3.12.
3. Value for “id” (int): 28 bytes.
4. Value for “val” (float): 24 bytes.
(The key strings “id” and “val” are interned, so they are paid for once, not per entry — the only mercy in this design.)
Each “reading” that should take 8 bytes (4-byte ID, 4-byte float) is taking roughly 250 bytes. That is a 30x multiplier in memory waste. You are treating a resource-constrained edge gateway like it’s a serverless function with infinite RAM. It isn’t.
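You can reproduce the arithmetic in the interpreter. A sketch measuring one entry’s real cost (the dict size varies by CPython minor version, so treat the total as an order of magnitude):

```python
import sys

entry = {'id': 1000, 'val': 12.5}

per_entry = (
    8                              # pointer slot in the outer list
    + sys.getsizeof(entry)         # the PyDictObject itself
    + sys.getsizeof(entry['id'])   # a non-cached int: 28 bytes
    + sys.getsizeof(entry['val'])  # a float: 24 bytes
)
print(per_entry)   # hundreds of bytes wrapping 8 bytes of payload
```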
MANDATORY ACTION ITEMS
Effective immediately, the following rules are enforced for all gateway-level code. If I see a raw Python list being used for high-frequency data ingestion again, I will personally revoke your SSH access.
- BANISH THE LIST FOR RAW DATA: For any collection of more than 1,000 numeric elements, you will use array.array or a numpy array. array.array('I', ...) stores raw C integers: 4 bytes per element, contiguous, and friendly to the L1 cache.
- PRE-ALLOCATE OR DIE: If you know you are going to receive 10,000 packets, you do not append(). You initialize the list to the required size: data = [0] * 10000. One allocation up front instead of dozens of list_resize calls.
- USE SLOTS: If you must use objects, you will use __slots__ to prevent the creation of a __dict__ for every instance. This saves 100+ bytes per object.
- STRUCT MODULE: Use the struct module to pack data into binary buffers before storing or transmitting. Stop passing dictionaries over the internal bus.
- MEMORY PROFILING: No code will be merged without a memory_profiler trace showing the peak RSS during a simulated 24-hour load.
- C EXTENSIONS: If the throughput exceeds 10k msg/s, the ingestion logic will be moved to a C extension or a Rust crate. Python is for orchestration, not for bit-shoveling.
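For reference, here is a sketch of the ingest path rewritten under these rules. get_reading is stubbed and MAX_SENSORS is borrowed from the naive script; treat this as a template under those assumptions, not the final gateway code:

```python
import struct
from array import array

MAX_SENSORS = 10_000

def get_reading() -> float:
    """Stub standing in for the real sensor read."""
    return 12.5

# PRE-ALLOCATE: two contiguous buffers, 4 bytes per element, sized once.
ids = array('I', bytes(4 * MAX_SENSORS))    # zero-filled uint32 buffer
vals = array('f', bytes(4 * MAX_SENSORS))   # zero-filled float32 buffer

for i in range(MAX_SENSORS):
    ids[i] = i
    vals[i] = get_reading()

# STRUCT: one record on the internal bus is exactly 8 bytes, not a dict.
record = struct.pack('<If', ids[0], vals[0])
assert len(record) == 8

# SLOTS: if an object is unavoidable, it carries no per-instance __dict__.
class Reading:
    __slots__ = ('id', 'val')
    def __init__(self, id_: int, val: float) -> None:
        self.id = id_
        self.val = val

assert not hasattr(Reading(0, 12.5), '__dict__')
```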
I have spent twenty years making sure our hardware doesn’t fail. I will not let a “convenient” Python list undo that work because you were too lazy to think about a pointer.
Read Objects/listobject.c. Read it until you understand why we are in this mess.
— Senior Systems Engineer, Embedded/Firmware Dept.