INCIDENT REPORT #882-B: Why our migration to the ‘Modern Stack’ nearly bankrupted the company.
DATE: October 14, 2023
TO: Executive Steering Committee, Engineering Leads
FROM: Lead Architect (Recovery Team)
SUBJECT: Post-Mortem of “Project Phoenix” (The Great Migration)
TOTAL LOSS: $50,420,000 USD (Estimated)
TIMELINE: January 2022 – June 2023
1. Incident Report Summary
Project Phoenix was supposed to be our “digital transformation.” We took a stable, boring, revenue-generating Node.js monolith and decided to “modernize” it. The goal was to move to a distributed microservices architecture using every buzzword that trended on Twitter in 2021.
By Q3 2022, the system was effectively dead. We were spending $450,000 a month on AWS bills for a system that handled 40% less traffic than the old monolith. The “Modern Stack” was a house of cards built on top of shifting sand. We saw 99th percentile latency spike from 200ms to 4,500ms. We lost $12M in direct sales during the Black Friday window because the “Event-Driven Architecture” resulted in a circular dependency that deadlocked our entire inventory database.
The following report outlines the technical idiocy that led to this failure. I have spent the last 18 months gutting the “clever” code and replacing it with things that actually work. If you are looking for a “JavaScript best practices” guide that involves 15 layers of abstraction, close this tab. This is about survival.
2. The Microservices Mirage and the Distributed Monolith
The first mistake was the “Microservices First” mandate. We took a perfectly functional business logic layer and sliced it into 42 separate repositories. Each service had its own boilerplate, its own CI/CD pipeline, and its own set of bugs.
We were told this would “decouple” the teams. Instead, it created a distributed monolith where no single service could run without six others being online. We replaced a local function call—which takes nanoseconds—with an HTTP/2 request that took 50ms, plus DNS lookup, plus TLS handshake, plus the overhead of the “Service Mesh.”
The Over-Engineered Mess:
// service-inventory/src/middleware/auth-wrapper.ts
// This was repeated in 42 services.
import { AuthService } from '@company/internal-sdk';

export const validateRequest = async (req, res, next) => {
  try {
    const token = req.headers.authorization;
    // Every single internal request triggered an external API call
    // to an Auth service that was already struggling.
    const user = await AuthService.verify(token);
    req.user = user;
    next();
  } catch (e) {
    res.status(401).json({ error: 'Unauthorised' });
  }
};
The Correction:
We moved back to a modular monolith. We used a simple, shared library for JWT verification that didn’t require a network hop for every single internal call.
// core/auth.js
const jwt = require('jsonwebtoken');

const PUBLIC_KEY = process.env.JWT_PUBLIC_KEY;

// Simple, synchronous, fast. No network hop.
exports.verifyToken = (token) => {
  return jwt.verify(token, PUBLIC_KEY, { algorithms: ['RS256'] });
};
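For completeness, here is how a local verifier slots into Express-style middleware. This is a minimal, self-contained sketch, not our production code: the JWT call is stubbed out so it runs anywhere, and the names are illustrative.

```javascript
// Express-style middleware over a local, synchronous verifier (sketch).
// verifyToken is a stand-in for the shared-library jwt.verify call --
// the point is that authentication no longer requires a network round-trip.
const verifyToken = (token) => {
  if (token !== 'valid-token') throw new Error('bad token'); // stub for jwt.verify
  return { id: 'u1' };
};

const authenticate = (req, res, next) => {
  try {
    req.user = verifyToken(req.headers.authorization);
    next();
  } catch (e) {
    res.status(401).json({ error: 'Unauthorised' });
  }
};

// Exercise it with plain objects -- no server needed.
const req = { headers: { authorization: 'valid-token' } };
const res = { status: () => ({ json: () => {} }) };
authenticate(req, res, () => {
  console.log(req.user.id); // → 'u1'
});
```

The same function works unchanged in tests and in every route, which is exactly what the 42 copied wrappers never managed.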
We stopped pretending we were Google. We don’t have 10,000 engineers. We have 50. A monolith is not a “legacy” pattern; it is a “we want to actually ship features” pattern.
3. The Barrel File Bottleneck and the Node.js Startup Death Spiral
By Q4 2022, our Lambda functions were timing out before they even started executing code. We were using Node.js v20.11.1, and our “Cold Start” times were exceeding 10 seconds. Why? Because some “Senior” developer thought it would be “clean” to use barrel files (index.ts) for everything.
Every time we imported one utility function, Node.js had to crawl the entire file tree, parsing thousands of lines of code that weren’t even being used. V8 spent more time in the Loading phase than the Executing phase.
The Terminal Log of Shame:
$ node --trace-event-categories v8,node.module_timer index.js
[2.451s] node:internal/modules/cjs/loader:452: load_request_event_start
[8.922s] node:internal/modules/cjs/loader:890: load_request_event_end
# Total startup time: 9.2 seconds.
# 85% of time spent parsing unused barrel files.
The “clean code” crowd will tell you that best practice is an index.ts in every folder. In reality, it is a performance suicide note.
The Correction:
We banned barrel files. We imported exactly what we needed from the specific file.
Before:
import { formatDate } from '@/utils'; // This imports 400 other utilities and 20 heavy libraries.
After:
import { formatDate } from '@/utils/date-formatter.js'; // Imports exactly 15 lines of code.
Startup time dropped from 9 seconds to 400ms. Stop making the V8 engine do work it doesn’t need to do.
4. Memory Leaks and the V8 Garbage Collector Nightmare
In March 2023, the production API started crashing every 4 hours. We looked at the metrics. Memory usage was a literal staircase. We were hitting the 4GB heap limit on our containers and getting OOMKilled.
The culprit? A “clever” use of WeakMap and Proxy objects to build a “reactive” caching layer that was supposed to “automatically” invalidate data. The developers thought they were being smart. They were actually just preventing the V8 Garbage Collector from doing its job.
The Heap Snapshot Analysis:
Snapshot 1: 150MB
Snapshot 2: 850MB (after 1 hour)
Snapshot 3: 2.4GB (after 2 hours)

Top Retainers:
(array) @123456 - 45% of heap
  -> Proxy @78901
    -> Map @11223
      -> "massive_api_response_string"
The V8 engine’s “Scavenger” (Young Generation GC) was running every 2 seconds, consuming 30% of the CPU, trying to find memory it could free. But because of the circular references in the Proxy-based cache, the “Mark-Sweep-Compact” (Old Generation GC) couldn’t reclaim anything.
The “Clever” (Broken) Code:
const cache = new WeakMap();

// This was supposed to "magically" clean up.
// It didn't, because the keys were objects that never went out of scope,
// so the WeakMap entries were never eligible for collection.
export const getCachedData = (keyObject, fetcher) => {
  if (cache.has(keyObject)) return cache.get(keyObject);
  const data = fetcher();
  const proxy = new Proxy(data, {
    get(target, prop) {
      console.log(`Accessing ${prop}`);
      return target[prop];
    }
  });
  cache.set(keyObject, proxy);
  return proxy;
};
The Correction:
We deleted the “reactive” cache. We used a standard LRU (Least Recently Used) cache with a fixed size.
const { LRUCache } = require('lru-cache');

const cache = new LRUCache({ max: 500, ttl: 1000 * 60 * 5 });

exports.getCachedData = async (key, fetcher) => {
  const cached = cache.get(key);
  if (cached !== undefined) return cached;
  const data = await fetcher();
  cache.set(key, data);
  return data;
};
The memory staircase vanished. The CPU usage dropped by 40%. Don’t try to outsmart the V8 engine. It has been optimized by people much smarter than you. If you use a Proxy in a high-throughput path, you are probably making a mistake.
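The Proxy overhead is easy to demonstrate for yourself. A self-contained micro-benchmark sketch (absolute numbers are machine-dependent; the get trap mirrors the one in the broken cache):

```javascript
// Micro-benchmark (sketch): plain property access vs Proxy-trapped access.
const plain = { value: 42 };
const proxied = new Proxy(plain, {
  // Same shape of trap as the broken cache, minus the console.log.
  get(target, prop) {
    return target[prop];
  }
});

function bench(obj, iterations) {
  const start = process.hrtime.bigint();
  let sum = 0;
  for (let i = 0; i < iterations; i++) sum += obj.value;
  const ms = Number(process.hrtime.bigint() - start) / 1e6;
  return { sum, ms };
}

const N = 1_000_000;
console.log('plain :', bench(plain, N).ms.toFixed(1), 'ms');
console.log('proxy :', bench(proxied, N).ms.toFixed(1), 'ms');
```

Every read through the Proxy is a trap invocation that V8 cannot fully optimize away. In a hot path that runs millions of times per minute, that tax adds up.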
5. The any Pandemic and the TypeScript Lie
We were told TypeScript would make the code “safe.” Instead, the team used any every time they encountered a slightly complex type. We had 4,200 instances of any in the codebase. It was just JavaScript with extra steps and longer compile times.
The worst part was the “Type Casting” of API responses. We were casting raw JSON from external services directly into interfaces without any validation.
The Disaster Code:
interface UserProfile {
  id: string;
  email: string;
  settings: {
    theme: 'light' | 'dark';
  };
}

const loadProfile = async (id: string): Promise<UserProfile> => {
  const response = await fetch(`/api/users/${id}`);
  const data = await response.json();
  return data as UserProfile; // The "Lie"
};

// Somewhere else in the code...
const profile = await loadProfile('123');
console.log(profile.settings.theme); // CRASH: Cannot read property 'theme' of undefined
When the external API changed its response format, the entire frontend and backend crashed because we trusted the “type.” This cost us $2M in lost orders over a single weekend.
The Correction:
We enforced a strict “No any” rule in tsconfig.json. We introduced Zod for runtime validation. If the data doesn’t match the schema at the boundary, the system fails gracefully with a logged error, rather than exploding deep in the business logic.
import { z } from 'zod';

const UserProfileSchema = z.object({
  id: z.string(),
  email: z.string().email(),
  settings: z.object({
    theme: z.enum(['light', 'dark']),
  }),
});

const loadProfile = async (id: string) => {
  const response = await fetch(`/api/users/${id}`);
  const data = await response.json();
  const result = UserProfileSchema.safeParse(data);
  if (!result.success) {
    throw new Error('Invalid API response');
  }
  return result.data;
};
TypeScript is not a replacement for runtime validation. If you aren’t validating your boundaries, your types are just documentation that can—and will—lie to you.
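For reference, the compiler side of the “No any” rule is mostly a single flag. A sketch of the tsconfig.json excerpt (strict already implies noImplicitAny; noUncheckedIndexedAccess is a separate flag we also turned on, and banning *explicit* any is an ESLint job via @typescript-eslint/no-explicit-any, not a compiler option):

```jsonc
// tsconfig.json (excerpt, sketch) -- "strict" enables noImplicitAny,
// strictNullChecks, and friends in one shot.
{
  "compilerOptions": {
    "strict": true,
    "noUncheckedIndexedAccess": true
  }
}
```

The compiler catches implicit any; the lint rule catches the developer who types it on purpose. You need both.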
6. Redux, Deep Cloning, and the State Management Bloat
In our React frontend, the state management was a dumpster fire. The accepted “best practice” at the time was to keep everything in a single Redux store. We had a global state object that was 12MB of nested JSON.
Every time a user typed a single character into a search box, we were running a reducer that rebuilt that 12MB object with the spread operator. Spread is a shallow copy, not a deep clone, but allocating a fresh top-level state object on every keystroke still invalidated every memoized selector and forced re-renders across the tree.
The Performance Killer:
// reducer.js
case 'UPDATE_SEARCH_TERM':
  return {
    ...state, // a new top-level copy of the 12MB state on every keystroke
    search: {
      ...state.search,
      term: action.payload
    }
  };
On a mid-range mobile device, this caused a 150ms “jank” on every keystroke. The UI felt like it was stuck in molasses. V8’s garbage collector was working overtime to clean up the thousands of discarded state objects.
The Correction:
We moved UI-specific state (like search terms) into local component state. For the global state, we stopped the deep-nesting madness. We normalized the data.
// UI-only state lives in the component, not the global store.
const [searchTerm, setSearchTerm] = useState('');

// Global state is normalized: flat, id-keyed entities instead of a giant tree.
case 'UPDATE_ENTITY':
  return {
    ...state,
    entities: {
      ...state.entities,
      [action.id]: action.data
    }
  };
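Normalizing just means flattening a nested list into an id-keyed map plus an ordering array. A minimal, framework-free sketch (the data shape is illustrative):

```javascript
// Normalize a list of records into flat entities (sketch).
// Updating one record now touches one key, not a deeply nested tree.
function normalize(items) {
  const entities = {};
  const ids = [];
  for (const item of items) {
    entities[item.id] = item;
    ids.push(item.id);
  }
  return { entities, ids };
}

const { entities, ids } = normalize([
  { id: 'a1', name: 'Widget' },
  { id: 'b2', name: 'Gadget' },
]);
console.log(ids);              // → ['a1', 'b2']
console.log(entities.a1.name); // → 'Widget'
```

With this shape, the UPDATE_ENTITY reducer above spreads two small objects instead of cloning the whole tree.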
We also stopped using Redux for server-side data and moved to TanStack Query. This reduced our total “State Management” code by 70%.
7. Dependency Hell and the 2GB node_modules Folder
By the end of the migration, our package-lock.json was 45,000 lines long. We had 1,400 dependencies. Every time we ran npm install (using npm v10.2.4), it was a gamble.
One morning, the build pipeline failed because a sub-dependency of a sub-dependency of a “pretty-logger” library had been unpublished. The entire company stopped for 6 hours.
The npm audit Reality Check:
$ npm audit
# 124 vulnerabilities found (32 high, 12 critical)
# Run `npm audit fix` to... [DO NOT DO THIS, IT WILL BREAK EVERYTHING]
We were using heavy libraries for things that Node.js now does natively. We had request, axios, and node-fetch all in the same project because different “microservices” used different templates.
The Correction:
We audited every single dependency. If it could be done with the Node.js standard library, we deleted the dependency.
- Deleted moment.js: Replaced with Intl.DateTimeFormat.
- Deleted lodash: Replaced with native Array.prototype.map, filter, reduce, and Object.entries.
- Deleted axios: Replaced with the native fetch API (available in Node 20).
- Deleted chalk: Replaced with native ANSI escape codes for the few logs that actually needed color.
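The native replacements are not hard. A short sketch of two of them (formatted output depends on the ICU data your Node build ships with):

```javascript
// Intl.DateTimeFormat instead of moment.js -- built into Node, zero deps.
const fmt = new Intl.DateTimeFormat('en-US', { dateStyle: 'medium' });
console.log(fmt.format(new Date(2023, 9, 14))); // e.g. "Oct 14, 2023"

// Object.entries + reduce instead of lodash helpers.
const counts = { a: 2, b: 3 };
const total = Object.entries(counts).reduce((sum, [, n]) => sum + n, 0);
console.log(total); // → 5
```

Each deleted dependency is one less transitive tree to audit, one less unpublish that can halt the build.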
We reduced the node_modules size from 2.2GB to 180MB. The build time went from 12 minutes to 2 minutes.
8. The Cost of “Clever” Code
The underlying theme of this $50M disaster was the desire for developers to feel “clever.”
They used Reflect.metadata for dependency injection in a project that didn’t need it. They used AsyncLocalStorage to pass context through 15 layers of middleware, making debugging a nightmare because the stack traces were incomprehensible. They used InversifyJS to solve a problem that simple constructor injection would have solved in three lines of code.
The “Clever” DI Mess:
@injectable()
class OrderProcessor {
  constructor(@inject(TYPES.Database) private db: IDatabase) {}
  // ...
}
The Sane Reality:
class OrderProcessor {
  constructor(db) {
    this.db = db;
  }
}

const processor = new OrderProcessor(databaseInstance);
The latter is readable, testable, and doesn’t require a PhD in Decorators to understand.
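The payoff shows up in tests. With plain constructor injection you hand the class a hand-rolled fake and you are done; no container, no decorators. A self-contained sketch (the class is redefined here with an illustrative method so the example runs on its own):

```javascript
// Plain constructor injection makes testing trivial (sketch; names illustrative).
class OrderProcessor {
  constructor(db) {
    this.db = db;
  }
  async process(orderId) {
    const order = await this.db.findOrder(orderId);
    return { ...order, status: 'processed' };
  }
}

// In a test: a fake database is just an object literal.
const fakeDb = { findOrder: async (id) => ({ id, total: 99 }) };

new OrderProcessor(fakeDb).process('ord-1').then((result) => {
  console.log(result.status); // → 'processed'
});
```

No TYPES symbol table, no reflect-metadata polyfill, no container configuration to keep in sync with reality.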
Conclusion: What “JavaScript Best Practices” Actually Mean
After 18 months of cleaning up this mess, I can tell you what JavaScript best practices actually look like. They aren’t flashy. They don’t get you 1,000 likes on LinkedIn.
- Prefer Simplicity over Abstraction: If you can solve it with a function, don’t use a class. If you can solve it with a class, don’t use a framework.
- Validate at the Boundaries: Use Zod or Joi. Trust nothing that comes from an API or a database.
- Monitor the V8 Engine: If your memory usage is climbing, you have a leak. Don’t just increase the RAM in your Dockerfile. Find the leak.
- Avoid Hype-Driven Development: Just because a new library is trending doesn’t mean it’s ready for production.
- Keep the Dependency Tree Lean: Every dependency is a potential security hole and a maintenance burden.
- Monoliths are Fine: Unless you are at a scale where you literally cannot store the code in one git repo, you probably don’t need microservices.
We are finally back in the black. The system is stable. The latency is low. But we paid $50M to learn a lesson we should have already known: Code is a liability, not an asset. The less of it you have, the better off you are.
Now, if you’ll excuse me, I have to go delete another 5,000 lines of “clever” code.
SIGNED,
A Very Tired Architect
(Node.js v20.11.1, npm v10.2.4, V8 Engine survivor)