Seventeenth in a series about migrating from legacy architectures to a modern Nuxt 4 stack.
The Inconvenient Truth About Node.js Servers
Node.js is optimized for event-driven I/O, not for long-lived servers that render thousands of pages per hour. Over time, the V8 heap grows and objects such as GraphQL responses, Vue server renderer allocations, cached strings, and Apollo Client instances accumulate. Without intervention, a production process will eventually consume all available memory and get killed by the container orchestrator.
That is not a bug to eliminate so much as a reality to manage. The real question is not whether memory will approach its limit, but how gracefully the system will handle it.
PM2 Cluster Mode: Zero-Downtime Worker Management
In a large enterprise application, instead of a single Node.js process, PM2 typically runs N worker processes — often 2–3 per container. Each worker handles requests independently, which provides two critical benefits:
- Fault isolation — if one worker crashes or becomes unresponsive, the others keep serving requests
- Rolling restarts — when a worker approaches its memory limit, PM2 restarts it while the other workers continue handling traffic
```mermaid
flowchart TB
    subgraph C["Container (2 vCPU, 4 GiB RAM)"]
        direction TB
        M[PM2 Master Process]
        subgraph W1[Worker 1]
            direction TB
            H1[V8 Heap\n~1.5 GiB\nmax-old-space-size=1536]
            R1[Handles requests\nindependently]
        end
        subgraph W2[Worker 2]
            direction TB
            H2[V8 Heap\n~1.5 GiB\nmax-old-space-size=1536]
            R2[Handles requests\nindependently]
        end
    end
    M --- W1
    M --- W2
```
When Worker 1 approaches its 1,536 MB heap limit, PM2 restarts it. Worker 2 handles traffic during the restart, which typically takes 2–3 seconds while V8 compiles the Nuxt application. For that one worker, downtime lasts a few seconds. For the overall application, it is effectively zero.
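A cluster setup like this could be described in a PM2 ecosystem file. The sketch below builds the app entry as a typed object. The app name, the script path (Nuxt's default `.output/server/index.mjs` build output), and the restart threshold are assumptions for illustration, not values taken from the load tests. PM2's `max_memory_restart` looks at the memory of the whole worker process rather than the V8 heap alone, so the threshold is set somewhat above the heap cap here.

```typescript
// Subset of the PM2 app options used in this sketch.
interface Pm2App {
  name: string;
  script: string;
  exec_mode: "cluster" | "fork";
  instances: number;
  node_args: string;
  max_memory_restart: string;
}

// Build one app entry for an ecosystem file.
// heapMb caps V8's old space per worker; restartMb is the (assumed)
// process-memory threshold at which PM2 recycles a worker.
function buildPm2App(workers: number, heapMb: number, restartMb: number): Pm2App {
  return {
    name: "nuxt-ssr",                     // hypothetical app name
    script: "./.output/server/index.mjs", // Nuxt's default server entry
    exec_mode: "cluster",                 // workers share one listening port
    instances: workers,
    node_args: `--max-old-space-size=${heapMb}`,
    max_memory_restart: `${restartMb}M`,
  };
}

// Two workers with the 1,536 MB heap cap; 1,800 MB is an illustrative
// restart threshold, not a measured value.
const ecosystem = { apps: [buildPm2App(2, 1536, 1800)] };

console.log(JSON.stringify(ecosystem, null, 2));
```

In an `ecosystem.config.js`, an object of this shape would be the value of `module.exports`.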
V8 Heap Cap: Trading Throughput for Predictability
By default, V8 sizes its heap limit dynamically from the memory available on the system. In containerized environments, that behavior is risky: the heap can grow beyond the container’s memory allocation and trigger an OOM kill.
Setting an explicit heap limit forces more aggressive garbage collection:
```
NODE_OPTIONS=--max-old-space-size=1536
```
Effect:
- Without cap: GC runs infrequently → heap grows to 3+ GiB → OOM kill
- With cap: GC runs at ~1.2 GiB → heap stays under 1.5 GiB → stable
```mermaid
flowchart LR
    A[Start] --> B[No explicit V8 heap cap]
    B --> C["Heap grows with available memory\n> 3 GiB in container"]
    C --> D[Container OOM kill]
    A --> E[Set --max-old-space-size=1536]
    E --> F[GC runs around 1.2 GiB]
    F --> G["Heap stays <= 1.5 GiB"]
    G --> H[Process stable\nSlightly lower peak throughput]
```
The trade-off is straightforward: more frequent GC pauses of 2–5 ms each reduce peak throughput by about 5%. But the process never gets OOM-killed, which is a far better outcome in production.
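The cap only protects the container if the caps of all workers together leave room for memory that `--max-old-space-size` does not cover, such as the young generation, buffers, and the PM2 master process. A minimal sketch of that arithmetic, using the container size and worker count from this article (the function and field names are illustrative):

```typescript
interface HeapBudget {
  totalHeapMb: number; // sum of all workers' old-space caps
  headroomMb: number;  // what remains for everything outside the capped heap
  heapShare: number;   // fraction of container memory committed to heaps
}

// Compare the combined heap caps of all workers with the container limit.
function heapBudget(containerMb: number, workers: number, heapCapMb: number): HeapBudget {
  const totalHeapMb = workers * heapCapMb;
  return {
    totalHeapMb,
    headroomMb: containerMb - totalHeapMb,
    heapShare: totalHeapMb / containerMb,
  };
}

// 4 GiB container, 2 workers, --max-old-space-size=1536
const budget = heapBudget(4096, 2, 1536);
console.log(budget); // { totalHeapMb: 3072, headroomMb: 1024, heapShare: 0.75 }
```

With these numbers, the two heaps can claim at most 3 GiB, which leaves 1 GiB of headroom inside the 4 GiB container.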
Memory Is the Scaling Bottleneck
Load testing for a typical Nuxt SSR frontend in a large SaaS or e-commerce platform reveals something counterintuitive: the Nuxt SSR application is often I/O-bound, not CPU-bound, and it runs short of memory long before it runs short of CPU.
```mermaid
flowchart TB
    subgraph RU[Resource Usage Under Load]
        CPU[CPU peak ~12%]
        MEM[Memory peak ~60%]
        BOT[Bottleneck: Memory, not CPU]
    end
    CPU --> BOT
    MEM --> BOT
```
SSR mostly waits for backend responses (for example, GraphQL or REST APIs) and renders HTML — I/O work that barely touches the CPU. But each in-flight request still holds response objects, VNode trees, and serialization buffers in memory. Under load, dozens of concurrent requests holding a few hundred kilobytes each add up quickly.
This means:
- Over-provisioning CPU wastes money — you pay for compute that sits idle
- Under-provisioning memory crashes the server — V8 heap exhaustion triggers cascading failures
- A good starting ratio is roughly 1 vCPU : 2 GiB RAM for SSR workloads
The Right-Sizing Experiment
In a representative production-like environment, you can right-size Node.js SSR containers by running load tests with different resource configurations:
| Configuration | Result |
|---|---|
| 4 vCPU / 8 GiB | Stable but over-provisioned |
| 2 vCPU / 4 GiB | Stable and efficient ✓ |
| 1 vCPU / 2 GiB | Cascading failures |
At 1 vCPU / 2 GiB, workers ran at 1,791 MB out of 2,048 MB — V8 was at its ceiling. Health probes timed out because the event loop was blocked by GC. The orchestrator restarted replicas, but cold-starting Nuxt takes several seconds because V8 must compile the application. During that window, the remaining replicas were overloaded, which caused them to fail health checks. The cascade continued until manual intervention.
```mermaid
sequenceDiagram
    participant R1 as Replica 1
    participant R2 as Replica 2
    participant R3 as Replica 3
    participant O as Orchestrator
    Note over R1: Memory 1791/2048 MB<br/>GC stalls<br/>Health probe timeout
    O->>R1: Mark unhealthy
    O->>R1: Restart replica
    Note over R1: Cold start (~3s)<br/>No traffic handling
    Note over R2: Now handling 1.5× traffic<br/>Memory spike
    O->>R2: Health probe timeout
    O->>R2: Restart replica
    Note over R3: Now handling 3× traffic<br/>Immediate failure
    O->>R3: Restart replica
    Note over R1,R3: All replicas restarting<br/>Zero capacity for ~10 seconds
```
In practice, the minimum viable per-replica compute for V8 startup plus Nuxt SSR in such an environment is about 2 vCPU / 4 GiB. Going below that introduces a cascading failure risk that replica count alone cannot absorb.
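The cascade can be reasoned about with a very small model. The sketch below assumes that traffic is spread evenly over the healthy replicas and that a replica fails its health probe once its load exceeds a tolerance factor. Both the model and the `overloadFactor` parameter are simplifications introduced here, not something measured in the experiment, and the model ignores replicas coming back after their cold start.

```typescript
interface CascadeResult {
  survived: boolean;     // true if the remaining replicas absorb the load
  loadFactors: number[]; // load per healthy replica relative to normal, per step
}

// Simulate what happens after one replica drops out.
// overloadFactor: how many times its normal share a replica tolerates
// before it fails its health probe (a hypothetical tuning value).
function simulateCascade(replicas: number, overloadFactor: number): CascadeResult {
  let healthy = replicas - 1; // the first replica has just been restarted
  const loadFactors: number[] = [];
  while (healthy > 0) {
    const factor = replicas / healthy; // traffic is spread evenly
    loadFactors.push(factor);
    if (factor <= overloadFactor) {
      return { survived: true, loadFactors };
    }
    healthy -= 1; // an overloaded replica fails its probe and restarts
  }
  return { survived: false, loadFactors };
}

console.log(simulateCascade(3, 1.2)); // { survived: false, loadFactors: [ 1.5, 3 ] }
console.log(simulateCascade(5, 1.3)); // { survived: true, loadFactors: [ 1.25 ] }
```

With three replicas already near their ceiling, losing one pushes the others to 1.5 times their normal load, and the failure propagates. With five replicas and a little slack, the same event is absorbed.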
Minimum Replicas: Preventing Cold-Start Cascades
Even with correctly sized replicas, starting from too few creates problems under load. The orchestrator can launch new replicas, but each one needs time to start, compile, and begin accepting requests.
For example, with 2 replicas scaling to 15, the first traffic burst hits only 2 instances. They overload while new replicas spin up. By the time those are ready, the original 2 may already have failed.
The fix is to set minReplicas high enough to handle average production traffic without scaling out. In a typical large-scale web application, values might look like this:
| Service | minReplicas | maxReplicas | Reasoning |
|---|---|---|---|
| SSR SPA | 5 | 20 | Handles page rendering (heaviest) |
| API | 3 | 20 | Handles business logic (lighter) |
```mermaid
flowchart LR
    TRAF[Average production traffic] -->|First burst| R5[5 pre-warmed SSR SPA replicas]
    R5 --> CAP["Within capacity<br/>No scale-out needed"]
    TRAF -->|Genuine spike| SO[Autoscaler triggers scale-out]
    SO --> N[New replicas starting\nNuxt compile + V8 startup]
    R5 --> BUF[Existing 5 replicas buffer traffic]
    N --> READY[New replicas ready\nTraffic distributed]
```
At 5 pre-warmed SPA replicas, normal production traffic stays within capacity and does not trigger scaling. Scale-out only activates for genuine spikes, and the existing 5 replicas buffer traffic while new ones start.
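One way to arrive at a `minReplicas` value is to divide average traffic by the throughput a single replica sustained in load tests, keep each replica below a target utilization, and add a spare for restarts. The sketch below illustrates that rule of thumb. The traffic and capacity figures, the 70% utilization target, and the single spare replica are all hypothetical inputs, not numbers from the tests described here.

```typescript
// Smallest replica count that serves `avgRps` while each replica stays at or
// below `targetUtilization` of its tested capacity, plus `spare` replicas so
// that losing one during a restart does not overload the rest.
function minReplicasFor(
  avgRps: number,
  perReplicaRps: number,
  targetUtilization = 0.7,
  spare = 1,
): number {
  if (perReplicaRps <= 0 || targetUtilization <= 0) {
    throw new RangeError("capacity and utilization must be positive");
  }
  return Math.ceil(avgRps / (perReplicaRps * targetUtilization)) + spare;
}

// Hypothetical figures: 250 req/s on average, 100 req/s per replica.
console.log(minReplicasFor(250, 100)); // 5
```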
Health Monitoring
The application exposes a health endpoint that returns per-worker metrics, enabling the orchestrator and internal tools to see exactly what PM2 workers are doing:
```
GET /api/health/pm2
```
Response:
```json
{
  "workers": [
    {
      "id": 0,
      "cpu": 8.2,
      "memory": 1234567890,
      "restarts": 3,
      "uptime": 86400000,
      "status": "online"
    },
    {
      "id": 1,
      "cpu": 5.1,
      "memory": 987654321,
      "restarts": 1,
      "uptime": 72000000,
      "status": "online"
    }
  ]
}
```
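A response of this shape could be derived from the process descriptions PM2 reports for each worker. The sketch below is one possible mapping and not necessarily how the endpoint is implemented. The `Pm2Process` type is declared locally and mirrors only the fields assumed to come from PM2's process list (`pm_id`, `monit`, and `pm2_env`), so the example has no dependencies.

```typescript
// Subset of the fields PM2 reports per process, typed locally.
interface Pm2Process {
  pm_id: number;
  monit: { cpu: number; memory: number }; // CPU in percent, memory in bytes
  pm2_env: { restart_time: number; pm_uptime: number; status: string };
}

interface WorkerHealth {
  id: number;
  cpu: number;
  memory: number;
  restarts: number;
  uptime: number; // milliseconds since the worker started
  status: string;
}

// Convert PM2's process descriptions into the health response shape.
function toHealthResponse(processes: Pm2Process[], now: number): { workers: WorkerHealth[] } {
  return {
    workers: processes.map((p) => ({
      id: p.pm_id,
      cpu: p.monit.cpu,
      memory: p.monit.memory,
      restarts: p.pm2_env.restart_time,
      uptime: now - p.pm2_env.pm_uptime, // pm_uptime is the start timestamp
      status: p.pm2_env.status,
    })),
  };
}

// Sample input with made-up timestamps.
const sampleProcesses: Pm2Process[] = [
  {
    pm_id: 0,
    monit: { cpu: 8.2, memory: 1234567890 },
    pm2_env: { restart_time: 3, pm_uptime: 1_000, status: "online" },
  },
];

console.log(JSON.stringify(toHealthResponse(sampleProcesses, 86_401_000)));
```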
The endpoint is protected by an internal API guard — it returns 404 for any caller that is not a health probe or internal service with the correct authorization header. External callers cannot even discover that it exists.
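The guard's decision can be sketched as a pure function. The `authorization` header name with a `Bearer` scheme and the shared token are assumptions made for this example; the actual header format used by the application is not specified here.

```typescript
type RequestHeaders = Record<string, string | undefined>;

// Decide the status code for a request to the health endpoint.
// Returning 404 (rather than 401 or 403) for unauthorized callers keeps the
// route indistinguishable from one that does not exist.
function healthGuardStatus(headers: RequestHeaders, expectedToken: string): 200 | 404 {
  const presented = headers["authorization"];
  if (expectedToken.length === 0 || presented === undefined) {
    return 404; // never authorize when no secret is configured or presented
  }
  return presented === `Bearer ${expectedToken}` ? 200 : 404;
}

console.log(healthGuardStatus({ authorization: "Bearer s3cret" }, "s3cret")); // 200
console.log(healthGuardStatus({}, "s3cret"));                                 // 404
```

A production guard would likely compare the token in constant time, for example with Node's `crypto.timingSafeEqual`, rather than with `===`.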
The Validated Configuration
After extensive load testing in a realistic production scenario, a configuration like the following has proven to pass all thresholds:
```text
Per Container:
  CPU:    2 vCPU
  Memory: 4 GiB
  PM2:    2 workers per container
  V8:     --max-old-space-size=1536 per worker

Scaling:
  SPA: min 5, max 20 replicas
  API: min 3, max 20 replicas

Result at 6× production load:
  Median response time: 165 ms
  Error rate:           0.82%
  CPU peak:             12% of allocation
  Memory peak:          60% of allocation
```
```mermaid
flowchart TB
    subgraph PC[Per Container]
        CPU[CPU: 2 vCPU]
        MEM[Memory: 4 GiB]
        PM2W[PM2: 2 workers per container]
        V8[V8: --max-old-space-size=1536 per worker]
    end
    subgraph SC[Scaling]
        SPA[SPA: min 5, max 20 replicas]
        API[API: min 3, max 20 replicas]
    end
    subgraph RES[Result at 6× production load]
        RT[Median response time: 165 ms]
        ER[Error rate: 0.82%]
        CPUU[CPU peak: 12% of allocation]
        MEMU[Memory peak: 60% of allocation]
    end
    PC --> SC --> RES
```
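Figures such as the median response time and the error rate are typically computed from the raw request samples of a load test run and compared against pass/fail thresholds. The sketch below shows one way to do that. The `Sample` shape and the threshold values are hypothetical, since the thresholds used in these tests are not stated.

```typescript
interface Sample {
  durationMs: number;
  ok: boolean;
}

// Median response time (mean of the two middle values for even counts).
function medianMs(samples: Sample[]): number {
  if (samples.length === 0) throw new RangeError("no samples");
  const sorted = samples.map((s) => s.durationMs).sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Share of failed requests, in percent.
function errorRatePercent(samples: Sample[]): number {
  if (samples.length === 0) throw new RangeError("no samples");
  const failed = samples.filter((s) => !s.ok).length;
  return (failed / samples.length) * 100;
}

// True if the run stays within both (hypothetical) thresholds.
function passesThresholds(samples: Sample[], maxMedianMs: number, maxErrorPercent: number): boolean {
  return medianMs(samples) <= maxMedianMs && errorRatePercent(samples) <= maxErrorPercent;
}

// Tiny made-up run: four requests, one of them failed.
const run: Sample[] = [
  { durationMs: 120, ok: true },
  { durationMs: 165, ok: true },
  { durationMs: 210, ok: true },
  { durationMs: 400, ok: false },
];

console.log(medianMs(run));                   // 187.5
console.log(errorRatePercent(run));           // 25
console.log(passesThresholds(run, 200, 1));   // false
```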
Lessons Learned
Node.js is not a “fire and forget” runtime
Unlike compiled languages with deterministic memory management, Node.js requires active memory management for long-lived processes. V8 heap caps, PM2 restarts, and minimum replica sizing are not optimizations — they are necessities.
Size for memory, not CPU
SSR workloads are I/O-bound. The CPU spends most of its time waiting for backend responses. Provision memory generously and CPU conservatively. A 1:2 vCPU:GiB ratio is a solid starting point.
Cold starts are the hidden enemy of auto-scaling
Auto-scaling sounds effortless until you realize new replicas take several seconds to become productive. During that window, existing replicas have to absorb the load. If they cannot, cascading failures follow. Adequate minReplicas removes that risk.
Load test the production configuration, not a more generous one
It is tempting to load test with generous resources and right-size later. But right-sizing can expose failure modes that do not exist at larger sizes. Always load test the production configuration, not a more generous version.
What’s Next
- Article 11: Multi-Environment Infrastructure — Azure Container Apps and the Configuration System — Managing three environments with generated configuration.
- Article 12: Security in a Nuxt SSR App — CSRF, Azure AD, CSP, and More — The security layers that protect a server-rendered application.
- Article 13: Observability and Distributed Tracing — Application Insights End-to-End — How every request is traced from the reverse proxy through the application to the backend.
Munir Husseini is a software architect specializing in full-stack TypeScript, .NET, and cloud-native architectures.