Memory, Stability, and PM2 — Running a Long-Lived Node.js Server

Tenth in a series about migrating from legacy architectures to a modern Nuxt 4 stack.

The Inconvenient Truth About Node.js Servers

Node.js is optimized for event-driven I/O, not for long-lived servers that render thousands of pages per hour. Over time, the V8 heap grows and objects such as GraphQL responses, Vue server renderer allocations, cached strings, and Apollo Client instances accumulate. Without intervention, a production process will eventually consume all available memory and get killed by the container orchestrator.

That is not a bug to eliminate so much as a reality to manage. The real question is not whether memory will approach its limit, but how gracefully the system will handle it.

PM2 Cluster Mode: Zero-Downtime Worker Management

In a large enterprise application, instead of a single Node.js process, PM2 typically runs N worker processes — often 2–3 per container. Each worker handles requests independently, which provides two critical benefits:

Fault isolation — if one worker crashes or becomes unresponsive, the others keep serving requests
Rolling restarts — when a worker approaches its memory limit, PM2 restarts it while the other workers continue handling traffic

flowchart TB
    subgraph C["Container (2 vCPU, 4 GiB RAM)"]
        direction TB
        M[PM2 Master Process]

        subgraph W1[Worker 1]
            direction TB
            H1[V8 Heap\n~1.5 GiB\nmax-old-space-size=1536]
            R1[Handles requests\nindependently]
        end

        subgraph W2[Worker 2]
            direction TB
            H2[V8 Heap\n~1.5 GiB\nmax-old-space-size=1536]
            R2[Handles requests\nindependently]
        end
    end

    M --- W1
    M --- W2

When Worker 1 approaches its 1,536 MB heap cap, PM2 restarts it. Worker 2 handles traffic during the restart, which typically takes 2–3 seconds while V8 compiles the Nuxt application. The restarting worker is unavailable for those few seconds, but for the application as a whole the downtime is effectively zero.
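A setup like this is usually described in a PM2 ecosystem file. The sketch below builds such an app definition in TypeScript. The app name, the `.output/server/index.mjs` entry point (Nuxt's default build output), and the choice to reuse the heap cap as the `max_memory_restart` threshold are assumptions for illustration, not the configuration used in this project.

```typescript
// Subset of PM2 app options used in this sketch.
interface Pm2App {
  name: string;
  script: string;
  exec_mode: "cluster" | "fork";
  instances: number;
  node_args: string;
  max_memory_restart: string;
}

// Builds one entry for the `apps` array of a PM2 ecosystem file.
// `workers` and `heapMb` mirror the diagram above: 2 workers, 1536 MB each.
function buildPm2App(workers: number, heapMb: number): Pm2App {
  return {
    name: "nuxt-ssr", // hypothetical app name
    script: ".output/server/index.mjs", // assumed entry point
    exec_mode: "cluster",
    instances: workers,
    node_args: `--max-old-space-size=${heapMb}`,
    // Assumption: PM2 checks this threshold against the worker's process
    // memory, which covers more than the V8 old space, so the right value
    // should be tuned from measurements and not copied from the heap cap.
    max_memory_restart: `${heapMb}M`,
  };
}

const ecosystem = { apps: [buildPm2App(2, 1536)] };
console.log(JSON.stringify(ecosystem, null, 2));
```

The resulting object would be exported from the ecosystem file that PM2 is started with.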

V8 Heap Cap: Trading Throughput for Predictability

By default, V8 uses a dynamic heap limit that grows based on available system memory. In containerized environments, that behavior is risky — V8 can grow beyond the container’s memory allocation and trigger an OOM kill.

Setting an explicit heap limit forces more aggressive garbage collection:

NODE_OPTIONS=--max-old-space-size=1536

Effect:
  Without cap:  GC runs infrequently → heap grows to 3+ GiB → OOM kill
  With cap:     GC runs at ~1.2 GiB → heap stays under 1.5 GiB → stable
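One way to confirm that the cap is actually in effect is to ask V8 for its heap statistics at runtime. The following is a minimal sketch using Node's built-in `node:v8` module. The helper name `heapSummary` is made up for this example, and the reported limit is typically somewhat above the `--max-old-space-size` value because it covers more than the old generation.

```typescript
import { getHeapStatistics } from "node:v8";

interface HeapSummary {
  limitMb: number; // total heap size limit reported by V8
  usedMb: number; // heap currently in use
  usedRatio: number; // usedMb / limitMb
}

// Reads the current V8 heap statistics and converts them to megabytes.
function heapSummary(): HeapSummary {
  const stats = getHeapStatistics();
  const toMb = (bytes: number): number => bytes / (1024 * 1024);
  return {
    limitMb: toMb(stats.heap_size_limit),
    usedMb: toMb(stats.used_heap_size),
    usedRatio: stats.used_heap_size / stats.heap_size_limit,
  };
}

const summary = heapSummary();
console.log(
  `heap: ${summary.usedMb.toFixed(0)} MB used of ${summary.limitMb.toFixed(0)} MB limit`
);
```

Logging this once at startup makes it easy to spot a worker that was launched without the intended flag.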

flowchart LR
    A[Start] --> B[No explicit V8 heap cap]
    B --> C["Heap grows with available memory\n&gt; 3 GiB in container"]
    C --> D[Container OOM kill]

    A --> E[Set --max-old-space-size=1536]
    E --> F[GC runs around 1.2 GiB]
    F --> G["Heap stays &lt;= 1.5 GiB"]
    G --> H[Process stable\nSlightly lower peak throughput]

The trade-off is straightforward: more frequent GC pauses of 2–5 ms each reduce peak throughput by about 5%. In exchange, the heap stays within the container's budget and the process avoids being OOM-killed, which is a far better outcome in production.
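The arithmetic behind the 1,536 MB figure can be checked against the container size. The sketch below assumes, as in the earlier diagram, 2 workers in a 4 GiB container. The function name is illustrative, and how much headroom is enough for buffers, code, stacks, and the PM2 master process is something to measure, not a fixed rule.

```typescript
// Memory left in the container after every worker has filled its heap cap.
// A negative result means the heap caps alone over-commit the container.
function nonHeapHeadroomMb(
  containerMb: number,
  workers: number,
  heapCapMb: number
): number {
  return containerMb - workers * heapCapMb;
}

const headroom = nonHeapHeadroomMb(4096, 2, 1536);
console.log(`headroom: ${headroom} MB`); // 4096 - 2 * 1536 = 1024
```

Running the same check for a 2 GiB container with the same two caps gives a negative number, which is one way to see why the smaller size is fragile.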

Memory Is the Scaling Bottleneck

Load testing for a typical Nuxt SSR frontend in a large SaaS or e-commerce platform reveals something counterintuitive: the Nuxt SSR application is often memory-bound, not CPU-bound.

flowchart TB
    subgraph RU[Resource Usage Under Load]
        CPU[CPU peak ~12%]
        MEM[Memory peak ~60%]
        BOT[Bottleneck: Memory, not CPU]
    end

    CPU --> BOT
    MEM --> BOT

SSR mostly waits for backend responses (for example, GraphQL or REST APIs) and renders HTML — I/O work that barely touches the CPU. But each in-flight request still holds response objects, VNode trees, and serialization buffers in memory. Under load, dozens of concurrent requests holding a few hundred kilobytes each add up quickly.

This means:

Over-provisioning CPU wastes money — you pay for compute that sits idle
Under-provisioning memory crashes the server — V8 heap exhaustion triggers cascading failures
A good starting ratio is roughly 1 vCPU : 2 GiB RAM for SSR workloads

The Right-Sizing Experiment

In a representative production-like environment, you can right-size Node.js SSR containers by running load tests with different resource configurations:

Configuration	Result
4 vCPU / 8 GiB	Stable but over-provisioned
2 vCPU / 4 GiB	Stable and efficient ✓
1 vCPU / 2 GiB	Cascading failures

At 1 vCPU / 2 GiB, workers ran at 1,791 MB out of 2,048 MB — V8 was at its ceiling. Health probes timed out because the event loop was blocked by GC. The orchestrator restarted replicas, but cold-starting Nuxt takes several seconds because V8 must compile the application. During that window, the remaining replicas were overloaded, which caused them to fail health checks. The cascade continued until manual intervention.

sequenceDiagram
    participant R1 as Replica 1
    participant R2 as Replica 2
    participant R3 as Replica 3
    participant O as Orchestrator

    Note over R1: Memory 1791/2048 MB<br/>GC stalls<br/>Health probe timeout
    O->>R1: Mark unhealthy
    O->>R1: Restart replica
    Note over R1: Cold start (~3s)<br/>No traffic handling

    Note over R2: Now handling 2× traffic<br/>Memory spike
    O->>R2: Health probe timeout
    O->>R2: Restart replica

    Note over R3: Now handling 3× traffic<br/>Immediate failure
    O->>R3: Restart replica

    Note over R1,R3: All replicas restarting<br/>Zero capacity for ~10 seconds

In practice, the minimum viable per-replica compute for V8 startup plus Nuxt SSR in such an environment is about 2 vCPU / 4 GiB. Going below that introduces a cascading failure risk that replica count alone cannot absorb.
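The cascade can be illustrated with a deliberately crude model: total load is spread evenly across healthy replicas, and any replica pushed past its capacity is treated as failed. The capacity and load figures below are invented for illustration, and the model ignores restarts and recovery, so it shows the shape of the failure and not measured behavior.

```typescript
// Returns how many replicas remain healthy once load has been redistributed.
// Each loop iteration represents one more replica failing under overload.
function survivingReplicas(
  totalLoad: number, // total requests per second (hypothetical)
  capacityPerReplica: number, // sustainable requests per second per replica
  replicas: number,
  initiallyDown: number
): number {
  let healthy = replicas - initiallyDown;
  while (healthy > 0 && totalLoad / healthy > capacityPerReplica) {
    healthy -= 1;
  }
  return healthy;
}

// Three replicas with little spare capacity: losing one takes down the rest.
console.log(survivingReplicas(240, 100, 3, 1)); // 0
// Same replicas with more spare capacity: the remaining two absorb the load.
console.log(survivingReplicas(180, 100, 3, 1)); // 2
```

The difference between the two calls is only the spare capacity per replica, which is what right-sizing each replica buys.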

Minimum Replicas: Preventing Cold-Start Cascades

Even with correctly sized replicas, starting from too few creates problems under load. The orchestrator can launch new replicas, but each one needs time to start, compile, and begin accepting requests.

For example, with 2 replicas scaling to 15, the first traffic burst hits only 2 instances. They overload while new replicas spin up. By the time those are ready, the original 2 may already have failed.

The fix is to set minReplicas high enough to handle average production traffic without scaling out. In a typical large-scale web application, values might look like this:

Service	minReplicas	maxReplicas	Reasoning
SSR SPA	5	20	Handles page rendering (heaviest)
API	3	20	Handles business logic (lighter)

flowchart LR
    TRAF[Average production traffic] -->|First burst| R5[5 pre-warmed SSR SPA replicas]
    R5 --> CAP["Within capacity<br/>No scale-out needed"]

    TRAF -->|Genuine spike| SO[Autoscaler triggers scale-out]
    SO --> N[New replicas starting\nNuxt compile + V8 startup]
    R5 --> BUF[Existing 5 replicas buffer traffic]
    N --> READY[New replicas ready\nTraffic distributed]

At 5 pre-warmed SPA replicas, normal production traffic stays within capacity and does not trigger scaling. Scale-out only activates for genuine spikes, and the existing 5 replicas buffer traffic while new ones start.
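A rough way to pick a floor for minReplicas is to divide average traffic by what one replica can sustain and add headroom. The function below is a sketch of that arithmetic. The request rates and the 25% headroom factor are assumptions for illustration, not figures from the load tests described here.

```typescript
// Smallest replica count that covers average traffic with some headroom.
function minReplicasFor(
  avgRps: number, // average production requests per second (hypothetical)
  perReplicaRps: number, // what one replica sustains comfortably
  headroom: number // e.g. 1.25 for 25% spare capacity
): number {
  if (perReplicaRps <= 0) {
    throw new RangeError("perReplicaRps must be positive");
  }
  return Math.max(1, Math.ceil((avgRps / perReplicaRps) * headroom));
}

console.log(minReplicasFor(400, 100, 1.25)); // 5
```

The per-replica figure should come from load testing the production-sized container, since it changes with the resource configuration.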

Health Monitoring

The application exposes a health endpoint that returns per-worker metrics, enabling the orchestrator and internal tools to see exactly what PM2 workers are doing:

GET /api/health/pm2
Response:
{
  "workers": [
    {
      "id": 0,
      "cpu": 8.2,
      "memory": 1234567890,
      "restarts": 3,
      "uptime": 86400000,
      "status": "online"
    },
    {
      "id": 1,
      "cpu": 5.1,
      "memory": 987654321,
      "restarts": 1,
      "uptime": 72000000,
      "status": "online"
    }
  ]
}
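A response like this could be assembled from the process list that PM2's programmatic API returns. The sketch below only shows the mapping step and declares its own minimal input type. The field names (`pm_id`, `monit.cpu`, `monit.memory`, `pm2_env.restart_time`, `pm2_env.pm_uptime`, `pm2_env.status`) follow PM2's process description format, and the assumption that `pm_uptime` is a start timestamp in milliseconds should be verified against the PM2 version in use.

```typescript
// Minimal, assumed subset of a PM2 process description.
interface Pm2ProcessLike {
  pm_id?: number;
  monit?: { cpu?: number; memory?: number };
  pm2_env?: { restart_time?: number; pm_uptime?: number; status?: string };
}

interface WorkerHealth {
  id: number;
  cpu: number; // percent
  memory: number; // bytes
  restarts: number;
  uptime: number; // milliseconds since the worker started
  status: string;
}

// Maps one PM2 process entry to the shape of the health response.
function toWorkerHealth(proc: Pm2ProcessLike, nowMs: number): WorkerHealth {
  const startedAt = proc.pm2_env?.pm_uptime ?? nowMs;
  return {
    id: proc.pm_id ?? -1,
    cpu: proc.monit?.cpu ?? 0,
    memory: proc.monit?.memory ?? 0,
    restarts: proc.pm2_env?.restart_time ?? 0,
    uptime: Math.max(0, nowMs - startedAt),
    status: proc.pm2_env?.status ?? "unknown",
  };
}

const nowMs = 100000000; // fixed clock value so the example is reproducible
const sampleProcs: Pm2ProcessLike[] = [
  {
    pm_id: 0,
    monit: { cpu: 8.2, memory: 1234567890 },
    pm2_env: { restart_time: 3, pm_uptime: nowMs - 86400000, status: "online" },
  },
];
const workers = sampleProcs.map((p) => toWorkerHealth(p, nowMs));
console.log(JSON.stringify({ workers }, null, 2));
```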

The endpoint is protected by an internal API guard — it returns 404 for any caller that is not a health probe or internal service with the correct authorization header. External callers cannot even discover that it exists.
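The guard's behavior can be sketched as a small pure function. The header name `x-internal-token` and the shared-secret comparison are hypothetical stand-ins, since the actual authorization scheme is not described here.

```typescript
type HeaderMap = Record<string, string | undefined>;

// Decides the status code for a request to the health endpoint.
// Unauthorized callers get 404 (not 401 or 403) so that the response
// does not reveal that the route exists.
function healthGuardStatus(
  requestHeaders: HeaderMap,
  expectedToken: string
): 200 | 404 {
  const presented = requestHeaders["x-internal-token"]; // hypothetical header
  if (expectedToken === "" || presented !== expectedToken) {
    return 404;
  }
  return 200;
}

console.log(healthGuardStatus({ "x-internal-token": "s3cret" }, "s3cret")); // 200
console.log(healthGuardStatus({}, "s3cret")); // 404
```

In a real handler the comparison would more likely use a constant-time check such as `crypto.timingSafeEqual`, and the token would come from configuration, not a literal.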

The Validated Configuration

After extensive load testing in a realistic production scenario, a configuration like the following has proven to pass all thresholds:

Per Container:
  CPU:     2 vCPU
  Memory:  4 GiB
  PM2:     2 workers per container
  V8:      --max-old-space-size=1536 per worker

Scaling:
  SPA: min 5, max 20 replicas
  API: min 3, max 20 replicas

Result at 6× production load:
  Median response time: 165 ms
  Error rate: 0.82%
  CPU peak: 12% of allocation
  Memory peak: 60% of allocation

flowchart TB
    subgraph PC[Per Container]
        CPU[CPU: 2 vCPU]
        MEM[Memory: 4 GiB]
        PM2W[PM2: 2 workers per container]
        V8[V8: --max-old-space-size=1536 per worker]
    end

    subgraph SC[Scaling]
        SPA[SPA: min 5, max 20 replicas]
        API[API: min 3, max 20 replicas]
    end

    subgraph RES[Result at 6× production load]
        RT[Median response time: 165 ms]
        ER[Error rate: 0.82%]
        CPUU[CPU peak: 12% of allocation]
        MEMU[Memory peak: 60% of allocation]
    end

    PC --> SC --> RES

Lessons Learned

Node.js is not a “fire and forget” runtime

Unlike compiled languages with deterministic memory management, Node.js requires active memory management for long-lived processes. V8 heap caps, PM2 restarts, and minimum replica sizing are not optimizations — they are necessities.

Size for memory, not CPU

SSR workloads are I/O-bound. The CPU spends most of its time waiting for backend responses. Provision memory generously and CPU conservatively. A 1:2 vCPU:GiB ratio is a solid starting point.

Cold starts are the hidden enemy of auto-scaling

Auto-scaling sounds effortless until you realize new replicas take several seconds to become productive. During that window, existing replicas have to absorb the load. If they cannot, cascading failures follow. Setting minReplicas high enough to carry average production traffic keeps that window from becoming a failure.

Load test the production configuration, not the ideal one

It is tempting to load test with generous resources and right-size later. But right-sizing can expose failure modes that do not exist at larger sizes. Always load test the production configuration, not a more generous version.

What’s Next

Article 11: Multi-Environment Infrastructure — Azure Container Apps and the Configuration System — Managing three environments with generated configuration.
Article 12: Security in a Nuxt SSR App — CSRF, Azure AD, CSP, and More — The security layers that protect a server-rendered application.
Article 13: Observability and Distributed Tracing — Application Insights End-to-End — How every request is traced from the reverse proxy through the application to the backend.

Munir Husseini is a software architect specializing in full-stack TypeScript, .NET, and cloud-native architectures.