Nineteenth in a series about migrating from legacy architectures to a modern Nuxt 4 stack.
Architecture Decisions Have Consequences — Measure Them
Architecture decisions accumulate, and their combined effect only becomes visible under real load.
Before production, a large enterprise application was load-tested with production-equivalent patterns, not synthetic traffic. k6 replayed a model derived from real production logs: 20 pages, weighted by actual traffic share.
The Headline Numbers
| Metric | Legacy System | New System | Change |
|---|---|---|---|
| Median response time | 2,618 ms | 165 ms | 15.9× faster |
| Error rate (1× prod load) | 3.91% | 0.09% | 97% lower |
| Max tested capacity | ~99 RPM | 494+ RPM | 5× more |
| Infrastructure | 3× fixed VMs (24 vCPU, 96 GB) | Auto-scaled containers | Elastic |
| Lighthouse Performance (mobile) | ~50 | 97+ | Near-perfect |
A 2.6-second median means half of all requests took 2.6 seconds or longer. A 165 ms median means half of all responses arrive in roughly the time it takes to blink.
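The Change column follows directly from the raw figures in the table. A minimal sketch of the arithmetic, with helper names chosen here purely for illustration:

```typescript
// Ratio of two measurements, e.g. legacy median / new median.
function ratioOf(numerator: number, denominator: number): number {
  return numerator / denominator;
}

// Relative reduction from `before` to `after`, in percent.
function percentReduction(before: number, after: number): number {
  return ((before - after) / before) * 100;
}

const medianSpeedup = ratioOf(2618, 165);            // median response time, ms
const errorReduction = percentReduction(3.91, 0.09); // error rate at 1x load, %
const capacityGain = ratioOf(494, 99);               // max tested RPM, new / legacy

console.log(medianSpeedup.toFixed(1));  // 15.9
console.log(errorReduction.toFixed(1)); // 97.7
console.log(capacityGain.toFixed(1));   // 5.0
```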
Test Methodology
Traffic Pattern
The load test replayed production-equivalent traffic using k6’s HTTP module:
pie showData
title Traffic Distribution (top 10 pages)
"Homepage (28%)" : 28
"Product Overview (19%)" : 19
"Product Details (14%)" : 14
"Checkout Step 1 (9%)" : 9
"FAQ (7%)" : 7
"Contact (6%)" : 6
"About (5%)" : 5
"Legal / Imprint (4%)" : 4
"Blog Overview (3%)" : 3
"Other (11 pages) (5%)" : 5
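One way to drive a k6 script from a distribution like this is a weighted random pick per iteration. The sketch below is an assumption about how such a replay could be built, not the script used in the test. The route paths are hypothetical placeholders; only the weights come from the traffic distribution above.

```typescript
interface PageWeight {
  path: string;   // hypothetical route, not taken from the article
  weight: number; // traffic share in percent
}

const trafficMix: PageWeight[] = [
  { path: "/", weight: 28 },
  { path: "/products", weight: 19 },
  { path: "/products/details", weight: 14 },
  { path: "/checkout/step-1", weight: 9 },
  { path: "/faq", weight: 7 },
  { path: "/contact", weight: 6 },
  { path: "/about", weight: 5 },
  { path: "/legal", weight: 4 },
  { path: "/blog", weight: 3 },
  { path: "/other", weight: 5 }, // stands in for the 11 remaining pages
];

// Choose a path in proportion to its weight.
// `roll` must lie in [0, 1); in a k6 script it would come from Math.random().
function pickPage(mix: PageWeight[], roll: number): string {
  const total = mix.reduce((sum, page) => sum + page.weight, 0);
  let remaining = roll * total;
  let chosen = "";
  for (const page of mix) {
    chosen = page.path;
    remaining -= page.weight;
    if (remaining < 0) break; // roll fell into this page's bucket
  }
  return chosen;
}

console.log(pickPage(trafficMix, Math.random()));
```

Inside a k6 default function, the returned path would be appended to the base URL and passed to `http.get` from the `k6/http` module.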
Test Types
Two test types were run:
- Replay Test — constant load at 1× production traffic (99 RPM) for 30 minutes
- Ramp Test — linear ramp from 1× to 5× production traffic over 30 minutes
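The two test types map naturally onto k6's arrival-rate executors, which hold an iteration rate rather than a user count. The article does not say which executors were used, so the following is one plausible setup. The VU pool sizes and the builder function are assumptions, and each iteration is assumed to issue one page request.

```typescript
const PROD_RPM = 99; // 1x production load, from the test description

// Build a k6-style `scenarios` object for the replay and ramp tests.
function buildScenarios(baseRpm: number) {
  return {
    replay: {
      executor: "constant-arrival-rate",
      rate: baseRpm,       // iterations per timeUnit
      timeUnit: "1m",
      duration: "30m",
      preAllocatedVUs: 20, // assumption: pool size is not given in the article
    },
    ramp: {
      executor: "ramping-arrival-rate",
      startRate: baseRpm,
      timeUnit: "1m",
      preAllocatedVUs: 50, // assumption
      // Linear ramp from 1x to 5x over 30 minutes.
      stages: [{ target: baseRpm * 5, duration: "30m" }],
    },
  };
}

const loadScenarios = buildScenarios(PROD_RPM);
console.log(JSON.stringify(loadScenarios, null, 2));

// In a k6 script: export const options = { scenarios: loadScenarios };
// Scenarios in one options object start together unless offset with
// `startTime`, so the two tests would normally be run separately.
```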
Replay Test: 1× Production Load
The replay test answers: “Can the new system handle current production traffic?”
flowchart TB
title["Replay Test Results (1× production load = 99 RPM)"]
subgraph Legacy_System["Legacy System"]
L_Median["Median RT: 2,618 ms"]
L_P95["P95 RT: 8,500+ ms"]
L_Error["Error Rate: 3.91%"]
L_RPM["Requests/min: 99"]
L_Status["Status: Degraded"]
end
subgraph New_System["New System"]
N_Median["Median RT: 165 ms"]
N_P95["P95 RT: 450 ms"]
N_Error["Error Rate: 0.09%"]
N_RPM["Requests/min: 99"]
N_Status["Status: Healthy"]
end
L_Median --- N_Median
L_P95 --- N_P95
L_Error --- N_Error
L_RPM --- N_RPM
L_Status --- N_Status
The new system handles production traffic with roughly 94% lower median response time and about 97% fewer errors. The P95 of 450 ms is less than a fifth of the legacy median: 95% of new-system requests complete faster than a typical legacy request did.
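Results like these can be turned into an automated pass/fail gate with k6 thresholds. The limits below are hypothetical, chosen only so that the measured replay figures would pass; the limits actually configured for this test are not stated.

```typescript
// Hypothetical pass/fail limits in k6's threshold syntax.
const replayThresholds = {
  // Fewer than 1% of requests may fail (measured at 1x load: 0.09%).
  http_req_failed: ["rate<0.01"],
  // Median and 95th percentile of request duration, in milliseconds
  // (measured at 1x load: P95 of 450 ms).
  http_req_duration: ["med<500", "p(95)<1000"],
};

console.log(Object.keys(replayThresholds).join(", "));

// In a k6 script this would be part of the exported options:
//   export const options = { thresholds: replayThresholds };
```

When any threshold is crossed, k6 finishes with a non-zero exit code, which lets a CI pipeline fail the run automatically.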
Ramp Test: Finding the Ceiling
The ramp test answers: “How far can we push it before it breaks?”
xychart-beta
title "Ramp Test Results (1× → 5× production load)"
x-axis "Load (× production)" [1, 2, 3, 4, 5]
y-axis "Response Time (ms)"
line [2618, 4000, 5000, 6000, 8800]
line [165, 165, 165, 165, 165]
The median stayed flat at 165 ms even at 5× load. Additional load did not increase latency for the typical request.
The P95 degraded to 8.8 seconds at 5×, driven by scale-out lag. New replicas needed time to start; once they were online, they matched existing replica performance.
The Right-Sizing Experiment
Finding the minimum viable resource allocation is a critical part of load testing. Four configurations were tested:
| Config | vCPU | RAM | PM2 Workers | V8 Heap | Result |
|---|---|---|---|---|---|
| #1 | 4 | 8 GiB | 3 | 2048 MB | ✅ Stable, over-provisioned |
| #2 | 2 | 4 GiB | 2 | 1536 MB | ✅ Stable, efficient |
| #3 | 1 | 2 GiB | 2 | 1024 MB | ❌ Cascading failures |
| #4 | 2 | 4 GiB | 2 | 1536 MB | ✅ Validated (6× load) |
flowchart LR
A["Config #1: 4 vCPU / 8 GiB / 3 workers / 2048 MB heap"] -->|Over-provisioned| B["Config #2: 2 vCPU / 4 GiB / 2 workers / 1536 MB heap"]
B -->|Right-size further| C["Config #3: 1 vCPU / 2 GiB / 2 workers / 1024 MB heap"]
C -->|Cascading failures| D["Config #4: 2 vCPU / 4 GiB / 2 workers / 1536 MB heap (Validated at 6× load)"]
The Failed Right-Sizing (Config #3)
Reducing to 1 vCPU / 2 GiB caused a cascade:
sequenceDiagram
participant L as Load Generator
participant R1 as Replica 1
participant R2 as Replica 2
participant R3 as Replica 3
participant HP as Health Probe
Note over R1,R3: Failure Cascade at 1 vCPU / 2 GiB
L->>R1: t=0s: Traffic (99 RPM)
Note over R1: Memory: 1,791 / 2,048 MB (87.5%)
R1-->>R1: t=10s: V8 GC stalls<br/>Event loop blocked
HP->>R1: t=15s: Health probe
HP-->>HP: Timeout
HP->>R1: Mark unhealthy → restart
Note over R2: t=20s: Absorbs 2× traffic
L->>R2: Increased traffic
R2-->>R2: t=25s: Memory spike → restart
Note over R3: t=30s: Overloaded → restart
Note over R1,R3: t=35s: All replicas restarting
Note over L: t=45s: Zero capacity for ~10 seconds<br/>→ 5% error rate
V8 needs breathing room. At 87.5% memory utilization (1,791 of 2,048 MB), GC pauses blocked the event loop long enough for health probes to time out. The minimum viable compute here was 2 vCPU / 4 GiB, though the exact threshold depends on application complexity, page weight, and caching. The principle is general; the numbers are specific.
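A quick way to see why Config #3 had no headroom is to compare the combined V8 heap limits of all workers with the container's memory limit. This is a reading of the sizing table, not a diagnosis stated by the test itself. The interface and function names are illustrative, and MB and MiB are treated as equivalent, as the table does.

```typescript
interface SizingConfig {
  label: string;
  memoryMiB: number;    // container memory limit
  workers: number;      // PM2 workers per container
  heapLimitMiB: number; // --max-old-space-size per worker
}

const sizingConfigs: SizingConfig[] = [
  { label: "#1", memoryMiB: 8192, workers: 3, heapLimitMiB: 2048 },
  { label: "#2", memoryMiB: 4096, workers: 2, heapLimitMiB: 1536 },
  { label: "#3", memoryMiB: 2048, workers: 2, heapLimitMiB: 1024 },
];

// Share of container memory that the workers' heap limits could claim together.
function heapCommitRatio(config: SizingConfig): number {
  return (config.workers * config.heapLimitMiB) / config.memoryMiB;
}

for (const config of sizingConfigs) {
  console.log(config.label, (heapCommitRatio(config) * 100).toFixed(0) + "%");
}
// #1 75%
// #2 75%
// #3 100%
```

Config #3 is the only one where the workers' heap limits alone add up to the full container memory. Since a Node.js process also uses memory outside the old-space heap, that leaves nothing in reserve, which fits the observed 1,791 of 2,048 MB.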
The Validated Production Configuration
The configuration that passed all k6 thresholds (exit code 0) at 6× production load:
flowchart TB
subgraph SPA["SPA Containers"]
SPA_CPU["CPU: 2 vCPU"]
SPA_MEM["Memory: 4 GiB"]
SPA_PM2["PM2 Workers: 2 per container"]
SPA_HEAP["V8 Heap: 1536 MB (--max-old-space-size=1536)"]
SPA_MIN["Min Replicas: 5"]
SPA_MAX["Max Replicas: 20"]
end
subgraph API["API Containers"]
API_CPU["CPU: 0.5 vCPU"]
API_MEM["Memory: 1 GiB"]
API_MIN["Min Replicas: 3"]
API_MAX["Max Replicas: 20"]
end
subgraph Results["Result at 6× load"]
RES_MED["Median RT: 165 ms"]
RES_ERR["Error rate: 0.82%"]
RES_CPU["CPU peak: 12% of allocation"]
RES_MEM["Memory peak: 60% of allocation"]
end
SPA --> Results
API --> Results
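Inside each SPA container, the worker count and heap limit from the diagram could be expressed as a PM2 ecosystem definition. This is a sketch, not the project's actual file: the app name is hypothetical, and the script path is Nuxt's default build output, assumed here.

```typescript
// Sketch of a PM2 ecosystem definition mirroring the validated SPA container.
const pm2Ecosystem = {
  apps: [
    {
      name: "spa",                            // hypothetical app name
      script: "./.output/server/index.mjs",   // assumed: Nuxt's default server entry
      exec_mode: "cluster",                   // PM2 cluster mode
      instances: 2,                           // 2 workers per container
      node_args: "--max-old-space-size=1536", // V8 heap limit per worker, in MB
    },
  ],
};

console.log(JSON.stringify(pm2Ecosystem, null, 2));

// In an ecosystem.config.cjs file, this object would be assigned to module.exports.
```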
Cost Analysis
The validated configuration uses 50% less CPU and 50% less memory per replica than the initial over-provisioned config (2 vCPU / 4 GiB instead of 4 vCPU / 8 GiB):
flowchart TB
subgraph Legacy["Legacy (fixed)"]
L1["3× VM instances"]
L2["24 vCPU, 96 GB RAM — always on"]
L3["Cost: constant regardless of traffic"]
end
subgraph New["New (elastic)"]
N1["5–20 SPA replicas (2 vCPU, 4 GiB each)"]
N2["3–20 API replicas (0.5 vCPU, 1 GiB each)"]
N3["Per-second billing — pay for actual usage"]
N4["At idle: 5 SPA + 3 API"]
N5["At peak: 15 SPA + 8 API"]
N6["Average: ~60% of peak capacity billed"]
end
Legacy -->|"Migrated to"| New
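The article gives no prices, so cost cannot be computed here, but the reserved resources at each scale level can. A small sketch using only the replica sizes and counts from the diagram, with helper names that are illustrative:

```typescript
interface Footprint {
  vcpu: number;
  memGiB: number;
}

const spaReplica: Footprint = { vcpu: 2, memGiB: 4 };
const apiReplica: Footprint = { vcpu: 0.5, memGiB: 1 };

// Total reserved resources for a given number of SPA and API replicas.
function footprintFor(spaCount: number, apiCount: number): Footprint {
  return {
    vcpu: spaCount * spaReplica.vcpu + apiCount * apiReplica.vcpu,
    memGiB: spaCount * spaReplica.memGiB + apiCount * apiReplica.memGiB,
  };
}

const idleFootprint = footprintFor(5, 3);  // minimum replicas
const peakFootprint = footprintFor(15, 8); // stated peak

console.log(idleFootprint); // { vcpu: 11.5, memGiB: 23 }
console.log(peakFootprint); // { vcpu: 34, memGiB: 68 }
```

At the minimum replica count, the new stack reserves about half the vCPUs and roughly a quarter of the memory of the legacy VMs (24 vCPU, 96 GB). At the stated peak it reserves more vCPUs than the legacy VMs had. Whether the bill ends up lower therefore depends on unit prices and on how much time is spent at each level.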
Elastic billing lowers cost during low-traffic periods — nights, weekends, and holidays — while still scaling for spikes without permanent over-provisioning.
What the Numbers Mean for Architecture
Each architecture decision from earlier articles contributed to these numbers:
| Decision | Contribution |
|---|---|
| SSR (Article 1) | Eliminates client-side rendering delay |
| GraphQL Gateway (Article 2) | Single query per page instead of 3–5 REST calls |
| Multi-Tier Cache (Article 6) | Sub-ms content retrieval for cached pages |
| Deferred Hydration (Article 6) | Eliminates render-blocking JavaScript |
| Same-Origin Image Proxy (Article 6) | Improves LCP by reducing cross-origin overhead |
| PM2 Cluster Mode (Article 10) | Zero-downtime worker restarts |
| Container Apps Auto-Scaling (Article 11) | Elastic capacity, no over-provisioning |
flowchart LR
SSR["SSR"] --> PERF["Lower TTFB & faster first paint"]
GQL["GraphQL Gateway"] --> PERF
CACHE["Multi-Tier Cache"] --> PERF
HYDR["Deferred Hydration"] --> PERF
IMG["Same-Origin Image Proxy"] --> PERF
PM2["PM2 Cluster Mode"] --> REL["Resilience & zero-downtime deploys"]
AS["Container Apps Auto-Scaling"] --> CAP["Elastic capacity"]
PERF --> OUT["15.9× faster median\nLighthouse 97+"]
REL --> OUT
CAP --> OUT
No single decision produces 15.9×. It is the combination — each one removing a different bottleneck — that delivers the aggregate result.
Lessons Learned
Load test with production traffic patterns, not synthetic ones
A synthetic test hitting the homepage 100 times per second says nothing about real-world performance. Real traffic has a distribution — heavy pages, light pages, API calls, form submissions. The test must match it.
flowchart LR
A["Synthetic test: 100 req/s to homepage"] -->|Misleading| C["Unrealistic bottlenecks"]
B["Production-equivalent mix:\nheavy pages, light pages, APIs, forms"] -->|Accurate| D["Realistic capacity & latency insights"]
Right-sizing failures are the most valuable test results
The cascading failure at 1 vCPU / 2 GiB taught more about system behavior than all successful tests combined. It exposed the GC pressure threshold, health probe timing sensitivity, and cold-start vulnerability. These insights shaped the production configuration.
flowchart TB
F["Right-sizing attempt"] --> F1["Too small (1 vCPU / 2 GiB)"]
F1 --> F2["GC pressure & probe timeouts"]
F2 --> F3["Cascading restarts"]
F3 --> F4["Error budget impact"]
F4 --> F5["Refined production config\n(2 vCPU / 4 GiB, validated at 6×)"]
Median response time is the metric that matters most
P95 and P99 matter for tail latency, but the median describes the experience of the typical user. A flat median under increasing load (165 ms at 1× and 5×) shows that horizontal scaling absorbed the extra traffic without slowing the typical request.
xychart-beta
title "Median vs P95 under load"
x-axis "Load (× production)" [1, 2, 3, 4, 5]
y-axis "Response Time (ms)"
line [165, 165, 165, 165, 165]
line [450, 1200, 3000, 6000, 8800]
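The median can stay flat while P95 climbs because a small share of slow requests moves only the tail. The sketch below uses a nearest-rank percentile and made-up sample values (18 fast responses, 2 slow ones) purely for illustration; they are not measurements from the test.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of all
// samples at or below it.
function percentileOf(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.min(
    sorted.length,
    Math.max(1, Math.ceil((p / 100) * sorted.length)),
  );
  return sorted[rank - 1] as number;
}

// Illustrative data: 18 requests served by warm replicas, 2 that waited
// for a replica to start.
const illustrativeMs: number[] = [
  ...Array<number>(18).fill(165),
  8800,
  8800,
];

console.log(percentileOf(illustrativeMs, 50)); // 165
console.log(percentileOf(illustrativeMs, 95)); // 8800
```

With 10% of requests hitting a cold replica, the median is untouched while P95 reflects the slow path entirely.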
15× is not an optimization — it is a different architecture
A 15.9× improvement does not come from optimizing an existing system. It comes from removing fundamental bottlenecks: dual rendering, multi-source data joining, absence of caching, fixed infrastructure. The improvement is architectural, not incremental.
What’s Next
- Article 16: The Full Picture — What the New Concept Delivers — Synthesis for decision-makers and architects.
- Article 17: The @delegate Directive Deep Dive — Cross-Subgraph Field Resolution — A technical deep dive into the most powerful schema stitching feature.
- Article 18: Building a Headless Design System in Vue 3 — The Compose Pattern — Separating style logic from templates.
Munir Husseini is a software architect specializing in full-stack TypeScript, .NET, and cloud-native architectures.