We tested the internet's favorite stat-arb template. It died in training.

AlgoProven Research · June 2026 · #3 in a series on why backtests lie · #1 · #2

Z-score pairs mean reversion — long the cheap leg, short the rich one, wait for the spread to revert — is the most-cloned strategy in retail quant. Every open-source tutorial ships it. So we ran it on futures with the same locked protocol we use for everything, expecting at least a weak survivor. It failed the very first gate. The interesting part is how.

The setup

We traded three correlated pairs on the log price ratio: ES/NQ (equity indices), GC/SI (gold/silver), and CL/RB (crude/gasoline). Entry when the rolling z-score crosses ±2, exit on reversion toward the mean, disaster stop at |z| = 3.5, daily bars, 2008–2026. The protocol is locked, the same one that killed four intraday families in part one: train on 2008–2017, judge on 2018–2026, and the edge has to survive both blocks. No tuning on the test set.
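
For readers who want the rules in concrete form, here is a minimal sketch of the signal logic described above. It is not our production engine. The ±2 entry and 3.5 stop come from the setup; the 20-bar lookback, the exit at z = 0, and the function names are illustrative assumptions, since the rules above only say "rolling" and "reversion toward the mean".

```python
import math
from statistics import mean, stdev

ENTRY_Z = 2.0   # entry band from the setup
STOP_Z = 3.5    # disaster stop from the setup
EXIT_Z = 0.0    # ASSUMPTION: "reversion toward the mean" read as z reaching zero
LOOKBACK = 20   # ASSUMPTION: the window length is not stated in the article


def rolling_zscores(prices_a, prices_b, lookback=LOOKBACK):
    """Z-score of log(A/B) against the trailing window (today excluded)."""
    spread = [math.log(a / b) for a, b in zip(prices_a, prices_b)]
    zs = [None] * len(spread)
    for i in range(lookback, len(spread)):
        window = spread[i - lookback:i]
        sd = stdev(window)
        if sd > 0:  # a flat window has no usable z-score
            zs[i] = (spread[i] - mean(window)) / sd
    return zs


def positions(zs):
    """Per-bar position: -1 = short A / long B, +1 = long A / short B, 0 = flat."""
    pos, prev, out = 0, None, []
    for z in zs:
        if z is None:
            out.append(pos)
            continue
        if pos == 0 and prev is not None:
            # Enter only on a fresh cross of the band, not while already beyond it.
            if prev < ENTRY_Z <= z < STOP_Z:
                pos = -1   # ratio is rich: short A, long B
            elif prev > -ENTRY_Z >= z > -STOP_Z:
                pos = 1    # ratio is cheap: long A, short B
        elif pos == -1 and (z <= EXIT_Z or z >= STOP_Z):
            pos = 0        # reverted, or stopped out
        elif pos == 1 and (z >= -EXIT_Z or z <= -STOP_Z):
            pos = 0
        prev = z
        out.append(pos)
    return out
```

Requiring a fresh cross means a stopped-out trade cannot re-enter on the next bar while z is still beyond the band, which is one reasonable reading of "crosses ±2".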

The result

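| Pair    | Train PF (08–17) | Test PF (18–26) | Verdict                |
|---------|------------------|-----------------|------------------------|
| ES / NQ | 0.72             | 1.09            | fails (train negative) |
| GC / SI | 0.20             | 0.25            | −$948k, broken         |
| CL / RB | 0.71             | 0.69            | fails both blocks      |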

ES/NQ — the cleanest pair, with synchronized quarterly rolls — posted a training profit factor of 0.72. That's a losing system in the first block, full stop. Its test-block 1.09 doesn't matter: a strategy you can only call profitable after seeing the test data isn't a strategy, it's hindsight. By protocol, ES/NQ is dead and there is no tuning round.
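
The gate itself is simple enough to write down. The sketch below uses the standard definition of profit factor (gross profit divided by gross loss) and a pass threshold of 1.0. The threshold and the function names are assumptions for illustration; the point is only that the training block is checked without any reference to how the test block turned out.

```python
def profit_factor(trade_pnls):
    """Gross profit divided by gross loss over a list of per-trade P&Ls."""
    gross_profit = sum(p for p in trade_pnls if p > 0)
    gross_loss = -sum(p for p in trade_pnls if p < 0)
    if gross_loss == 0:
        return float("inf") if gross_profit > 0 else 0.0
    return gross_profit / gross_loss


def gate(train_pf, test_pf, threshold=1.0):
    """Locked two-block gate: both blocks must clear the threshold.

    ASSUMPTION: threshold=1.0 (break-even) is illustrative, not a quoted rule.
    """
    train_ok = train_pf > threshold
    test_ok = test_pf > threshold
    if train_ok and test_ok:
        return "passes"
    if not train_ok and not test_ok:
        return "fails both blocks"
    return "fails: train negative" if not train_ok else "fails: test negative"


print(gate(0.72, 1.09))  # ES/NQ figures from the table
print(gate(0.71, 0.69))  # CL/RB figures from the table
```

A profitable test block cannot rescue a losing training block: there is no branch in which a test PF above 1.0 overrides a training failure.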

The lesson is in how it broke

The metal and energy pairs lost catastrophically, and digging into why produced the finding worth more than the result:

Non-synchronized contract rolls quietly invalidate the naive pairs backtest. On GC/SI, 78 of 118 exits were forced rolls — the two contracts expire on different calendars, so the engine kept getting kicked out of positions before the spread could revert. You aren't testing mean reversion at that point. You're testing a roll schedule.
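
A cheap diagnostic catches this before you read any P&L: tag every exit with a reason and look at the mix. The sketch below is an illustration, not our engine. The reason labels and the roll dates are hypothetical; only the 78-of-118 count comes from the GC/SI run.

```python
from collections import Counter
from datetime import date


def exit_shares(exit_reasons):
    """Fraction of exits attributed to each reason label."""
    counts = Counter(exit_reasons)
    total = sum(counts.values())
    return {reason: n / total for reason, n in counts.items()}


def forced_exit_dates(rolls_a, rolls_b):
    """Dates on which a pair position is closed because either leg rolls."""
    return sorted(set(rolls_a) | set(rolls_b))


# GC/SI: 78 of 118 exits were forced rolls. The split of the other 40 is not
# reported in the article, so "other" is a placeholder label.
reasons = ["forced_roll"] * 78 + ["other"] * 40
print(round(exit_shares(reasons)["forced_roll"], 2))  # 0.66

# HYPOTHETICAL roll dates, chosen only to show the effect of staggered calendars.
synced = forced_exit_dates(
    [date(2024, 3, 14), date(2024, 6, 13)],
    [date(2024, 3, 14), date(2024, 6, 13)],
)
staggered = forced_exit_dates(
    [date(2024, 3, 26), date(2024, 5, 29)],
    [date(2024, 2, 27), date(2024, 4, 26)],
)
print(len(synced), len(staggered))  # 2 4
```

When both legs roll on the same day, the pair is interrupted once per cycle. When the calendars are staggered, every roll on either leg interrupts the position, so the number of forced exits roughly doubles.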

Two more structural problems compounded it. A log price ratio with no hedge ratio assumes the two legs are one-to-one — but NQ outran ES for eighteen straight years, so "short the rich one" was a slow short of a secular uptrend. And the metals carried their own regime shocks: the 2020 silver squeeze and the 2025 gold run each handed five- and six-figure losses to a strategy betting on reversion that never came.
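
To make the hedge-ratio point concrete, here is a small sketch on synthetic data. It is not part of the tested strategy, which used the raw ratio. The OLS slope on log prices is one common way to estimate a hedge ratio, used here as an assumption, and the price series is invented so that one leg compounds 1.5 times as fast as the other.

```python
import math


def ols_hedge_ratio(prices_y, prices_x):
    """Slope of log(y) regressed on log(x), with an intercept."""
    ly = [math.log(p) for p in prices_y]
    lx = [math.log(p) for p in prices_x]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    cov = sum((x - mx) * (y - my) for x, y in zip(lx, ly))
    var = sum((x - mx) ** 2 for x in lx)
    return cov / var


def log_spread(prices_y, prices_x, beta=1.0):
    """log(y) - beta * log(x); beta=1.0 is the naive log price ratio."""
    return [math.log(y) - beta * math.log(x) for y, x in zip(prices_y, prices_x)]


# SYNTHETIC data: x grows 1% per bar, y grows 1.5x as fast in log terms.
x = [100 * 1.01 ** t for t in range(50)]
y = [p ** 1.5 for p in x]

beta = ols_hedge_ratio(y, x)
naive = log_spread(y, x)            # one-to-one: drifts upward with the trend
hedged = log_spread(y, x, beta)     # beta-weighted: flat on this toy series

print(round(beta, 3))                         # 1.5
print(round(naive[-1] - naive[0], 3) > 0)     # True: the naive spread trends
```

On the naive spread, every "rich" reading is just the trend continuing, so the short leg bleeds. Note that fitting beta is itself a parameter choice: under a locked protocol it would have to be estimated on the training block only and then frozen.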

Why we publish our failures

We could have kept tuning — shorten the lookback, widen the bands, drop the broken pairs, switch to calendar-aware rolls — until some configuration showed a green number. That is exactly the process that manufactures fake edges, and we wrote part two about how a real edge looks like a small number that survives nineteen years, not a big number you tuned into existence.

The discipline is the product. Every strategy we'd ever put on a real prop account passes a real-fill, locked train/test gate first — and most popular ideas, including this one, don't. The two that did (a 1.30 profit-factor overnight edge and a 1.10 session-flow read) are what our live bots actually trade, in public, losses included.