
We Combined 1.1 Million Strategy Pairs. Three in Four Diluted the Better One.

2026.07.04·9 min read·Rulyfi

Key Takeaways

We combined all 1,131,321 pairs of gauntlet-surviving Bitcoin strategies, 417 longs by 2,713 shorts, with zero extra backtests, and scored each pair's luck-adjusted confidence (the Deflated Sharpe Ratio) in closed form. 74.61% came out worse than simply keeping the better of the two strategies alone; after collapsing parameter twins, 74.21% still did.

Worse than your best is not worse than luck. 91.7% of the composites still cleared the luck baseline (DSR > 0.5), and only 4.44% fell below even the weaker leg. Combining rarely beats your best idea, and it rarely breaks it either.

Combining can never raise per-trade quality: by the Cauchy-Schwarz inequality, a composite's per-trade Sharpe is at most the better leg's. The only thing pooling buys is evidence, more trades and a lower luck hurdle, which is why a strong-plus-weak pair dilutes and a strong-plus-strong pair can lift.

Only 3 of 1,131,321 pairs cleared DSR 0.95, one idea after removing twins, and every one was a strong long pooled with a strong, nearly independent short. The rule that falls out: pool two strong, nearly independent survivors, and never average a champion with an also-ran.

You finally have a strategy that survives. It cleared the luck baseline, it traded often enough to trust, and it held up out of sample, so you do the natural thing: you reach for a second one. Hedge the long with a short, stack another signal for confirmation, tell yourself two edges are safer than one. We measured that instinct across every pair of our best strategies, and three out of four times, combining made things worse.

That sounds like a reason never to combine, but it is not. Worse than your best strategy is not the same as worse than luck: 91.7% of these combinations still cleared the luck baseline, and only 4.44% fell below even the weaker of the two legs. Combining rarely improves your best idea, and it usually dilutes it, but it rarely breaks it. The gap between the one in four that helped and the three in four that hurt is not random. It is a rule you can check before you merge a single pair, and there is a one-line reason, the Cauchy-Schwarz inequality, that predicts all of it.

A note on scope before the numbers, because it decides what the result means. We combined one surviving long with one surviving short on the same asset, Bitcoin, modeled as non-overlapping trade streams: a long-only bot and a short-only bot taking turns. This measures evidence pooling, not simultaneous portfolio diversification. It does not speak to holding many uncorrelated strategies across different assets, to three-way or ten-way portfolios, or to combining two longs. Every figure below is a closed-form arithmetic estimate from the two legs' return moments, not a fresh backtest. This is the distribution behind the single pooled result that closed our hundred-million-backtest study: that earlier study showed that one pair, and this one shows the other 1,131,320.


Market	BTC/USDT spot (Binance), 1h
Population	417 long by 2,713 short gauntlet survivors, so 1,131,321 long-short pairs
Method	closed-form pooled moments, non-overlapping (alternating) trades, an arithmetic estimate, not a re-backtest
Metric	lift = composite DSR − max(component DSR), both at the shared pooled hurdle SR* = 0.4695, N = 96.2 million
Short leg	modeled on the spot price series, no funding, no borrow

1. The free lunch everyone recommends

Every quant blog gives the same advice: combine uncorrelated strategies and your risk-adjusted numbers improve. Hedge a long with a short so you make money in both directions. Stack a second indicator so trades only fire when both agree. Run two systems side by side so the drawdowns of one land in the good months of the other. It is the closest thing trading has to folklore, and the appeal is obvious: diversification is supposed to be the one free lunch in markets. We wanted to know whether the folklore survives an honest test, so instead of judging combinations on raw Sharpe, where more trades and smoother equity always look better, we judged them on a luck-adjusted basis. The Deflated Sharpe Ratio asks how confident you can be that a track record is not just the luckiest fluke out of everything you searched, and it is unimpressed by a longer record on its own.¹ Graded that way, the free lunch has a price tag.

2. How to combine two strategies without running a single new backtest

You do not have to backtest pairs to know how they would score. A pair's combined score has a closed form in the two strategies' return moments. Brute-forcing every long-short pair from our scan would have meant 5×10⁷ by 5×10⁷, about 2.5×10¹⁵ backtests, and it would have taught us nothing new, because a good composite can only be built from good components you have already found. So we took the 3,130 strategies that survived the gauntlet in our hundred-million-backtest study (the ones that cleared every filter: 30 or more trades, a Deflated Sharpe above the luck baseline, and a positive out-of-sample result), 417 longs and 2,713 shorts, and combined all 417 × 2,713 = 1,131,321 long-short pairs directly from their moments, in seconds, with zero extra backtests.

To make the comparison honest we asked each pair the same question: did combining beat simply keeping the better of the two strategies on its own? That is the lift, the composite's Deflated Sharpe minus the higher of its two components' Deflated Sharpe, with both scored at the identical pooled hurdle (SR* = 0.4695 across N = 96.2 million trials, the 95.1 million original valid trials plus the pairs). Holding the hurdle fixed on both sides isolates the effect of pooling from the effect of a bigger search, so nothing is graded on a curve. A positive lift means the combination earned its complexity; a negative one means you would have been better off with the single strategy you already had.
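As a minimal sketch of that definition, the lift is a single subtraction once all three Deflated Sharpe values are scored at the same hurdle. The function names and the three verdict labels below are illustrative, not part of any Rulyfi export; the example figures are the two pairs reported later in this article.

```python
def lift(composite_dsr: float, long_dsr: float, short_dsr: float) -> float:
    """Composite DSR minus the better leg's DSR, all scored at the same hurdle."""
    return composite_dsr - max(long_dsr, short_dsr)


def verdict(composite_dsr: float, long_dsr: float, short_dsr: float) -> str:
    """Illustrative three-way label; a tie with the better leg counts as not lifted."""
    if composite_dsr > max(long_dsr, short_dsr):
        return "lifted"
    if composite_dsr < min(long_dsr, short_dsr):
        return "below weaker leg"
    return "diluted"


# Biggest lift reported in this article: two legs at 0.810 pooled to 0.899.
print(round(lift(0.899, 0.810, 0.810), 3), verdict(0.899, 0.810, 0.810))

# Worst dilution reported in this article: long 0.515, short 0.725, composite 0.555.
print(round(lift(0.555, 0.515, 0.725), 3), verdict(0.555, 0.515, 0.725))
```

The comparison is only meaningful when the composite and both legs are deflated against the same SR*, which is why the hurdle is held fixed on both sides.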

3. Three out of four combinations got worse

Across all 1,131,321 pairs, 25.39% improved on the better leg and 74.61% came out worse. The median pair lost 0.023 of Deflated Sharpe against its stronger component, and the 75th-percentile pair barely broke even at +0.0003, so only about the top quarter of combinations helped at all. At the far end, 4.44% were diluted so badly that the composite scored below even the weaker of the two strategies.

How combining changed each pair's luck-adjusted confidence, across all 1,131,321 pairs. Anything left of break-even scored worse than simply keeping the better strategy alone, where 74.61% of pairs land. Source: the 3,130 gauntlet survivors, closed-form composite.

This is not a small-sample fluke or a counting artifact. Many of our survivors are parameter twins, near-identical strategies that differ only in an exit setting; collapse them to 121 distinct long ideas and 825 distinct short ideas, 99,825 unique pairs, and the split is 25.8% better, 74.2% worse. Same story. Combining our best strategies, the ones that already beat luck, usually produced something with less luck-adjusted confidence than the stronger of the two had on its own.

4. Worse than your best is not worse than luck

The 74.61% number is easy to misread, so here is the guardrail: worse than your best strategy is not the same as worse than luck. Every one of these legs was already a gauntlet survivor, already sitting above the luck baseline, and pooling them mostly kept them there. 91.7% of the 1,131,321 composites still cleared DSR > 0.5, and only 4.44% fell below even the weaker leg. Combining rarely breaks a strategy. What it usually does is subtler and more expensive: it spends your best idea to buy confidence you did not need, leaving you with something that still beats a coin flip but no longer beats the strategy you started with. That is dilution, not destruction, and telling the two apart is the whole reason to grade on a luck-adjusted scale instead of on raw returns.

5. Why you cannot average your way to a better edge

There is a one-line reason three in four combinations dilute, and it is not bad luck. It is arithmetic. A long and a short on the same asset trade at different times, so the combined track record is just the two trade streams concatenated. By the Cauchy-Schwarz inequality, the per-trade Sharpe of that concatenation can never exceed the per-trade Sharpe of its better component. Combining, in other words, can never raise the quality of your edge. It can only ever pull the average toward the weaker leg.
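For readers who want the step spelled out, here is one way to sketch the bound under the pooling model used in this article. The symbols are shorthand introduced for this sketch: w_i = n_i / n is leg i's share of the pooled trades (so the weights sum to 1), μ_i and σ_i are its per-trade mean and standard deviation, S_i = μ_i / σ_i is its per-trade Sharpe, and S_max ≥ 0 is the larger of the two.

```latex
\begin{aligned}
M_1 &= \sum_i w_i \mu_i, \qquad
V = \sum_i w_i \sigma_i^2 + \sum_i w_i (\mu_i - M_1)^2 \;\ge\; \sum_i w_i \sigma_i^2, \\
M_1 &= \sum_i w_i S_i \sigma_i
  \;\le\; S_{\max} \sum_i w_i \sigma_i
  \;\le\; S_{\max} \sqrt{\sum_i w_i}\,\sqrt{\sum_i w_i \sigma_i^2}
  \;\le\; S_{\max} \sqrt{V}, \\
S_{\text{pair}} &= \frac{M_1}{\sqrt{V}} \;\le\; S_{\max}.
\end{aligned}
```

The middle inequality is Cauchy-Schwarz applied to the vectors (√w_i) and (√w_i σ_i). When S_max is positive, equality needs both legs to share the same per-trade mean and the same standard deviation, so any real difference between the legs pulls the pooled Sharpe strictly below the better one.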

So where does any improvement come from? Not quality, but evidence. Pooling two strategies gives you more trades, and more trades lower the luck hurdle a track record has to clear, because it is harder to fake a good Sharpe over 70 trades than over 35. That is the √n effect, and it is the only thing combining has to offer. The tug-of-war decides everything: if the second leg is nearly as strong as the first, per-trade quality barely drops while the trade count roughly doubles, the pooled evidence wins, and the composite rises. If the second leg is weak, it drags per-trade quality down faster than the extra trades can rebuild confidence, and the composite falls. A weak partner is a tax, not a hedge.
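The tug-of-war can be seen in a toy calculation. The sketch below is not the article's dataset: the two legs are hypothetical per-trade moments chosen for illustration, and it assumes normal-shaped returns (skew 0, kurtosis 3) so that only quality and trade count vary. The hurdle is the pooled SR* quoted in this article.

```python
from math import sqrt
from statistics import NormalDist

SR_STAR = 0.4695  # pooled hurdle quoted in this article


def pooled_sharpe(legs):
    """legs: list of (n_trades, mean, std) per-trade tuples, non-overlapping."""
    n = sum(k for k, _, _ in legs)
    m1 = sum(k / n * mu for k, mu, _ in legs)
    m2 = sum(k / n * (sd ** 2 + mu ** 2) for k, mu, sd in legs)
    return n, m1 / sqrt(m2 - m1 ** 2)


def dsr(sharpe, n, skew=0.0, kurt=3.0, sr_star=SR_STAR):
    """Deflated Sharpe z-score; skew=0 and kurt=3 are an assumption of this sketch."""
    z = (sharpe - sr_star) * sqrt(n - 1)
    z /= sqrt(1 - skew * sharpe + (kurt - 1) / 4 * sharpe ** 2)
    return NormalDist().cdf(z)


def score(legs):
    n, sharpe = pooled_sharpe(legs)
    return dsr(sharpe, n)


# Hypothetical legs: (trades, per-trade mean, per-trade std)
strong = (35, 0.060, 0.10)  # per-trade Sharpe 0.60
weak = (35, 0.048, 0.10)    # per-trade Sharpe 0.48

dsr_strong = score([strong])
dsr_weak = score([weak])
dsr_strong_pair = score([strong, strong])  # same quality, double the evidence
dsr_mixed_pair = score([strong, weak])     # more evidence, lower quality

print(f"strong alone  {dsr_strong:.3f}")
print(f"weak alone    {dsr_weak:.3f}")
print(f"strong+strong {dsr_strong_pair:.3f}")
print(f"strong+weak   {dsr_mixed_pair:.3f}")
```

With these inputs the strong-plus-strong pair scores above the single strong leg, while the strong-plus-weak pair lands between its two legs: diluted relative to the better one, but not broken.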

This is also the precise, and often disappointing, answer to a common question: does diversification reduce overfitting? Not the way people hope. Diversification reduces variance, the bumpiness of your equity curve, by holding positions that do not move together. It does not undo overfitting. Each leg's selection bias is already baked in before you combine anything, and the composite is then judged against an even higher hurdle for having searched more. On a luck-adjusted basis, stapling a strong strategy to a weak one usually lowers your confidence rather than raising it.

Every one of the 1,131,321 pairs, by the better component's Deflated Sharpe (horizontal) against the composite's (vertical). Below the dashed diagonal the composite scored worse than the better leg alone, where 74.61% of pairs sit; only the strong-plus-strong corner rises above it. Source: the 3,130 gauntlet survivors, closed-form composite.

6. When combining actually pays: two strong, nearly independent legs

The one in four that improved were not random either. They almost all shared one profile: two strong legs of comparable quality, pulling in nearly independent directions. In the brackets below, the tp, sl, and ttl tokens are each strategy's take-profit, stop-loss, and time-limit tiers, followed by its trade count and return. The single biggest lift in the whole dataset came from two evenly matched legs, a long BB(60,2)+StochRSI(5,4,10,13) [tp0/sl3/ttl2, 34 trades, +91%] at DSR 0.810 and a short MACD(12,37,8)+StochRSI(4,4,25,22) [tp3/sl1/ttl0, 33 trades, +117%] also at DSR 0.810. Pooled into 67 trades, the pair rose to a composite Deflated Sharpe of 0.899, an arithmetic estimate from the two legs' moments, not a backtested result. Neither leg lost much per-trade quality, and the doubled evidence did the rest.

Contrast that with the worst dilution in the set: a strong short, MACD(12,19,9)+StochRSI(4,4,7,19) [tp0/sl1/ttl0, 35 trades] at DSR 0.725, paired with a weak long, MACD(16,13,5)+BB(50,2) [tp3/sl2/ttl2, 36 trades] at DSR 0.515. The composite came out at 0.555, below the strong short's own 0.725. The weak partner did exactly what the arithmetic predicts.

The clearest way to see that the partner decides the outcome, not the strategy, is to watch one strategy take two different partners:

Long strategy (comparable strength)	Short partner	Composite DSR
BB(60,2)+StochRSI(5,4,10,13), DSR 0.810	a strong short, DSR 0.810	0.899
BB(60,2)+StochRSI(5,4,10,13), DSR 0.825	a weak short, DSR 0.461	0.658

The same long indicator pairing, at essentially the same strength, rose to 0.90 beside a strong partner and sank to 0.66 beside a weak one. (The same lesson hides inside a single pair of indicators: MACD(16,13,5)+BB(50,2) is both the strongest long in the scan at one exit setting and one of the weakest at another. The exits move the Deflated Sharpe as much as the indicators do.)

The rule's sharpest expression is the ceiling itself. Only 3 of the 1,131,321 pairs, one idea once you fold the twins, ever crossed DSR 0.95, the strictest line in the study, and all three were a strong long pooled with a strong, nearly independent short. The best of them took the single strongest component in the entire scan, at 0.878 on the pooled hurdle, and lifted it to 0.9502 by pooling it with a strong short partner: a long MACD(16,13,5)+BB(50,2) [tp2/sl3/ttl1, 37 trades, +168%] combined with a short MACD(16,37,6)+StochRSI(3,4,28,22) [tp3/sl3/ttl0, 35 trades, +157%], giving 72 trades at a per-trade Sharpe of 0.758. That is an arithmetic estimate, not a backtested result, and not investment advice; the returns quoted are each leg's own backtest, never a promise for the pair. Note that this best-confidence pair is not the best-lift pair: the champion started so high it had little room to climb, while the biggest jump belonged to two mid-strength legs. Both obey the same rule.

Why did three in four fail that rule here? Because on Bitcoin's 5.7-year uptrend most of the surviving shorts were marginal, so most pairs married a decent long to a weak short, and the weak leg won. What makes a long and a short worth pooling in the first place is that they are structurally near-independent: across the top configurations, long and short performance correlated just 0.148, so the two legs contribute close to independent evidence rather than double-counting one bet. But independence is necessary, not sufficient. The pair also needs two genuinely strong legs. Put the three findings together and the decision rule is simple: pool two survivors only when both are individually strong, close in strength, and nearly independent. Otherwise keep the one you have.

7. Do it yourself

Every number in this article comes from columns a Rulyfi full-export already contains, so you can run the same test on your own scans. The columns you need are srObs, skew, kurt, tradeCount, totalReturn, direction. The full-export itself is on every plan, Free included, at no credit cost, and it carries direction, tradeCount, totalReturn. The three columns that do the deflation, srObs, skew, kurt, are validation columns that any paid plan adds; the Free export leaves them out. From those, a pair's composite Deflated Sharpe is a few lines of arithmetic:

# for each leg i in {long, short}, from its export row:
mean_i, std_i  =  invert compounded totalReturn over tradeCount_i     # per-trade moments

# pool the two legs as non-overlapping (alternating) trade streams:
n    = n_long + n_short
w_i  = n_i / n
M1   = Σ w_i · mean_i
M2   = Σ w_i · (std_i² + mean_i²)
Sharpe_pair = M1 / √(M2 − M1²)                                        # composite per-trade Sharpe

# deflate at the pooled hurdle (same z-formula as the earlier study):
SR* = std(srObs across ALL trials) · Φ⁻¹(1 − 1/N)                     # N = full trial population
DSR_pair = Φ( (Sharpe_pair − SR*) · √(n−1)
              / √(1 − skew·Sharpe_pair + (kurt−1)/4 · Sharpe_pair²) )

Compute SR* once from the whole population, not just the survivors, since the hurdle depends on how many strategies you searched. Pool skew and kurtosis the same trade-weighted way, and the whole thing stays an arithmetic estimate under a non-overlapping-trades assumption, never a re-backtest. The per-strategy deflation that feeds it is the same z-formula the earlier hundred-million-backtest study spells out in its own recipe. One honest note on scope: we did not ship this as a button. The full-export opens the door, and teaching the method is what this article is for.
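The recipe above can be sketched in Python roughly as follows. Several details are assumptions of this sketch rather than confirmed Rulyfi behavior: totalReturn is treated as a compounded fraction (0.91 for +91%), the per-trade mean is taken as the geometric per-trade return, the per-trade standard deviation is backed out as mean / srObs, kurt is treated as raw kurtosis (3 for a normal), and skew and kurtosis are pooled through trade-weighted raw moments. The example row values are hypothetical.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist

_NORMAL = NormalDist()


@dataclass
class Leg:
    """One export row; field names mirror the export columns."""
    srObs: float        # observed per-trade Sharpe
    skew: float
    kurt: float         # assumed raw kurtosis (3.0 for a normal)
    tradeCount: int
    totalReturn: float  # assumed compounded fraction, 0.91 = +91%


def raw_moments(leg):
    """First four raw per-trade moments under this sketch's assumptions."""
    m = (1.0 + leg.totalReturn) ** (1.0 / leg.tradeCount) - 1.0  # assumed inversion
    s = m / leg.srObs                                            # assumed std
    return (
        m,
        s ** 2 + m ** 2,
        leg.skew * s ** 3 + 3 * m * s ** 2 + m ** 3,
        leg.kurt * s ** 4 + 4 * m * leg.skew * s ** 3 + 6 * m ** 2 * s ** 2 + m ** 4,
    )


def hurdle(sr_std, n_trials):
    """SR* = std(srObs across all trials) * inverse-normal(1 - 1/N)."""
    return sr_std * _NORMAL.inv_cdf(1.0 - 1.0 / n_trials)


def composite_dsr(long_leg, short_leg, sr_star):
    """Pool two non-overlapping trade streams; return (per-trade Sharpe, DSR)."""
    n = long_leg.tradeCount + short_leg.tradeCount
    pooled = [0.0, 0.0, 0.0, 0.0]
    for leg in (long_leg, short_leg):
        w = leg.tradeCount / n
        for k, r in enumerate(raw_moments(leg)):
            pooled[k] += w * r
    m1, m2, m3, m4 = pooled
    var = m2 - m1 ** 2
    sharpe = m1 / sqrt(var)
    skew = (m3 - 3 * m1 * m2 + 2 * m1 ** 3) / var ** 1.5
    kurt = (m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4) / var ** 2
    z = (sharpe - sr_star) * sqrt(n - 1)
    z /= sqrt(1 - skew * sharpe + (kurt - 1) / 4 * sharpe ** 2)
    return sharpe, _NORMAL.cdf(z)


# Hypothetical rows, not taken from the article's dataset.
long_leg = Leg(srObs=0.62, skew=0.3, kurt=3.4, tradeCount=34, totalReturn=0.91)
short_leg = Leg(srObs=0.60, skew=0.2, kurt=3.2, tradeCount=33, totalReturn=1.17)
pair_sharpe, pair_dsr = composite_dsr(long_leg, short_leg, sr_star=0.4695)
print(f"composite per-trade Sharpe {pair_sharpe:.3f}, composite DSR {pair_dsr:.3f}")
```

Passing SR* in as an argument keeps the hurdle fixed across the composite and both legs. If your export defines kurt as excess kurtosis, or totalReturn as a percentage, convert before building the rows.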

8. Caveats

If a method's failure modes make you trust it less, it was overselling. Here are ours.

The composite is arithmetic, not a re-run. Every composite figure here, 0.9502, 0.899, 0.658, 0.555, is closed-form from the two components' moments. It assumes non-overlapping trades, reconstructs each leg's mean by inverting compounded totalReturn (a second-order approximation), and has not been end-to-end re-backtested. A rigorous estimate, not a result.
Non-overlap is a modeling choice. We model a long-only bot and a short-only bot alternating in time. If your two strategies genuinely held positions at the same time, you would net them rather than concatenate their trades, a different operation where classic diversification can raise the Sharpe. This finding does not speak to that case.
Pairwise long-short only, not a portfolio claim. One long plus one short on one asset. Not three-way, not ten-way, not multi-asset, not two longs. This answers whether to add one partner, not how to build a book of strategies.
Small samples cut both ways. Survivors sit at 30 to 63 trades, exactly where the skew and kurtosis terms that do the deflation are noisiest. At these counts, a genuinely selective strategy and one whose luck has not yet washed out are indistinguishable in-sample; only forward data separates them.
Near-independence is population-level, not per-pair. The 0.148 correlation is measured across the top configurations of each direction, not for any individual pair. We use it only to argue that a long and a short are structurally close to independent.
BTC/USDT spot, and the short is modeled. The short leg is modeled on the spot price series, with no funding, borrow, or liquidation. Its quoted returns are not achievable as stated on a live perpetual short.
The hurdle is a floor. We deflated for the ~96.2 million trials we counted, but the indicators themselves are the residue of decades of human data-mining, so the true search behind them is larger and our reported confidence is, if anything, optimistic.
DSR is an evidence level, not a signal. A 0.899 means "less likely to be luck than most," never "will make money."

9. What this does and doesn't mean

A million ways to combine our best strategies did not hand us a better strategy. They handed us a rule: pool two strong, nearly independent survivors, or keep the one you already have. That is not a disappointing answer, it is an honest one. Diversification smooths a curve; it does not manufacture an edge. Combining rarely beats your best idea and usually dilutes it, but the one time in four it helps is not luck, it is two strong, comparable, nearly independent legs, and you can check all three conditions before you merge a single pair.

So the move this data argues for is not "combine everything," it is "measure before you merge." Every gate here is a number the Rulyfi scanner already computes, the Deflated Sharpe of each idea, so you can see whether a pair earned its complexity or just added it. A full-export, which every plan can save at no extra cost, hands you the raw columns to run this exact composite math on your own scans, and the validation columns it uses come with any paid plan (see pricing); point it at your own two ideas and read whether the combination rose or diluted. For why single backtests inflate in the first place, see "Your Crypto Backtest Is Lying to You," and for how to read the individual metrics as a stack, see "How to Read Your Backtest Results." The tool worth trusting is the one that computes the luck-adjusted bar for you, not the one that promises to find winners.

Frequently asked questions

How do you combine two trading strategies? You can combine them in closed form from their return moments, with zero extra backtests, by pooling the two trade streams and reading the mixture's Sharpe straight out. But on a luck-adjusted basis (the Deflated Sharpe Ratio), combining improved on simply keeping the better strategy only about one time in four across all 1,131,321 pairs we combined. The other three in four diluted the stronger of the two.

Does diversification reduce overfitting? No. It reduces variance, not overfitting. Combining pools evidence, more trades and a lower luck hurdle, but it cannot raise per-trade quality, and each leg's selection bias is already baked in before you combine. On a deflated basis, pooling a strong strategy with a weak one usually lowers your confidence. The classic "diversification reduces risk" result comes from holding imperfectly correlated positions at the same time, a different operation from the alternating-trade pooling measured here.

Is a portfolio of many strategies safer than trading one? This study cannot answer that directly, because it measures adding one partner, one long plus one short on the same asset, not building a multi-strategy or multi-asset book. Within that scope, adding a partner helped only about one time in four, and only when the partner was itself strong and nearly independent; a weak partner diluted the stronger leg. Genuine multi-asset diversification, many imperfectly correlated positions held at once, is a different and often useful operation this study did not test.

Auto-trading and trading carry a risk of losing your principal. This article is educational, does not guarantee profit, and past backtest results do not predict future returns.

¹ Bailey, D. H., & López de Prado, M. (2014). "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality." The Journal of Portfolio Management, 40(5).
