What is schema drift in a backtest config and how does it fake results?

Schema drift happens when parameter field names in your config no longer match the runtime schema, so a permissive deserializer silently drops the unknown fields and falls back to defaults. In this case the config's leverage field was dropped and base_leverage defaulted to 10x instead of the intended 2x, making the backtest run a different strategy than the one written while still producing real-looking numbers.

How can a backtest show a textbook overfit signature that is actually wrong?

A Train PF of 2.08 versus OOS PF of 0.94 (ratio 2.21) matches the classic overfit pattern quant literature describes, so it is easy to accept. Here the signature was an artifact of a hidden 10x leverage default that amplified both numbers; once the parameters loaded correctly at 2x leverage, the same strategy gave Train 1.494 / OOS 1.478, ratio 1.01, which is stable rather than overfit.

Why does a dataclass from_dict() loader silently drop unknown fields in Python?

Python dataclasses have no built-in from_dict(); it is a hand-written classmethod. A typical implementation filters the input to known field names without raising on unknown keys, so misnamed fields simply vanish and the dataclass uses its defaults. Pydantic models also ignore extra fields by default unless the model config sets extra="forbid". Here, one missing raise statement let the leverage value silently default to 10x.

What is the fastest way to catch a parameter-loading bug before trusting a backtest?

Print the effective parameters right after deserialization, for example print(vars(params)) or printing leverage, sl_atr_mult, and tp_rr_ratio, and assert the key values match what you wrote. This five-second check would have exposed the leverage mismatch and saved seven follow-up experiments.

Why is single-asset backtest stability not enough to trust a trading strategy?

Even after the BTC 304-day result was stable at ratio 1.01, testing the same parameters across eight crypto pairs gave a ratio standard deviation (2.42) larger than the mean (1.82), which signals noise. The standout asset DOT looked strong until per-segment decomposition showed one lucky month carried the entire return, with a walk-forward IS/OOS ratio of 6.47 revealing real overfit.

Schema Bug Faked My Overfit Diagnosis: The Backtest Postmortem Nobody Talks About


The original report was clean. Train PF 2.08, OOS PF 0.94, ratio 2.21. Anyone who has read quant literature recognizes this signature — the optimizer fits noise that doesn’t repeat. Filed it as overfit, moved on.

Then came the follow-up experiments. And the discovery that the diagnosis itself was wrong.

This is the postmortem. The strategy isn’t where the bug is. The bug is in how we believed the numbers.


⚡ TL;DR

Original conclusion: Textbook overfit on BTC 304d (PF 2.08 → 0.94, ratio 2.21).

Real finding: Schema field mismatch. evolved_final_params.json used leverage / tp_atr_mult field names; current schema uses base_leverage / tp_rr_ratio. from_dict() silently dropped them. Actual run used default 10x leverage, not the evolved 2x.

Corrected result: PF 1.494 / 1.478, ratio 1.01. Boringly stable. Not overfit.

But also: Cross-asset test still shows break-even at best. DOT walk-forward IS/OOS ratio 6.47 — actual textbook overfit hiding in a “lucky segment” story.

Meta-lesson: Validate parameter loading before trusting backtest output. Five seconds of print(vars(params)) would have saved seven experiments.

The Original “Discovery”

We ran moss-trade-bot-skills v1.0.26 paper mode on BTC/USDC 15m bars, July 2025 → April 2026. 304 days, 29184 bars, 70/30 split.

The strategy was a mean-revert variant evolved by the framework’s parameter optimizer. The evolved configuration looked sensible: low trend weight, high mean-revert weight, conservative 2x leverage, symmetric sl/tp.

Backtest results came back clean:

Train (212 days): PF 2.08
OOS (92 days): PF 0.94
Ratio: 2.21

Train/OOS ratio above 2.0 is the textbook overfit signature. We filed it under “evolution found Q3-Q4 2025 specific noise, didn’t generalize.” Plausible story, matched the data, end of session.
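The arithmetic behind those headline numbers can be checked in a few lines. This is a minimal sketch: the helper names are ours, and flooring the train share to whole days is an assumption that happens to reproduce the 212/92 split reported above.

```python
BARS_PER_DAY = 24 * 60 // 15  # 96 fifteen-minute bars per day


def split_days(total_days: int, train_frac: float = 0.70) -> tuple[int, int]:
    """Return (train_days, oos_days), flooring the train share to whole days."""
    train_days = int(total_days * train_frac)
    return train_days, total_days - train_days


def pf_ratio(train_pf: float, oos_pf: float) -> float:
    """Train PF divided by OOS PF; values well above 1 suggest overfit."""
    return train_pf / oos_pf


total_bars = 304 * BARS_PER_DAY
train_days, oos_days = split_days(304)
headline_ratio = round(pf_ratio(2.08, 0.94), 2)
print(total_bars, train_days, oos_days, headline_ratio)
```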

The Follow-up That Broke the Story

The next day we tried multi-asset validation: the same evolved parameters on ETH, over the 148 days of ETH data we had. Expected pattern: if the parameters captured signal, they should generalize.

ETH ran. PF 1.154 → 0.697, ratio 1.66. Mild overfit, mostly consistent with our diagnosis.

Then we tested BTC on a shorter 148-day window matching ETH’s data range. Different sub-window of the same asset.

Result: PF 0.980 → 1.581, ratio 0.62. Reversed pattern. OOS better than Train.

That’s where the diagnosis started failing.

Same parameters, same asset, different time windows giving opposite patterns. Either the strategy is noise (true), or the windows have very different regimes (also true), or — and this is what we eventually checked — the parameters weren’t what we thought.

The Schema Drift

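Python dataclasses have no built-in from_dict(); it is a hand-written classmethod, and the typical version filters the input to known field names and silently drops everything else. Pydantic ignores extra fields by default too, unless the model config sets extra="forbid".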

The evolved configuration file contained:

{
  "leverage": 2,
  "sl_atr_mult": 2.5,
  "tp_atr_mult": 2.5,
  ...
}

The runtime DecisionParams schema expected:

base_leverage: float = 10.0
max_leverage: float = 40.0
sl_atr_mult: float = ...
tp_rr_ratio: float = ...

leverage → silently dropped → base_leverage falls back to its default of 10.0. tp_atr_mult → silently dropped → tp_rr_ratio falls back to its own default.

The “evolved 2x leverage with symmetric 2.5/2.5 ATR multipliers” we thought we were running was actually “default 10x leverage with whatever the default tp_rr_ratio is.”

Five seconds of print(vars(params)) after from_dict() would have shown this. We didn’t do it.
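The failure is easy to reproduce. Below is a minimal sketch of a permissive loader, not the framework's actual code: the class is a stand-in built from the schema excerpt above, and the 1.5 defaults for sl_atr_mult and tp_rr_ratio are placeholders because the real defaults are not shown in this post.

```python
from dataclasses import dataclass, fields


@dataclass
class DecisionParams:
    base_leverage: float = 10.0
    max_leverage: float = 40.0
    sl_atr_mult: float = 1.5  # placeholder default (real value not shown)
    tp_rr_ratio: float = 1.5  # placeholder default (real value not shown)

    @classmethod
    def from_dict_permissive(cls, d: dict) -> "DecisionParams":
        # Keeps known keys and silently discards the rest: the bug pattern.
        valid = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in d.items() if k in valid})


raw = {"leverage": 2, "sl_atr_mult": 2.5, "tp_atr_mult": 2.5}
params = DecisionParams.from_dict_permissive(raw)

# Keys present in the file that the schema never saw.
dropped = sorted(set(raw) - {f.name for f in fields(DecisionParams)})

print(vars(params))
print("dropped:", dropped)
```

The printout shows base_leverage at 10.0 even though the file asked for 2, and the dropped list names both misnamed keys.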

The Corrected Numbers

Same BTC 304d, same 70/30 split, same evolved parameters — but mapped correctly to current schema fields:

Train PF: 1.494
OOS PF: 1.478
Ratio: 1.01

That’s not overfit. That’s one of the most stable Train/OOS ratios we’ve ever seen.

The strategy isn’t broken. The diagnosis was broken.

What Was Still True

The corrected results are stable on BTC 304d, but cross-asset testing tells a less flattering story.

Eight crypto pairs, same 148-day window, same corrected parameters:

Asset	Train PF	OOS PF	Ratio
ETH	1.154	0.697	1.66
BNB	1.512	0.213	7.10
AVAX	0.581	1.302	0.45
LINK	1.055	0.519	2.03
ARB	0.628	1.527	0.41
DOT	1.647	1.907	0.86
NEAR	0.358	1.415	0.25

Across the seven pairs in the table, the sample standard deviation of the ratio (2.42) exceeds its mean (1.82). When the spread of a metric is larger than its central tendency, you’re looking at noise.
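Both figures can be recomputed from the Ratio column of the table. A short sketch using only the standard library, with the sample (n - 1) standard deviation, which is the variant that matches the 2.42 quoted here:

```python
from statistics import mean, stdev

# Train/OOS ratios copied from the table above.
ratios = {
    "ETH": 1.66, "BNB": 7.10, "AVAX": 0.45, "LINK": 2.03,
    "ARB": 0.41, "DOT": 0.86, "NEAR": 0.25,
}

values = list(ratios.values())
ratio_mean = mean(values)
ratio_stdev = stdev(values)  # sample standard deviation (n - 1)
spread_over_mean = ratio_stdev / ratio_mean

print(f"mean={ratio_mean:.2f} stdev={ratio_stdev:.2f} stdev/mean={spread_over_mean:.2f}")
```

A stdev/mean value above 1 is the noise signal described in the paragraph above.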

DOT looked like the standout — Train 1.65, OOS 1.91, both strong. But splitting DOT’s 148 days into five ~30-day segments revealed Segment 1 alone (Oct-Nov 2025) carried PF 7.82 and the entire +1.67% return. The other four segments combined to -0.68%. The “cross-asset alpha” was one lucky month.

A walk-forward test confirmed: Segment 1 as in-sample, Segments 2-5 as out-of-sample. IS PF 7.82 → OOS PF 1.21. IS/OOS ratio 6.47 — the actual textbook overfit hiding inside an asset where the surface-level numbers looked good.
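The decomposition itself is simple to script. The sketch below assumes you have per-trade PnLs grouped by segment. The trade values are invented for illustration and are not the DOT trades. Profit factor is taken in its usual sense of gross profit divided by gross loss.

```python
def profit_factor(pnls: list[float]) -> float:
    """Gross profit / gross loss; inf when the list contains no losses."""
    gross_profit = sum(p for p in pnls if p > 0)
    gross_loss = -sum(p for p in pnls if p < 0)
    return float("inf") if gross_loss == 0 else gross_profit / gross_loss


def segment_report(segments: list[list[float]]) -> list[tuple[float, float]]:
    """Return (profit factor, net PnL) for each segment, in order."""
    return [(profit_factor(s), sum(s)) for s in segments]


# Invented per-trade PnLs (percent of equity), five segments.
segments = [
    [0.9, 0.8, -0.2, 0.4],  # segment 1: the "lucky month"
    [0.3, -0.4, 0.1],
    [-0.3, 0.2, -0.1],
    [0.2, -0.3],
    [0.1, -0.2, 0.1],
]

for i, (pf, net) in enumerate(segment_report(segments), start=1):
    print(f"segment {i}: PF={pf:.2f} net={net:+.2f}%")

# Walk-forward view: segment 1 in-sample, segments 2-5 pooled out-of-sample.
is_pf = profit_factor(segments[0])
oos_pf = profit_factor([p for s in segments[1:] for p in s])
print(f"IS PF={is_pf:.2f} OOS PF={oos_pf:.2f} ratio={is_pf / oos_pf:.2f}")
```

If one segment's PF dwarfs the rest and the pooled remainder hovers near or below 1, the whole-window summary is hiding a lucky segment.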

The Defenses

Three layers, in order of effort/value:

1. Strict deserialization. Make your parameter loader reject unknown fields. In Python:

from dataclasses import dataclass, fields

@dataclass(frozen=True, kw_only=True)  # kw_only requires Python 3.10+
class DecisionParams:
    base_leverage: float = 10.0
    # ...

    @classmethod
    def from_dict(cls, d: dict) -> "DecisionParams":
        # Reject any key the schema does not declare instead of dropping it.
        valid = {f.name for f in fields(cls)}
        unknown = set(d) - valid
        if unknown:
            raise ValueError(f"Unknown fields: {sorted(unknown)}")
        # Every key is known at this point, so no filtering is needed.
        return cls(**d)

The original from_dict() filtered to valid fields without raising on unknown fields. One missing raise cost seven experiments.

2. Print effective params before backtest. Three lines:

params = DecisionParams.from_dict(raw)
print(f"Effective: leverage={params.base_leverage}, sl={params.sl_atr_mult}, tp={params.tp_rr_ratio}")
assert params.base_leverage == raw.get("base_leverage", raw.get("leverage")), "leverage mismatch"

3. Pin parameter file schema version. When the framework’s schema changes, old parameter files should fail loudly, not silently degrade.
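One way to pin the version is to wrap the parameters in an envelope and check it at load time. This is a sketch under assumptions: the schema_version and params keys and the SCHEMA_VERSION constant are hypothetical and are not part of moss-trade-bot-skills.

```python
import json

SCHEMA_VERSION = 2  # hypothetical: bump whenever a field is renamed or removed


def load_params_file(text: str) -> dict:
    """Parse a parameter file and fail loudly on a version mismatch."""
    doc = json.loads(text)
    found = doc.get("schema_version")
    if found != SCHEMA_VERSION:
        raise ValueError(
            f"parameter file schema_version={found!r}, expected {SCHEMA_VERSION}"
        )
    return doc["params"]


current = '{"schema_version": 2, "params": {"base_leverage": 2.0}}'
legacy = '{"leverage": 2, "sl_atr_mult": 2.5}'  # old file with no version tag

print(load_params_file(current))
try:
    load_params_file(legacy)
except ValueError as exc:
    print("rejected:", exc)
```

A file written before the rename carries no version tag, so it is rejected before any default can take over.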

The New “Seven Don’ts”: Now Thirteen

The original seven backtest discipline rules grew to thirteen after this incident. The six new ones come directly from these experiments:

Don’t trust experiments without schema validation. Print params before backtest.
Don’t make calls on datasets under 200 trading days. 148-day sub-windows of the same asset gave opposite diagnoses.
Don’t accept PF > 3 with under 30 trades. Default red flag.
Don’t ship strategies without cross-asset validation. Single-asset stability is necessary, not sufficient.
Don’t ignore stdev/mean ratio. Above 1 means noise, no matter how good the mean looks.
Don’t report PF without per-segment decomposition. Single-window summaries hide lucky-segment artifacts.
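Several of these rules are mechanical enough to run as a gate before a result is written up. A rough sketch follows: the function name and flag wording are ours, the thresholds come from the list above, and the trade count in the example is invented.

```python
from statistics import mean, stdev


def red_flags(pf: float, n_trades: int, days: int,
              asset_ratios: list[float]) -> list[str]:
    """Return the discipline rules that a backtest result violates."""
    flags = []
    if days < 200:
        flags.append("dataset under 200 days")
    if pf > 3 and n_trades < 30:
        flags.append("PF > 3 with under 30 trades")
    if len(asset_ratios) < 2:
        flags.append("no cross-asset validation")
    elif stdev(asset_ratios) / mean(asset_ratios) > 1:
        flags.append("cross-asset ratio stdev exceeds mean")
    return flags


# Ratios from the cross-asset table; PF 7.82 is DOT segment 1.
# The trade count of 12 is invented for illustration.
table_ratios = [1.66, 7.10, 0.45, 2.03, 0.41, 0.86, 0.25]
dot_flags = red_flags(pf=7.82, n_trades=12, days=30, asset_ratios=table_ratios)
print(dot_flags)
```

An empty list does not prove the result is sound. It only means none of these automated checks fired.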

The Hard Part

The original report sat in our archive for a day before the follow-up exposed it. If we had stopped at “Train PF 2.08 → OOS 0.94, ratio 2.21” we would have shared a confidently wrong diagnosis. The numbers were real. The story we told around them wasn’t.

Backtest results are easy to produce, easy to summarize, easy to share. Validating the assumptions behind the numbers is harder, slower, and less rewarding. But it’s the only step that distinguishes “we ran a thing and here’s what happened” from “we know what happened.”

If you only take one habit from this postmortem: print your effective params before every backtest. Five seconds. Saves seven experiments.

Recommended Infrastructure

For walk-forward + multi-asset experiment scaffolding:

DigitalOcean — $200 credit, easy GPU/CPU droplets
HTStack — Hong Kong VPS, low-latency to Asia exchange APIs

Affiliate links — same price, supports dibi8.com.
