Schema Bug Faked My Overfit Diagnosis: The Backtest Postmortem Nobody Talks About

Ran 7 quant experiments, found "textbook overfit" (Train PF 2.08 → OOS 0.94, ratio 2.21). Then discovered the diagnosis itself was wrong — silent schema field mismatch made the optimizer run with default 10x leverage instead of the evolved 2x. The corrected version is healthy (ratio 1.01). The meta-lesson is uglier than the original.

  • Python
  • pandas
  • numpy
  • vectorbt
  • backtrader
  • pydantic
  • MIT
  • Updated 2026-05-26


Schema Bug Faked My Overfit Diagnosis #


The original report was clean. Train PF 2.08, OOS PF 0.94, ratio 2.21. Anyone who has read quant literature recognizes this signature — the optimizer fits noise that doesn’t repeat. Filed it as overfit, moved on.

Then came the follow-up experiments. And the discovery that the diagnosis itself was wrong.

This is the postmortem. The strategy isn’t where the bug is. The bug is in how we believed the numbers.

⚡ TL;DR #

Original conclusion: Textbook overfit on BTC 304d (PF 2.08 → 0.94, ratio 2.21).

Real finding: Schema field mismatch. evolved_final_params.json used leverage / tp_atr_mult field names; current schema uses base_leverage / tp_rr_ratio. from_dict() silently dropped them. Actual run used default 10x leverage, not the evolved 2x.

Corrected result: PF 1.494 / 1.478, ratio 1.01. Boringly stable. Not overfit.

But also: Cross-asset test still shows break-even at best. DOT walk-forward IS/OOS ratio 6.47 — actual textbook overfit hiding in a “lucky segment” story.

Meta-lesson: Validate parameter loading before trusting backtest output. Five seconds of print(vars(params)) would have saved seven experiments.

The Original “Discovery” #

We ran moss-trade-bot-skills v1.0.26 paper mode on BTC/USDC 15m bars, July 2025 → April 2026. 304 days, 29184 bars, 70/30 split.
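The bar count and split sizes above follow from simple arithmetic. A quick sanity check, assuming the 70/30 split was made on whole days (the post does not say whether it was cut by day or by bar):

```python
# Sanity-check the dataset size quoted above.
MINUTES_PER_DAY = 24 * 60
BAR_MINUTES = 15
DAYS = 304

bars_per_day = MINUTES_PER_DAY // BAR_MINUTES  # 15m bars per day
total_bars = DAYS * bars_per_day

# 70/30 split on whole days (assumption); integer math avoids float rounding
train_days = DAYS * 70 // 100
oos_days = DAYS - train_days

print(bars_per_day, total_bars, train_days, oos_days)
```

If the printed totals disagree with what the data loader reports, bars are missing or duplicated before the strategy ever runs.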

The strategy was a mean-revert variant evolved by the framework’s parameter optimizer. The evolved configuration looked sensible: low trend weight, high mean-revert weight, conservative 2x leverage, symmetric sl/tp.

Backtest results came back clean:

  • Train (212 days): PF 2.08
  • OOS (92 days): PF 0.94
  • Ratio: 2.21

Train/OOS ratio above 2.0 is the textbook overfit signature. We filed it under “evolution found Q3-Q4 2025 specific noise, didn’t generalize.” Plausible story, matched the data, end of session.
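For reference, here is a minimal sketch of the two numbers involved, assuming PF means the usual profit factor (gross profit divided by gross loss). The function names and the toy PnL list are illustrative, not taken from the framework:

```python
def profit_factor(pnls):
    """Gross profit divided by gross loss over per-trade PnLs."""
    gross_profit = sum(p for p in pnls if p > 0)
    gross_loss = -sum(p for p in pnls if p < 0)
    if gross_loss == 0:
        # No losing trades: PF is unbounded (or undefined with no trades at all)
        return float("inf") if gross_profit > 0 else float("nan")
    return gross_profit / gross_loss


def train_oos_ratio(train_pf, oos_pf):
    """Values well above 1 mean the train window looked better than OOS."""
    return train_pf / oos_pf


toy_pf = profit_factor([3.0, -1.0, 2.0, -1.5])  # made-up trades
ratio = train_oos_ratio(2.08, 0.94)             # figures from the report
print(toy_pf, round(ratio, 2))
```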

The Follow-up That Broke the Story #

The next day we tried multi-asset validation — same evolved parameters on ETH for the same window. Expected pattern: if parameters captured signal, they should generalize.

ETH ran. PF 1.154 → 0.697, ratio 1.66. Mild overfit, mostly consistent with our diagnosis.

Then we tested BTC on a shorter 148-day window matching ETH’s data range. Different sub-window of the same asset.

Result: PF 0.980 → 1.581, ratio 0.62. Reversed pattern. OOS better than Train.

That’s where the diagnosis started failing.

Same parameters, same asset, different time windows giving opposite patterns. Either the strategy is noise (true), or the windows have very different regimes (also true), or — and this is what we eventually checked — the parameters weren’t what we thought.

The Schema Drift #

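Python’s standard dataclasses have no built-in from_dict(). The common hand-written pattern filters the input dict down to known field names before calling the constructor, so unknown fields are silently dropped. Pydantic behaves the same way by default (extra fields are ignored) unless the model config sets extra="forbid". Pydantic’s strict mode is a separate setting that controls type coercion, not unknown fields.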

The evolved configuration file contained:

{
  "leverage": 2,
  "sl_atr_mult": 2.5,
  "tp_atr_mult": 2.5,
  ...
}

The runtime DecisionParams schema expected:

base_leverage: float = 10.0
max_leverage: float = 40.0
sl_atr_mult: float = ...
tp_rr_ratio: float = ...

leverage → silently dropped → base_leverage stays at its default of 10.0. tp_atr_mult → silently dropped → tp_rr_ratio stays at its default. Only sl_atr_mult matched a current field name and was actually applied.

The “evolved 2x leverage with symmetric 2.5/2.5 ATR multipliers” we thought we were running was actually “default 10x leverage with whatever the default tp_rr_ratio is.”

Five seconds of print(vars(params)) after from_dict() would have shown this. We didn’t do it.
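The failure mode is easy to reproduce. The sketch below is a stand-in for the framework's loader, not its actual code: the class body, the method name from_dict_lenient, and the sl_atr_mult / tp_rr_ratio defaults are placeholders, since the post elides the real values.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DecisionParams:
    base_leverage: float = 10.0
    max_leverage: float = 40.0
    sl_atr_mult: float = 1.5  # placeholder default, not from the post
    tp_rr_ratio: float = 2.0  # placeholder default, not from the post

    @classmethod
    def from_dict_lenient(cls, d):
        # Keeps known keys and silently discards everything else
        valid = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in d.items() if k in valid})


raw = {"leverage": 2, "sl_atr_mult": 2.5, "tp_atr_mult": 2.5}
params = DecisionParams.from_dict_lenient(raw)
dropped = sorted(set(raw) - {f.name for f in fields(DecisionParams)})

print(vars(params))        # base_leverage is still 10.0
print("dropped:", dropped)
```

The load succeeds without a warning, which is exactly the problem: nothing in the output hints that two of the three keys never reached the strategy.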

The Corrected Numbers #

Same BTC 304d, same 70/30 split, same evolved parameters — but mapped correctly to current schema fields:

  • Train PF: 1.494
  • OOS PF: 1.478
  • Ratio: 1.01

That’s not overfit. That’s one of the most stable Train/OOS ratios we’ve ever seen.

The strategy isn’t broken. The diagnosis was broken.

What Was Still True #

The corrected results are stable on BTC 304d, but cross-asset testing tells a less flattering story.

Seven crypto pairs, same 148-day window, same corrected parameters:

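| Asset | Train PF | OOS PF | Ratio |
|-------|----------|--------|-------|
| ETH   | 1.154    | 0.697  | 1.66  |
| BNB   | 1.512    | 0.213  | 7.10  |
| AVAX  | 0.581    | 1.302  | 0.45  |
| LINK  | 1.055    | 0.519  | 2.03  |
| ARB   | 0.628    | 1.527  | 0.41  |
| DOT   | 1.647    | 1.907  | 0.86  |
| NEAR  | 0.358    | 1.415  | 0.25  |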

Ratio stdev (2.42) exceeds mean (1.82). When the spread of a metric is larger than its central tendency, you’re looking at noise.
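The two figures can be reproduced from the Ratio column of the table with the standard library. Note that 2.42 is the sample standard deviation (n - 1 in the denominator), which is what statistics.stdev computes:

```python
from statistics import mean, stdev

# Train/OOS ratios from the cross-asset table
ratios = {
    "ETH": 1.66, "BNB": 7.10, "AVAX": 0.45, "LINK": 2.03,
    "ARB": 0.41, "DOT": 0.86, "NEAR": 0.25,
}

values = list(ratios.values())
ratio_mean = mean(values)
ratio_stdev = stdev(values)           # sample standard deviation
dispersion = ratio_stdev / ratio_mean  # above 1 means spread exceeds the mean

print(round(ratio_mean, 2), round(ratio_stdev, 2), round(dispersion, 2))
```

With only seven data points, a single outlier such as BNB dominates both statistics, which is one more reason to read the mean as uninformative here.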

DOT looked like the standout — Train 1.65, OOS 1.91, both strong. But splitting DOT’s 148 days into five ~30-day segments revealed Segment 1 alone (Oct-Nov 2025) carried PF 7.82 and the entire +1.67% return. The other four segments combined to -0.68%. The “cross-asset alpha” was one lucky month.

A walk-forward test confirmed: Segment 1 as in-sample, Segments 2-5 as out-of-sample. IS PF 7.82 → OOS PF 1.21. IS/OOS ratio 6.47 — the actual textbook overfit hiding inside an asset where the surface-level numbers looked good.
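A per-segment breakdown like the one described can be sketched in a few lines. The trade list below is synthetic and made up purely to show the shape of the check; it is not the DOT data, and the function names are illustrative.

```python
from collections import defaultdict


def profit_factor(pnls):
    gross_profit = sum(p for p in pnls if p > 0)
    gross_loss = -sum(p for p in pnls if p < 0)
    if gross_loss == 0:
        return float("inf") if gross_profit > 0 else float("nan")
    return gross_profit / gross_loss


def pf_by_segment(trades, total_days, n_segments):
    """trades: iterable of (day_index, pnl) with 0 <= day_index < total_days."""
    buckets = defaultdict(list)
    for day, pnl in trades:
        # Map each day to one of n_segments equal-width time slices
        buckets[day * n_segments // total_days].append(pnl)
    return [profit_factor(buckets[i]) for i in range(n_segments)]


# Synthetic trades: one strong first segment, four mediocre ones
toy_trades = [
    (3, 4.0), (10, 3.0), (20, -1.0),
    (35, 1.0), (50, -1.0),
    (65, -2.0), (80, 1.0),
    (95, 1.0), (110, -1.0),
    (125, 0.5), (140, -1.0),
]

per_segment = pf_by_segment(toy_trades, total_days=148, n_segments=5)
overall = profit_factor([pnl for _, pnl in toy_trades])
print(per_segment, overall)
```

In the toy data the overall PF looks respectable while only the first segment is above 1, which is the same pattern the DOT decomposition exposed.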

The Defenses #

Three layers, in order of effort/value:

1. Strict deserialization. Make your parameter loader reject unknown fields. In Python:

from dataclasses import dataclass, fields

@dataclass(frozen=True, kw_only=True)  # kw_only requires Python 3.10+
class DecisionParams:
    base_leverage: float = 10.0
    # ...

    @classmethod
    def from_dict(cls, d: dict) -> "DecisionParams":
        # Names of every field declared on the dataclass
        valid = {f.name for f in fields(cls)}
        unknown = set(d) - valid
        if unknown:
            # Fail loudly instead of silently falling back to defaults
            raise ValueError(f"Unknown fields: {sorted(unknown)}")
        return cls(**d)

The original from_dict() filtered to valid fields without raising on unknown fields. One missing raise cost seven experiments.

2. Print effective params before backtest. A few lines:

params = DecisionParams.from_dict(raw)
print(f"Effective: leverage={params.base_leverage}, sl={params.sl_atr_mult}, tp={params.tp_rr_ratio}")
# Compare against whichever leverage key the file carries (current or legacy name)
requested = raw.get("base_leverage", raw.get("leverage"))
assert requested is None or params.base_leverage == requested, "leverage mismatch"

3. Pin parameter file schema version. When the framework’s schema changes, old parameter files should fail loudly, not silently degrade.
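One way to pin the version is a field inside the parameter file that the loader checks before anything else. This is a sketch under assumptions: the schema_version key, the version number 2, and the names load_params_file and SchemaVersionError are all hypothetical, not part of the framework.

```python
import json

CURRENT_SCHEMA_VERSION = 2  # hypothetical version number


class SchemaVersionError(ValueError):
    """Raised when a parameter file was written for a different schema."""


def load_params_file(text):
    raw = json.loads(text)
    # Files written before versioning existed have no key and are rejected too
    version = raw.pop("schema_version", None)
    if version != CURRENT_SCHEMA_VERSION:
        raise SchemaVersionError(
            f"parameter file has schema_version={version!r}, "
            f"loader expects {CURRENT_SCHEMA_VERSION}"
        )
    return raw


old_file = '{"leverage": 2, "sl_atr_mult": 2.5, "tp_atr_mult": 2.5}'
try:
    load_params_file(old_file)
    rejected = False
except SchemaVersionError as exc:
    rejected = True
    print(exc)
```

Treating a missing version as a mismatch is deliberate: the files most likely to carry stale field names are the ones written before the version field was introduced.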

The New “Seven Don’ts” — Now Thirteen #

The original seven backtest discipline rules grew to thirteen after this incident. The six new ones come directly from these experiments:

  • Don’t trust experiments without schema validation. Print params before backtest.
  • Don’t make calls on datasets under 200 trading days. A 148-day sub-window of BTC gave the opposite diagnosis from the full 304-day window.
  • Don’t accept PF > 3 with under 30 trades. Default red flag.
  • Don’t ship strategies without cross-asset validation. Single-asset stability is necessary, not sufficient.
  • Don’t ignore stdev/mean ratio. Above 1 means noise, no matter how good the mean looks.
  • Don’t report PF without per-segment decomposition. Single-window summaries hide lucky-segment artifacts.
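The three numeric rules in this list are simple enough to automate. A minimal sketch, assuming the thresholds exactly as stated above; the function name is illustrative and the trade counts in the usage lines are hypothetical, since the post does not report them:

```python
def backtest_red_flags(pf, n_trades, window_days, ratio_stdev=None, ratio_mean=None):
    """Return the list of numeric rules a backtest summary violates."""
    flags = []
    if window_days < 200:
        flags.append("window under 200 days")
    if pf > 3 and n_trades < 30:
        flags.append("PF > 3 with under 30 trades")
    if ratio_stdev is not None and ratio_mean and ratio_stdev / ratio_mean > 1:
        flags.append("stdev/mean of Train/OOS ratios above 1")
    return flags


# DOT Segment 1 style summary; n_trades=12 is a hypothetical count
segment_flags = backtest_red_flags(
    pf=7.82, n_trades=12, window_days=30, ratio_stdev=2.42, ratio_mean=1.82
)
# Corrected BTC 304d style summary; n_trades=200 is a hypothetical count
clean_flags = backtest_red_flags(pf=1.49, n_trades=200, window_days=304)
print(segment_flags, clean_flags)
```

The other three rules (schema validation, cross-asset runs, per-segment decomposition) are process steps rather than thresholds, so they are not covered by this check.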

The Hard Part #

The original report sat in our archive for a day before the follow-up exposed it. If we had stopped at “Train PF 2.08 → OOS 0.94, ratio 2.21” we would have shared a confidently wrong diagnosis. The numbers were real. The story we told around them wasn’t.

Backtest results are easy to produce, easy to summarize, easy to share. Validating the assumptions behind the numbers is harder, slower, and less rewarding. But it’s the only step that distinguishes “we ran a thing and here’s what happened” from “we know what happened.”

If you only take one habit from this postmortem: print your effective params before every backtest. Five seconds. Saves seven experiments.


