Backtest Engine and Statistical Analysis
1. Backtesting Philosophy
A backtest is a validation tool, not an optimization tool. The main risks:
- Overfitting: fitting parameters to historical data → the model "memorizes" the past instead of learning
- Lookahead bias: using future information not available at signal time
- Survivorship bias: testing only on assets that survived
- Transaction cost underestimation: ignoring slippage, fees, market impact
- Data snooping: testing many models on the same data → at least one will be profitable by chance
Our backtest uses the same codebase as the live strategies: the same
OrderflowEngine, the same RegimeDetector, the same strategies. The
only mock is the OrderManager (which records fills instead of sending
them to Deribit).
2. BacktestEngine — Architecture
2.1 Execution Flow
BacktestEngine.run(strategy, symbol, start_date, end_date)
│
├── _load_candles() → loads from DuckDB or uses provided candle_data
│   └── _load_from_duckdb(): SQL on agg_trades table → 1m OHLCV
│
├── MockOrderManager() → replaces strategy.order_manager
│
├── For each candle_i in candle_data:
│   ├── _feed_candle_to_engine(of_engine, candle)
│   │   └── Simulates 2 aggTrades: buy_volume + sell_volume
│   │
│   ├── _check_exits(open_trades, high, low, close)
│   │   ├── LONG:  if low <= sl_price → exit at sl_price ("sl")
│   │   │          if high >= tp_price → exit at tp_price ("tp")
│   │   └── SHORT: mirrored
│   │
│   └── if i % scan_interval_candles == 0 and i >= 30:
│       ├── regime_detector.detect(candle_history) → updates regime
│       └── strategy.scan() → signals
│           └── For every signal: create BacktestTrade, add to open_trades
│
├── Close all open trades at end of data ("end_of_data")
└── BacktestMetrics.compute(trades) → performance metrics
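The exit check in the flow above can be sketched for a single trade as follows. This is a minimal sketch, not the engine's actual code: `OpenTrade` and `check_exit` are hypothetical names, and the stop is tested before the target simply because the diagram lists it first, which also happens to be the conservative choice when one candle touches both levels.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class OpenTrade:
    """Hypothetical stand-in for an open backtest trade."""
    direction: str   # "long" or "short"
    sl_price: float
    tp_price: float


def check_exit(trade: OpenTrade, high: float, low: float) -> Optional[Tuple[float, str]]:
    """Return (exit_price, reason) if the candle range hits SL or TP, else None."""
    if trade.direction == "long":
        if low <= trade.sl_price:      # stop checked first
            return trade.sl_price, "sl"
        if high >= trade.tp_price:
            return trade.tp_price, "tp"
    else:                              # short: mirrored
        if high >= trade.sl_price:
            return trade.sl_price, "sl"
        if low <= trade.tp_price:
            return trade.tp_price, "tp"
    return None
```

With 1-minute candles the intrabar order of high and low is unknown, so any rule for candles that touch both levels is an assumption of the simulation.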
2.2 Fill Simulation
def execute_generic_trade(self, ..., price, entry_type="market", ...):
    fill_price = price
    if entry_type == "market":
        # Slippage: the market moves against you at fill time
        slippage = price * self.slippage_pct  # default 0.05%
        fill_price += slippage if direction == "buy" else -slippage
        # Deribit futures taker fee: 0.05%
        # Applied in _compute_pnl()
Deribit fee structure:
Taker fee (market order): 0.05% of notional
Maker fee (limit order): 0.03% of notional
Fee per trade (entry + exit): 2 * 0.05% = 0.10% (round trip)
On BTC-PERPETUAL at $64500 with 0.01 BTC:
Notional = 64500 * 0.01 = $645
Entry fee = 645 * 0.0005 = $0.32
Exit fee = 645 * 0.0005 = $0.32
Total fee = $0.64 per round trip
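The slippage and fee arithmetic above can be reproduced with a small sketch. The function names are hypothetical and the defaults (0.05% slippage, 0.05% taker fee on both legs) are the values quoted in this section, not values read from the codebase.

```python
def fill_price_with_slippage(price: float, direction: str,
                             slippage_pct: float = 0.0005) -> float:
    """Market fill: buys fill higher, sells fill lower (slippage works against you)."""
    slippage = price * slippage_pct
    return price + slippage if direction == "buy" else price - slippage


def round_trip_fee(entry_price: float, exit_price: float, quantity: float,
                   fee_rate: float = 0.0005) -> float:
    """Fee on entry notional plus fee on exit notional, both at the same rate."""
    entry_fee = entry_price * quantity * fee_rate
    exit_fee = exit_price * quantity * fee_rate
    return entry_fee + exit_fee


# Worked example from the text: 0.01 BTC at $64500, entry and exit at the same price
fee = round_trip_fee(64500.0, 64500.0, 0.01)   # about $0.645
```

Unrounded, the round-trip fee is $0.645; the figures in the breakdown above are each leg rounded to cents.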
2.3 SQL for OHLCV Reconstruction from Ticks
The backtest loads historical data from DuckDB, aggregating ticks into 1-minute candles:
SELECT
    -- // is integer division; in current DuckDB versions / is float division,
    -- which would leave the timestamp unbucketed
    (timestamp_ms // 60000) * 60000 as bucket_ts,
    FIRST(price ORDER BY timestamp_ms) as open,
    MAX(price) as high,
    MIN(price) as low,
    LAST(price ORDER BY timestamp_ms) as close,
    SUM(quantity) as volume,
    SUM(CASE WHEN NOT is_buyer_maker THEN quantity ELSE 0 END) as buy_volume,
    SUM(CASE WHEN is_buyer_maker THEN quantity ELSE 0 END) as sell_volume,
    COUNT(*) as trade_count
FROM agg_trades
WHERE symbol = 'BTCUSDT'
  AND timestamp_ms BETWEEN ? AND ?
GROUP BY bucket_ts
ORDER BY bucket_ts
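The query uses FIRST() and LAST() with an ORDER BY inside the aggregate
instead of MIN(timestamp_ms)/MAX(timestamp_ms) subqueries. This relies on
DuckDB's FIRST()/LAST() aggregates, and the explicit ordering is what
guarantees the correct open/close even when trades are stored out of order.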
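To make the aggregation logic concrete, here is the same bucketing expressed in plain Python. It is an illustrative sketch only: `ticks_to_candles` and the tuple layout are hypothetical and are not part of the backtest code, which runs the SQL above.

```python
from typing import Dict, List, Tuple

# (timestamp_ms, price, quantity, is_buyer_maker)
Tick = Tuple[int, float, float, bool]


def ticks_to_candles(ticks: List[Tick], bucket_ms: int = 60_000) -> List[dict]:
    """Aggregate ticks into OHLCV candles, one per bucket_ms window."""
    candles: Dict[int, dict] = {}
    # Sorting by timestamp plays the role of ORDER BY timestamp_ms inside
    # FIRST()/LAST(): open and close stay correct for out-of-order input.
    for ts, price, qty, is_buyer_maker in sorted(ticks, key=lambda t: t[0]):
        bucket = (ts // bucket_ms) * bucket_ms   # integer division
        c = candles.get(bucket)
        if c is None:
            c = candles[bucket] = {
                "bucket_ts": bucket, "open": price, "high": price,
                "low": price, "close": price, "volume": 0.0,
                "buy_volume": 0.0, "sell_volume": 0.0, "trade_count": 0,
            }
        c["high"] = max(c["high"], price)
        c["low"] = min(c["low"], price)
        c["close"] = price                        # last tick in time order
        c["volume"] += qty
        # Buyer is maker means the aggressor was a seller
        if is_buyer_maker:
            c["sell_volume"] += qty
        else:
            c["buy_volume"] += qty
        c["trade_count"] += 1
    return [candles[k] for k in sorted(candles)]
```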
3. Performance Metrics
3.1 Return and Drawdown
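Total Return:
    Total Return = (Equity_final - Equity_initial) / Equity_initial
Maximum Drawdown (MDD):
    MDD = max_t [ (Peak(t) - Equity(t)) / Peak(t) ]
where Peak(t) = max_{s <= t} Equity(s) (the running maximum up to t).
Calmar Ratio (drawdown-adjusted return):
    Calmar = Annualized Return / MDD
Calmar > 1: the annualized return exceeds the maximum drawdown, which indicates good risk management.
3.2 Sharpe Ratio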
The Sharpe Ratio measures excess return per unit of risk:
    Sharpe = (E[R] - R_f) / sigma_R
In intraday trading the risk-free rate R_f is approximately 0 on short
timeframes. We compute the Sharpe on R-multiples (returns normalized
by each trade's risk):
R_i = (exit_price - entry_price) / |entry_price - sl_price| (long)
R_i = (entry_price - exit_price) / |sl_price - entry_price| (short)
Industry benchmarks:
- Sharpe < 0.5: weak strategy
- 0.5 - 1.0: acceptable
- 1.0 - 2.0: good
- > 2.0: excellent (hard to sustain over long periods)
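The return, drawdown, and R-multiple Sharpe definitions from sections 3.1 and 3.2 can be sketched in plain Python. All function names here are hypothetical. The sketch assumes the standard definitions (drawdown as a fraction of the running peak), R_f = 0, and a population standard deviation, which matches the default of the `np.std` call used in the Monte Carlo code later in this document.

```python
import math
from typing import Sequence


def total_return(equity: Sequence[float]) -> float:
    """(final - initial) / initial."""
    return (equity[-1] - equity[0]) / equity[0]


def max_drawdown(equity: Sequence[float]) -> float:
    """Largest fractional drop from the running peak."""
    peak = equity[0]
    mdd = 0.0
    for value in equity:
        peak = max(peak, value)
        mdd = max(mdd, (peak - value) / peak)
    return mdd


def r_multiple(entry: float, exit_price: float, sl: float, direction: str) -> float:
    """Trade result in units of the initial risk |entry - sl|."""
    move = exit_price - entry if direction == "long" else entry - exit_price
    return move / abs(entry - sl)


def sharpe_r(r_multiples: Sequence[float]) -> float:
    """Mean R divided by the population std of R (R_f assumed to be 0)."""
    n = len(r_multiples)
    mean = sum(r_multiples) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in r_multiples) / n)
    return mean / std if std > 0 else 0.0


equity = [10000.0, 11000.0, 9900.0, 12000.0]
# total_return(equity) is 0.2; max_drawdown(equity) is 0.1 (11000 -> 9900)
```

Note that this Sharpe is per trade, not annualized, so it is not directly comparable to annualized figures unless it is scaled by the trade frequency.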
3.3 Sortino Ratio
The Sortino Ratio only penalizes negative volatility (downside risk):
    Sortino = (E[R] - MAR) / sigma_downside
where sigma_downside = sqrt(E[min(R-MAR, 0)^2]) is the standard
deviation of returns below the MAR (Minimum Acceptable Return, often 0).
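Sortino is preferable to Sharpe for asymmetric strategies with positive skew (many small losses, rare large wins): the large wins inflate total volatility and depress the Sharpe, while the Sortino only counts deviations below the MAR.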
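# Requires: import numpy as np; from typing import List
@staticmethod
def sortino_ratio(r_multiples: List[float], mar: float = 0.0) -> float:
    returns = np.asarray(r_multiples, dtype=float)
    if returns.size == 0:
        return 0.0
    # Downside deviation: min(R - MAR, 0), averaged over ALL trades
    downside = np.minimum(returns - mar, 0.0)
    sigma_down = np.sqrt(np.mean(downside ** 2))
    excess = float(np.mean(returns) - mar)  # numerator is measured against the MAR
    if sigma_down == 0:
        # No return below the MAR: the ratio is unbounded
        return float("inf") if excess > 0 else 0.0
    return excess / sigma_down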
3.4 Profit Factor
    PF = Gross Profit / Gross Loss = sum(winning PnL) / |sum(losing PnL)|
- PF > 1.0: profitable strategy
- PF > 1.5: good
- PF > 2.0: excellent
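As a minimal sketch, the Profit Factor can be computed from a list of per-trade PnL values, using the standard definition (gross profit divided by gross loss). The function name and the handling of the no-loss case are assumptions for illustration.

```python
from typing import Sequence


def profit_factor(pnls: Sequence[float]) -> float:
    """Gross profit divided by the absolute gross loss."""
    gross_profit = sum(p for p in pnls if p > 0)
    gross_loss = -sum(p for p in pnls if p < 0)
    if gross_loss == 0:
        # No losing trades: the ratio is unbounded (or 0 if there are no wins either)
        return float("inf") if gross_profit > 0 else 0.0
    return gross_profit / gross_loss


pf = profit_factor([100.0, -50.0, 200.0, -100.0])   # 300 / 150 = 2.0
```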
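3.5 Payoff Ratio and Win Rate
    WR = winning trades / total trades
    Payoff = avg_win / avg_loss
The fundamental relationship between WR and Payoff:
    Expectancy = WR * avg_win - (1 - WR) * avg_loss
    Break-even (Expectancy = 0): WR = 1 / (1 + Payoff)
With Payoff = 2 (avg_win = 2 * avg_loss): break-even WR = 1/3 ≈ 33%
With Payoff = 1 (avg_win = avg_loss): break-even WR = 50%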
A system can be profitable with a low WR if the Payoff is high, and vice versa.
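The trade-off between win rate and payoff can be checked numerically. This sketch assumes the usual expectancy model, expressed in units of the average loss; the function names are hypothetical.

```python
def break_even_win_rate(payoff: float) -> float:
    """Win rate at which expectancy is zero, given payoff = avg_win / avg_loss."""
    return 1.0 / (1.0 + payoff)


def expectancy_in_losses(win_rate: float, payoff: float) -> float:
    """Expected result per trade, measured in units of avg_loss."""
    return win_rate * payoff - (1.0 - win_rate)


# A 40% win rate is profitable at payoff 2 but loses money at payoff 1
edge_high_payoff = expectancy_in_losses(0.40, 2.0)   # about +0.2 avg_loss per trade
edge_low_payoff = expectancy_in_losses(0.40, 1.0)    # about -0.2 avg_loss per trade
```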
4. Monte Carlo Simulation — Bootstrap Resampling
4.1 Bootstrap Theory
The bootstrap method (Efron, 1979) estimates the distribution of a statistic by sampling with replacement from historical data. In trading:
- The historical trades {R_1, R_2, ..., R_N} form the original sample
- Each simulation: sample N values with replacement → an alternative possible sequence
- Run 1000 simulations → distribution of possible trade sequences
Key assumption: trades are statistically independent (no serial correlation). This is a simplification: trades taken within the same trend may be correlated with each other. For systems with short holding periods (< 1 day), however, the assumption is usually reasonable.
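One simple way to sanity-check the independence assumption is to look at the lag-1 autocorrelation of the R-multiples. This check is a suggestion rather than part of the described system, and `lag1_autocorrelation` is a hypothetical helper. As a rough rule of thumb, values within about ±2/sqrt(N) are consistent with no serial correlation.

```python
from typing import Sequence


def lag1_autocorrelation(values: Sequence[float]) -> float:
    """Sample autocorrelation between each value and the next one."""
    n = len(values)
    if n < 2:
        return 0.0
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    if denom == 0:
        return 0.0   # constant series: correlation is undefined, report 0
    num = sum((values[i] - mean) * (values[i + 1] - mean) for i in range(n - 1))
    return num / denom


# A strictly alternating win/loss sequence is strongly negatively autocorrelated
rho = lag1_autocorrelation([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
```

If the autocorrelation is clearly outside that band, resampling individual trades will misstate streak risk, and the simulated drawdowns should be read with that in mind.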
4.2 Monte Carlo Algorithm
# Requires: import numpy as np
def run(self, trades, initial_equity=10000, n_trades=None):
    pnls = np.array([t.pnl_usd for t in trades], dtype=float)
    r_mults = np.array([t.r_multiple for t in trades], dtype=float)
    n_trades = n_trades or len(pnls)
    final_equities = np.zeros(self.n_simulations)
    max_drawdowns = np.zeros(self.n_simulations)
    sharpes = np.zeros(self.n_simulations)
    for i in range(self.n_simulations):
        # Resample trade indices with replacement, so that the PnL and the
        # R-multiple of each simulated trade come from the same historical trade
        idx = np.random.randint(0, len(pnls), size=n_trades)
        sim_pnls = pnls[idx]
        sim_r = r_mults[idx]
        # Equity path, starting at initial_equity
        equity_path = initial_equity + np.cumsum(np.concatenate([[0], sim_pnls]))
        final_equities[i] = equity_path[-1]
        # Max drawdown for this simulation, in percent of the running peak
        peak = np.maximum.accumulate(equity_path)
        dd_pct = (peak - equity_path) / peak * 100
        max_drawdowns[i] = np.max(dd_pct)
        # Sharpe (on R-multiples) for this simulation
        std_r = np.std(sim_r)
        sharpes[i] = np.mean(sim_r) / std_r if std_r > 0 else 0.0
4.3 Statistical Output
Final equity distribution (after N trades):
    P5 (worst 5%):  [equity_p5]   <- pessimistic scenario
    P25:            [equity_p25]
    P50 (median):   [equity_p50]  <- typical scenario
    P75:            [equity_p75]
    P95 (best 5%):  [equity_p95]  <- optimistic scenario
Max Drawdown distribution:
    DD_P50 (median):    [dd_p50]  <- "normal" drawdown
    DD_P75:             [dd_p75]  <- drawdown in an adverse scenario
    DD_P95 (worst 5%):  [dd_p95]  <- drawdown in a very adverse scenario
Probability of ruin (equity < threshold, default 50% of capital):
    P(ruin) = (simulations with equity < threshold) / n_simulations
Sharpe Confidence Interval (90%):
    CI_90 = [P5(Sharpe), P95(Sharpe)]
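The statistics listed above can be derived from the per-simulation arrays with a short sketch. Everything here is illustrative: `percentile` and `summarize` are hypothetical helpers, the dictionary keys are made-up labels, ruin is measured on final equity (the section does not say whether the engine uses final or intra-path equity), and the 90% interval is taken as the 5th to 95th percentile of the simulated Sharpes.

```python
from typing import Dict, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Linearly interpolated percentile, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def summarize(final_equities: Sequence[float], max_drawdowns: Sequence[float],
              sharpes: Sequence[float], initial_equity: float,
              ruin_fraction: float = 0.5) -> Dict[str, object]:
    """Collapse simulation results into the summary statistics."""
    n = len(final_equities)
    ruin_threshold = initial_equity * ruin_fraction
    return {
        "equity_p5": percentile(final_equities, 5),
        "equity_p50": percentile(final_equities, 50),
        "equity_p95": percentile(final_equities, 95),
        "dd_p50": percentile(max_drawdowns, 50),
        "dd_p95": percentile(max_drawdowns, 95),
        # Share of simulations that end above the starting equity
        "prob_profitable": sum(e > initial_equity for e in final_equities) / n,
        # Share of simulations that end below the ruin threshold (assumption)
        "prob_ruin": sum(e < ruin_threshold for e in final_equities) / n,
        "sharpe_ci90": (percentile(sharpes, 5), percentile(sharpes, 95)),
    }
```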
4.4 Interpreting the Results
Example output for a system with 100 historical trades, $10,000 equity, 1000 simulations:
MONTE CARLO SIMULATION (1000 runs x 100 trades)
====================================================
Initial Equity: $10,000.00
--- Final Equity Distribution ---
P5 (worst 5%): $ 8,450.00 <- 5% of runs end here or lower: a loss of $1,550 or more
P25: $ 9,800.00
P50 (median): $11,200.00 <- median outcome: +12%
P75: $12,600.00
P95 (best 5%): $14,300.00
--- Max Drawdown Distribution ---
P50 (median): 12.3%
P75: 18.7%
P95 (worst 5%): 31.2% <- rare but possible: about $3,120 from a $10,000 peak
--- Probabilities ---
Profitable: 82.3% <- 82% of sequences are profitable
Ruin (50% loss): 0.2% <- very rare with conservative sizing
--- Sharpe (90% CI) ---
Mean: 0.842
Range: [0.312, 1.498] <- wide variability: more history needed
4.5 Trade Count for Statistical Significance
Law of Large Numbers: the sample mean converges to the true mean as N grows.
Standard error of the mean:
    SE = sigma / sqrt(N)
To get a 95% CI on the Sharpe with width < 0.5:
    2 * z * SE < 0.5  →  N > (2 * z * sigma / 0.5)^2
With sigma = 1 (R-multiples with std=1), z = 1.96:
    N > (2 * 1.96 / 0.5)^2 = 7.84^2 ≈ 61.5
You need at least 62 trades to estimate the Sharpe to within ±0.25 at 95% confidence. In practice, > 100 trades is recommended.
With fewer than 30 trades any performance statistic is mostly noise, which is why the scoring engine falls back to a default of 0.5 in that case.
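The sample-size arithmetic can be reproduced with a short helper. This is a sketch under the same simplification as the text (the standard error of the per-trade Sharpe is approximated by sigma / sqrt(N)); `min_trades_for_ci_width` is a hypothetical name.

```python
import math


def min_trades_for_ci_width(width: float, sigma: float = 1.0, z: float = 1.96) -> int:
    """Smallest N such that the CI (mean +/- z * sigma / sqrt(N)) is narrower than width."""
    return math.ceil((2.0 * z * sigma / width) ** 2)


n_required = min_trades_for_ci_width(0.5)   # 62 trades for a total width of 0.5
```

Halving the target width quadruples the required number of trades, which is why tight confidence intervals are expensive in trade count.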
5. Journal Analytics — Machine Learning Dataset
The TradeLogger saves, for every trade, the full context at entry time:
from dataclasses import dataclass

@dataclass
class TradeSnapshot:
    # Identity
    trade_id: str
    strategy: str
    symbol: str
    direction: str
    timestamp_ms: int
    # Prices
    entry_price: float
    sl_price: float
    tp_price: float
    quantity: float
    # Orderflow context (ML features)
    regime: str
    cvd_1m: float
    cvd_5m: float
    cvd_15m: float
    book_imbalance: float
    aggression_ratio: float
    volume_zscore: float
    vwap_z: float
    oi_change_pct: float
    kyle_lambda: float
    is_absorption: bool
    is_liq_vacuum: bool
    # Outcome (filled at exit)
    exit_price: float = 0.0
    exit_reason: str = ""  # "tp", "sl", "manual"
    pnl_usd: float = 0.0
    r_multiple: float = 0.0
    duration_minutes: float = 0.0
    win: bool = False
ML use: after 200+ trades there is enough data to train a binary classifier (win/loss) using the orderflow features as inputs. Possible approaches:
- Logistic Regression (interpretable baseline)
- Gradient Boosting (XGBoost/LightGBM) to capture non-linearities
- Random Forest for feature importance
The target: predict whether a trade will win → feed the predicted winrate into factor 3 of the risk engine.
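As a first step toward such a classifier, the logged snapshots have to be turned into a feature matrix and a label vector. The sketch below is an assumption about how that could look, not the project's pipeline: `to_xy` is a hypothetical helper that takes snapshots as plain dicts (for example from `dataclasses.asdict`), uses the numeric and boolean orderflow fields listed above, and leaves out the categorical `regime` field, which would need an encoding of its own.

```python
from typing import Dict, List, Tuple

# Field names taken from the TradeSnapshot dataclass above
NUMERIC_FEATURES = [
    "cvd_1m", "cvd_5m", "cvd_15m", "book_imbalance", "aggression_ratio",
    "volume_zscore", "vwap_z", "oi_change_pct", "kyle_lambda",
]
BOOL_FEATURES = ["is_absorption", "is_liq_vacuum"]


def to_xy(snapshots: List[Dict]) -> Tuple[List[List[float]], List[int]]:
    """Build feature rows X and win/loss labels y from closed-trade snapshots."""
    X: List[List[float]] = []
    y: List[int] = []
    for snap in snapshots:
        row = [float(snap[name]) for name in NUMERIC_FEATURES]
        row += [1.0 if snap[name] else 0.0 for name in BOOL_FEATURES]
        X.append(row)
        y.append(1 if snap["win"] else 0)
    return X, y
```

The resulting `X` and `y` are plain lists, so they can be passed to whichever classifier library is chosen. Only features recorded at entry time belong in `X`; outcome fields such as `pnl_usd` or `exit_reason` would leak the label.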