Statistical tests tell you whether observed score differences reflect real performance gaps or could plausibly be due to chance.
Wilson Score CI — confidence intervals on win rates that work correctly even with small sample sizes (unlike normal approximation).
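A minimal sketch of the Wilson interval in pure-stdlib Python (`wilson_ci` is an illustrative name, not an API from this project):

```python
import math

def wilson_ci(wins, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% for z=1.96).

    Unlike the normal approximation, this stays inside [0, 1] and
    remains well-behaved for small n or extreme win rates.
    """
    if n == 0:
        return (0.0, 1.0)
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)

# 7 wins out of 10: the interval is wide, reflecting the small sample
lo, hi = wilson_ci(7, 10)
```

Note how the interval's center is pulled toward 0.5 relative to the raw win rate; that shrinkage is what keeps the bounds honest at small n.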
Bootstrap CI — 1,000-resample confidence intervals on score differences between models, giving you a plausible range for "how much better is Model A than Model B."
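A percentile-bootstrap sketch for paired per-example scores (stdlib only; the function name and the percentile variant are illustrative assumptions, not this project's exact implementation):

```python
import random

def bootstrap_diff_ci(scores_a, scores_b, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(A - B) over paired per-example scores."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    stats = []
    for _ in range(n_resamples):
        # Resample the paired differences with replacement
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

If the interval excludes zero, the observed advantage is unlikely to be a resampling artifact; resampling the *paired* differences (rather than each model independently) keeps per-example difficulty controlled.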
Wilcoxon Signed-Rank — non-parametric pairwise significance test. Doesn't assume scores are normally distributed. Reports p-values for every model pair.
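A self-contained sketch of the signed-rank test using the large-sample normal approximation (reasonable for roughly 20+ paired examples; exact small-sample tables, as used by e.g. `scipy.stats.wilcoxon`, are omitted here for brevity):

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank p-value for paired samples x, y.

    Uses the normal approximation to the W+ statistic; zero differences
    are dropped, and tied absolute differences receive averaged ranks.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    # Rank absolute differences, averaging ranks within tie groups
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    # Two-sided p-value from the standard normal tail
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
```

Because it only uses ranks of the differences, a single outlier example can't dominate the result the way it can in a paired t-test.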
Holm–Bonferroni Correction — adjusts p-values when comparing multiple models simultaneously, preventing false positives from repeated testing.
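The step-down adjustment itself is a few lines (a stdlib sketch; the function name is illustrative):

```python
def holm_bonferroni(pvalues):
    """Holm step-down adjusted p-values, returned in the input order.

    The smallest raw p-value is multiplied by m, the next by m - 1, and
    so on; a running maximum enforces monotonicity of the adjusted values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        adj = min(1.0, (m - rank) * pvalues[i])
        running_max = max(running_max, adj)
        adjusted[i] = running_max
    return adjusted
```

Compare each adjusted p-value to your alpha as usual; Holm is uniformly more powerful than plain Bonferroni while still controlling the family-wise error rate.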
Statistical Power Analysis — estimates whether your sample size is large enough to detect meaningful differences, so you know if "not significant" means "no difference" or "not enough data."
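A normal-approximation sketch of both directions of the calculation (required n for a target power, and the power achieved at a given n). `statistics.NormalDist` requires Python 3.8+; the effect size is a standardized mean difference (Cohen's d), and a t-based calculation would give slightly larger sample sizes:

```python
from math import ceil, sqrt
from statistics import NormalDist

_ND = NormalDist()

def required_n(effect_size, alpha=0.05, power=0.8):
    """Paired sample size needed to detect a standardized effect
    with a two-sided test at the given alpha and power."""
    z_alpha = _ND.inv_cdf(1 - alpha / 2)  # ~1.96 for alpha=0.05
    z_beta = _ND.inv_cdf(power)           # ~0.84 for power=0.8
    return ceil(((z_alpha + z_beta) / effect_size) ** 2)

def achieved_power(effect_size, n, alpha=0.05):
    """Probability of detecting the effect with n paired examples."""
    z_crit = _ND.inv_cdf(1 - alpha / 2)
    return 1 - _ND.cdf(z_crit - effect_size * sqrt(n))
```

If `achieved_power` comes back low, a non-significant comparison mostly reflects an underpowered benchmark, not evidence that the models are equivalent.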