METHODOLOGY
EWA attributes each possession's win-probability swing across the ten players on the court via ridge regression. The pregame projection layer aggregates those ratings into team-level forecasts. Below: rolling-origin backtest, calibration, sensitivity sweeps, lineage, and code links. A plain-English version is on /about.
Roster-aware EWA improves over team-only on the pooled point estimate: Brier −3.58% (CIs exclude zero in 4/4 folds), log-loss −2.66% (4/4), margin RMSE −1.59% (3/4). The accuracy gain is +3.40 pp pooled and positive in every fold, but per-fold n = 400-440 underpowers the individual accuracy CIs. Market odds average 67.7% accuracy across folds — reported as a benchmark, not a target.
The 2024-25 fold (n_train = 5,822, n_test = 401) is shown as a representative slice. All five models are fit on the same train games and scored on the same test games. EWA uses the roster-aware aggregate (each team's most recent 30 train games). The three other folds (2021-22, 2022-23, 2023-24) show the same shape — see the rolling-origin table below for per-fold deltas. Lower is better for Brier and margin RMSE; higher is better for accuracy. Bracketed numbers are 95% bootstrap CIs.
| Model | Brier | Accuracy | Margin RMSE |
|---|---|---|---|
| Naive (50/50) | 0.2500 [0.250, 0.250] | 50% (expected) | 15.75 [14.7, 16.7] |
| Home court only | 0.2456 [0.241, 0.251] | 56.9% [51.9, 61.6] | 15.58 [14.5, 16.5] |
| Team identity (no players) | 0.2451 [0.240, 0.251] | 58.1% [53.4, 62.8] | 15.56 [14.5, 16.6] |
| EWA (roster-aware) | 0.2365 [0.228, 0.244] | 59.4% [54.4, 64.3] | 15.31 [14.2, 16.3] |
| Market (Vegas, de-vigged) | 0.2011 [0.184, 0.218] | 67.3% [62.6, 72.1] | N/A |
Market is included as benchmark/context. The accuracy gap (~8 pp pooled across folds) reflects information EWA does not use — line movement, sharp action, real-time injuries. We don't try to close it on this page.
Across the 4 folds, here is how often EWA's improvement over team-only is statistically distinguishable from zero. The CIs are paired-bootstrap CIs computed within each individual fold (1,000 resamples, n ≈ 400-440 per fold).
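The paired-bootstrap delta reduces to a few lines. This is a minimal illustration using Brier as the metric; the function and argument names are ours, not the harness's:

```python
import numpy as np

def paired_bootstrap_delta(p_base, p_model, y, n_boot=1000, seed=0):
    """Mean Brier improvement of p_model over p_base, with a 95% CI.

    Games are resampled with replacement; each game keeps both of its
    predictions, so every resample compares the two models on the same
    games (the "paired" part). Toy sketch, not the production harness.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    base_sq = (np.asarray(p_base, dtype=float) - y) ** 2   # per-game Brier terms
    model_sq = (np.asarray(p_model, dtype=float) - y) ** 2
    per_game_delta = base_sq - model_sq                    # > 0: model beats base
    n = len(y)
    deltas = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)                   # resample games, pairs intact
        deltas[b] = per_game_delta[idx].mean()
    lo, hi = np.percentile(deltas, [2.5, 97.5])
    return per_game_delta.mean(), (lo, hi)
```

A delta whose (lo, hi) interval sits entirely above zero is what the ✓ marks in the table denote.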
Each row is an independent chronological fold: train strictly on games from prior seasons, test on one season's odds-matched games. The pattern holds across all four cutoffs — Brier and log-loss CI-exclude zero in 4/4 folds, margin RMSE in 3/4. Same direction, same approximate magnitude, every time.
| Test | n_train | n_test | EWA acc | Mkt acc | Δ Brier | Δ Log-loss | Δ RMSE |
|---|---|---|---|---|---|---|---|
| 2021-22 | 2,136 | 417 | 59.2% | 69.8% | +3.95% ✓ | +2.88% ✓ | +1.82% ✓ |
| 2022-23 | 3,366 | 404 | 60.9% | 64.4% | +3.11% ✓ | +2.34% ✓ | +1.23% ✗ |
| 2023-24 | 4,593 | 438 | 57.1% | 69.2% | +3.73% ✓ | +2.79% ✓ | +1.71% ✓ |
| 2024-25 | 5,822 | 401 | 59.4% | 67.3% | +3.51% ✓ | +2.63% ✓ | +1.61% ✓ |
✓ marks deltas whose 95% CI excludes zero within that fold. The five-model comparison above uses the most recent fold (2024-25); the other three cutoffs show the same shape. The roster-aware improvement is not a single-cutoff artifact.
We use a roster-aware recent-usage aggregate, defaulting to each team's last 30 games. Sensitivity checks across 15 / 30 / 45 / 60 games show the EWA signal is strongest in recent windows and fades as older roster usage is included — consistent with roster drift over time. The default of 30 was set as a disciplined mid-window value, not because it dominates any single metric.
| N | EWA Brier | Δ Brier | Δ Log-loss | Δ Margin RMSE |
|---|---|---|---|---|
| 15 | 0.2438 | +2.81% ✓ | +2.11% ✓ | +1.24% ✓ |
| 30 (default) | 0.2449 | +2.39% ✓ | +1.79% ✓ | +0.97% ✓ |
| 45 | 0.2473 | +1.44% ✗ | +1.08% ✗ | +0.62% ✓ |
| 60 | 0.2478 | +1.24% ✗ | +0.93% ✗ | +0.55% ✓ |
✓ marks deltas whose 95% CI excludes zero. The story is robust across recent windows: for 15 ≤ N ≤ 30, all three metrics (Brier, log-loss, margin RMSE) are statistically distinguishable from zero. At N = 45 and 60 the aggregate grows stale and only margin RMSE remains significant. We publish at the default window rather than the best-on-test window.
When EWA says a team has a 65% chance to win, do they actually win about 65% of the time? Each dot below is a probability bin from the held-out games — predicted on the x-axis, actual win rate on the y-axis. Perfect calibration is the dashed diagonal. Dot size shows games per bin.
Central bins are the populated ones in this fold (n = 93, 181, 128, 22). Calibration drifts a little at the high end on this 438-game test set — fewer games per bin means more sampling noise. We treat calibration as a property to monitor across runs, not a single number.
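The binning behind the reliability plot is simple enough to sketch. The bin edges below are illustrative, not the site's actual binning:

```python
import numpy as np

def calibration_bins(p_pred, y, edges=(0.0, 0.35, 0.5, 0.65, 0.8, 1.0)):
    """Bucket predictions and compare mean prediction to actual win rate.

    Returns (mean_pred, win_rate, n) per populated bin; a calibrated
    model has mean_pred close to win_rate in every bin, which is the
    dashed diagonal in the plot.
    """
    p_pred = np.asarray(p_pred, dtype=float)
    y = np.asarray(y, dtype=float)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p_pred >= lo) & (p_pred < hi)
        if mask.any():  # skip empty bins rather than divide by zero
            rows.append((p_pred[mask].mean(), y[mask].mean(), int(mask.sum())))
    return rows
```

Sparse high-probability bins are exactly where the per-bin win rate gets noisy, which is why we monitor calibration across runs instead of quoting one number.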
The simplest impact stat is raw plus-minus — point differential while a player is on the court. It looks honest and breaks immediately. In recent seasons, players like Payton Pritchard and Luke Kornet have posted higher raw on-court plus-minus than Stephen Curry, Giannis Antetokounmpo, and Luka Dončić. Not because they generate more impact — because they happen to share the floor with stars on winning teams.
Ridge regression with player-level controls is what fixes this. EWA splits credit in a way that controls for teammates and opponents, so a strong rotation player on a great team doesn't inherit his teammates' impact. That's the attribution layer. Shrinkage then ensures small-sample players don't ride a hot streak to the top of the rankings.
Nikola Jokić's rate over the last three seasons is +8.16 EWA per 100 possessions. Decomposed by role, 84% of that comes from assisting — not scoring, not rebounding. His best pair with Jamal Murray adds +1.4 EWA together; strong, but they underperform what you'd expect from stacking their individual numbers. That's the kind of read no box score or single-number metric gives you.
A sequence model trained on play-by-play estimates win probability after every event. The change in win probability across each possession (WPA) is the unit of credit.
A regularized regression splits each possession’s WPA across the ten players on court while controlling for teammates, opponents, and home court. This is the regularized adjusted plus-minus tradition (Sill 2010), with role-aware interactions added on top.
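A minimal sketch of that attribution step, assuming a plain offense/defense design matrix. The production model in unified_scores.py also carries home-court and role-aware interaction columns, which are omitted here:

```python
import numpy as np

def ridge_attribution(stints, wpa, n_players, alpha=5000.0):
    """RAPM-style split of per-possession WPA across players on court.

    `stints` is a list of (offense_ids, defense_ids) per possession;
    `wpa` is that possession's win-probability swing from the offense's
    perspective. Offense players get +1, defenders -1, and the ridge
    penalty (alpha) keeps collinear lineups from producing wild
    coefficients.
    """
    X = np.zeros((len(stints), n_players))
    for i, (off, deff) in enumerate(stints):
        X[i, list(off)] = 1.0
        X[i, list(deff)] = -1.0
    y = np.asarray(wpa, dtype=float)
    # Closed-form ridge: (X'X + alpha * I)^-1 X'y
    A = X.T @ X + alpha * np.eye(n_players)
    return np.linalg.solve(A, X.T @ y)
```

Lineups overlap heavily (teammates appear in the same rows almost every possession), so the unpenalized least-squares solution is ill-conditioned; the ridge penalty is what makes the per-player split stable.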
Players with few possessions get pulled toward the population mean by both a count-based shrinkage (count / (count + k)) and an Empirical Bayes step. This is what keeps a 100-possession rookie from showing up next to Jokić on the leaderboard.
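The count-based part of that shrinkage is one line; k below is an illustrative pseudo-count, not the production value:

```python
def shrink_to_mean(raw_rating, possessions, pop_mean=0.0, k=1500):
    """Count-based shrinkage: a fraction count / (count + k) of the raw
    rating survives; the rest is pulled toward the population mean.
    With few possessions the weight is near 0 (rating collapses to the
    mean); with many it approaches 1 (rating stands on its own).
    """
    w = possessions / (possessions + k)
    return w * raw_rating + (1 - w) * pop_mean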
For game prediction, per-team EWA aggregates use each team's most recent 30 train games — not a static average across the whole training period. This keeps the predictor honest about mid-season trades and roster turnover.
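A toy version of that rolling aggregate (hypothetical helper name; the real aggregation lives in unified_scores.py):

```python
from collections import defaultdict, deque

def rolling_team_ewa(games, window=30):
    """Per-team EWA aggregate over each team's most recent `window`
    train games. `games` is an iterable of (team, ewa_total) pairs
    already sorted chronologically; the bounded deque drops each
    team's oldest games as new ones arrive.
    """
    recent = defaultdict(lambda: deque(maxlen=window))
    for team, ewa in games:
        recent[team].append(ewa)
    return {team: sum(q) / len(q) for team, q in recent.items()}
```

Because the deque is per-team and bounded, a midseason trade shows up in the aggregate within at most `window` games rather than being diluted across a full-season average.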
EWA isn't a new technique. It's an honest reassembly of established methods with a transparent validation harness on top.
- Regularized adjusted plus-minus via ridge regression: the base technique behind EWA's attribution layer.
- Possession-level win-probability swings as a credit signal: EWA inherits this framing rather than the raw point-differential one.
- Statistical / Box Plus-Minus: where role and box-stat information enter as priors. EWA's role-aware interactions are in this tradition.
- EPM and DARKO, the two strongest public predictive metrics: EWA borrows their commitment to chronological holdout testing and roster-aware aggregation.
Reading these openly is the price of asking you to trust the rest. Every limitation below is on the roadmap and labeled in our internal validation reports.
The validation code is open and runnable. The numbers above came from scripts/validate_pregame_prediction.py with --recent-games-per-team 30 on a chronological holdout. The window-sensitivity sweep ran via scripts/sweep_recent_games_window.sh. The attribution math lives in unified_scores.py.
- Validation harness: scripts/validate_pregame_prediction.py
- Engine + ridge attribution: unified_scores.py
- Robustness sweep: ridge alpha (2,500 / 5,000 / 7,500 / 10,000) and bootstrap seeds across the 4 rolling-origin folds. Demonstrates the result is not a single-hyperparameter or single-seed artifact.
- Replace per-team possession averages with per-player rolling minute estimates. Closes part of the gap to EPM/DARKO's richer minute models.
- Counterfactual calculator: "if Player X is out, EWA moves N points." The most direct expression of EWA's player-level attribution and the natural foundation for a paid analytics tier.
- Daily-refreshed pregame projections that incorporate the day's active rosters and inactives. Today's harness uses recent training data; the live layer uses recent live data.
- Retrain the win-probability model with a strict cutoff before each test window so the WPA labels themselves are leakage-free. The current harness uses the production WP model and discloses that limitation; this closes it.
Plus/minus measures point differential while you're on court. EWA measures how much each possession changed win probability — weighting high-leverage moments more — and then splits credit fairly via ridge regression. Plus/minus conflates your impact with your teammates'.
EWA captures context. A star on a dominant team faces fewer high-leverage possessions because the game state is already stable. The public scores also apply shrinkage, so lower-volume players get pulled toward the middle.
Score artifacts refresh on a daily cadence; the underlying win-probability model is retrained on a slower review cycle. The footer shows the most recent promoted run currently being served.
Market is a fifth column in our validation table — we have multi-season de-vigged moneylines for 1,954 NBA games matched cleanly to game IDs. Across all 4 rolling-origin folds, market accuracy averages ~67.7%; roster-aware EWA averages ~59.2%. The ~8 pp gap is real and reflects information markets have that we don't (sharp action, line movement, real-time injuries). We report it as a benchmark, not a target.
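For context, the basic multiplicative de-vig on a two-way moneyline looks like this (a sketch of the standard normalization; the production pipeline may use a different de-vig method):

```python
def devig_two_way(home_ml, away_ml):
    """Strip the bookmaker's margin from a two-way moneyline.

    Converts American odds to implied probabilities, then normalizes
    them to sum to 1. The raw implied probabilities sum to more than 1;
    that overround is the vig.
    """
    def implied(ml):
        # American odds to implied win probability
        return 100 / (ml + 100) if ml > 0 else -ml / (-ml + 100)
    p_home, p_away = implied(home_ml), implied(away_ml)
    total = p_home + p_away
    return p_home / total, p_away / total
```

A pick-em line of -110 / -110 de-vigs to 50/50, since the bookmaker's margin is split evenly.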
Those are the four most recent NBA seasons where we have both play-by-play data and de-vigged pregame moneylines, and where each fold has a strictly older training set available. The pattern (Brier and log-loss CIs excluding zero, margin RMSE excluding zero in 3/4) holds across every fold tested.
Yes — that's what /predictions is. Every game we predict, you can see what the model said and (after the game) whether it called the winner. Across the four published rolling-origin folds, EWA accuracy averages 59.2%; the de-vigged Vegas market averages 67.7%. EWA beats team-only baselines but doesn't approach the market — Vegas has information we don't (sharp action, line movement, real-time injuries). The page tracks the model's live record so you can see exactly how it's doing.