*All of the data and models described in this series of articles can be downloaded here:*

In this series of articles, I’m going to show you how to build a Bayesian model to solve a “small data” problem.

We’re going to start very simple, so simple that you’ll be doing Bayesian analysis without even realizing it…kind of like how your mother would hide vegetables in your food! We will gradually ramp up the complexity and explore some of the theoretical basis behind what we’re doing. The series is set up so that you can stop after any stage if it’s getting too heavy, and you will still come away with the tools to build a usable model, plus a worked example of one. If you’re able to stay with me all the way, the “end boss” model is something pretty cool that I’ve developed myself and that, at least as far as my research can tell, nobody else has ever done before.

Oh yeah, and we’re doing the whole thing in Excel.

To demonstrate the models, I’ve selected a real-world problem that is well-suited to the topic of “small data” and can be easily applied directly to the betting markets: the problem of defensive shooting percentages in the NBA. That is, to what extent are a team’s shooting percentages allowed in a given game on 3 point shots, 2 point shots and free throws (yes, free throws) related to that team’s season-to-date observed shooting percentages allowed on each of those shot types?

We’ll be using data from the last full NBA season, 2018-2019. During that season, the league averages for 3P%, 2P% and FT% were 35.5%, 52.0% and 76.6%.

Let’s illustrate how the models will work by zooming in on a randomly selected game during the 2018-2019 season: Orlando at Miami on December 4. This was Miami’s 23rd game and Orlando’s 24th game of the season. If we add up the 22 Miami games and 23 Orlando games prior to December 4, we get the following defensive shooting percentages:

**Miami 3P% allowed: 234 / 658 = 35.6%**

**Miami 2P% allowed: 639 / 1299 = 49.2%**

**Miami FT% allowed: 431 / 567 = 76.0%**

**Orlando 3P% allowed: 255 / 705 = 36.2%**

**Orlando 2P% allowed: 664 / 1290 = 51.5%**

**Orlando FT% allowed: 394 / 508 = 77.6%**

If we zoom in on the Miami 2P% allowed of 49.2%, that is quite a bit lower (better) than the league average of 52.0%. There are two possible reasons why, plus a spectrum in between:

- If we believe it’s because of Miami’s defensive ability, then we would project them to allow a 2P% of 49.2% in a future game against an average offense.
- If we believe it’s because of good luck, then we would project them to allow a 2P% of 52.0% in a future game against an average offense.
- If we believe it’s some combination of Miami’s defensive ability and good luck, then we would project them to allow a 2P% somewhere in between 49.2% and 52.0% in a future game against an average offense.
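That third bullet is the heart of everything that follows: a projection that blends the observed percentage with the league average. Although this series does its work in Excel, here is a minimal Python sketch of the idea. The function name `project` and the weight `w` are my own placeholders; later parts of the series deal with where the weight actually comes from.

```python
def project(observed_pct, league_pct, w):
    """Blend an observed percentage with the league average.

    w = 1.0 trusts the observed stat entirely (all skill);
    w = 0.0 trusts the league average entirely (all luck).
    """
    return w * observed_pct + (1 - w) * league_pct

# Miami 2P% allowed: 49.2% observed vs. a 52.0% league average.
# With an arbitrary illustrative weight of w = 0.5, the projection
# lands halfway between the two.
print(round(project(0.492, 0.520, 0.5), 3))  # 0.506
```

The “all luck” Level 1 model below is just the `w = 0.0` corner of this blend.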

Our first model is barely a model at all, so we’re going to call it “level 1”. Let’s suppose that we believe that the variance in defensive shooting percentages from team to team is entirely because of random luck. We’re going to quantify that random luck and remove it from the stats in order to project each team’s defensive ability in this game.

First let’s calculate each team’s points allowed per game.

Miami: (3 * 234 + 2 * 639 + 1 * 431) / (22 games) = 109.6 PPG allowed.

Orlando: (3 * 255 + 2 * 664 + 1 * 394) / (23 games) = 107.7 PPG allowed.
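These per-game figures are simple weighted sums of the made shots. As a quick Python check of the Miami line (the function name is mine; the makes and games are the totals quoted above):

```python
def ppg_allowed(threes, twos, fts, games):
    """Points per game allowed, from made 3s, made 2s and made FTs."""
    return (3 * threes + 2 * twos + 1 * fts) / games

# Miami's totals through 22 games, from the shooting lines above.
miami = ppg_allowed(threes=234, twos=639, fts=431, games=22)
print(round(miami, 1))  # 109.6
```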

Next, we will adjust those PPG stats by replacing the actual shooting percentages with the league averages.

Miami: (3 * 658 * 35.5% + 2 * 1299 * 52.0% + 1 * 567 * 76.6%) / (22 games) = 111.4 adjusted PPG allowed.

Orlando: (3 * 705 * 35.5% + 2 * 1290 * 52.0% + 1 * 508 * 76.6%) / (23 games) = 107.5 adjusted PPG allowed.
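The adjustment substitutes the league-average percentages into the same weighted sum, using each team’s *attempt* counts. Here is a sketch of the method; note that it uses the rounded totals and league averages quoted above, so it may not reproduce the article’s spreadsheet figures exactly (those may have been computed from unrounded data).

```python
# 2018-19 league averages, as quoted above (rounded).
LEAGUE_3P, LEAGUE_2P, LEAGUE_FT = 0.355, 0.520, 0.766

def adjusted_ppg_allowed(attempts_3, attempts_2, attempts_ft, games):
    """PPG allowed if every attempt were converted at the league rate."""
    return (3 * attempts_3 * LEAGUE_3P
            + 2 * attempts_2 * LEAGUE_2P
            + 1 * attempts_ft * LEAGUE_FT) / games

# Miami's attempt totals through 22 games, from the percentages above.
print(adjusted_ppg_allowed(658, 1299, 567, games=22))
```

A useful sanity property: a team that shoots exactly league average on every shot type has identical actual and adjusted PPG allowed, so the adjustment only moves teams whose percentages deviate from the averages.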

How can we use these adjusted PPG allowed stats to form a betting strategy? Let’s suppose that any difference between a team’s actual PPG allowed and their adjusted PPG allowed is predictive signal that is not accounted for in the betting lines. That would mean that Miami’s defense is 1.8 PPG **worse** (109.6 actual vs. 111.4 adjusted) and Orlando’s defense is 0.2 PPG **better** (107.7 actual vs. 107.5 adjusted) than the betting markets think they are. Therefore,

- Betting on Miami would be **unfavorable** by 2.0 points;
- Betting on Orlando would be **favorable** by 2.0 points;
- Betting on Over would be **favorable** by 1.6 points;
- Betting on Under would be **unfavorable** by 1.6 points.
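The mapping from the two defensive deltas to side and total edges can be sketched as follows. The function names and the sign convention are mine: a positive delta means a defense is worse than the market believes, so the opponent’s side and the Over both benefit from it.

```python
THRESHOLD = 1.0  # minimum edge (points) believed to overcome the vig

def edges(home_delta, away_delta):
    """Translate defensive deltas (adjusted minus actual PPG allowed,
    positive = defense worse than the market believes) into point edges
    for the away side and the Over. The home side and Under edges are
    just the negations."""
    away_side = home_delta - away_delta  # away scores more, allows less
    over = home_delta + away_delta       # both deltas push the total
    return away_side, over

# Miami (home) is +1.8 worse, Orlando (away) is -0.2 (i.e. 0.2 better):
orl_edge, over_edge = edges(1.8, -0.2)
print(orl_edge, over_edge)  # 2.0 1.6

bets = [name for name, e in [("Orlando", orl_edge), ("Over", over_edge),
                             ("Miami", -orl_edge), ("Under", -over_edge)]
        if e >= THRESHOLD]
print(bets)  # ['Orlando', 'Over']
```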

Let’s create a threshold of 1.0 points minimum advantage that we believe is sufficient to overcome the vig and create a +EV situation. In this game, we would make bets on Orlando (at a closing spread of +2.5) and Over (at a closing total of 208). Orlando won the game 105-90, so we would have split our bets for a net result of -0.1 units (assuming -110 lines).
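The -0.1 figure is just one winning and one losing bet at -110 pricing (risking 1.1 units to win 1). A small grading sketch, with my own hypothetical `settle` helper:

```python
def settle(bets, home_score, away_score, spread_away, total):
    """Grade side/total bets at -110: a win pays +1.0, a loss costs -1.1.

    `spread_away` is the away team's closing spread; `total` is the
    closing total.
    """
    margin = away_score - home_score
    outcomes = {
        "Orlando": margin + spread_away > 0,      # away side covers
        "Over": home_score + away_score > total,  # game goes over
    }
    return sum(1.0 if outcomes[b] else -1.1 for b in bets)

# Orlando won 105-90 at Miami; we bet Orlando +2.5 and Over 208.
net = settle(["Orlando", "Over"], home_score=90, away_score=105,
             spread_away=2.5, total=208)
print(round(net, 1))  # -0.1
```

Orlando +2.5 cashes (they won outright), but the 195 combined points fall well short of 208, so the Over loses and the pair nets -0.1 units.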

We can back-test our model using the entire 2018-19 season, giving the following results:

Build stupid models, get stupid results. Over the course of the season, our model lost 52.5 units and had a -3.2% ROI. The bet volume is a big red flag – our model made 1512 bets in 1230 games, and it’s unlikely that a market as developed and liquid as NBA sides and totals would be that wrong, that often.

So our Level 1 model is clearly no good. And yet, there are a couple of things in this back test that give us a glimmer of hope:

- The win-loss record of 756-735 was not good enough to beat the vig, but it was better than 50%. Was it significantly better than 50%? We can answer this question by borrowing a technique from our friendly but misguided neighbors, the frequentists. We ask ourselves, “supposing we were flipping coins and had a true win probability of 50%, what is our likelihood of observing at least 756 wins out of 1491 decided bets?” This can be answered easily in Excel with the formula 1 – BINOM.DIST(755, 1491, 0.5, TRUE) = 0.302. So our back test result is not very statistically significant and could easily have been caused by random good luck.

- Our model did well in October and November and then got worse as the season progressed. If we accept that our original hypothesis was wrong and there’s at least some predictive signal in a team’s actual shooting percentages allowed, then a luck-only model should be most wrong late in the season, when the actual percentages have significant sample sizes underlying them, right?
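The coin-flip check above can also be reproduced outside of Excel. Here is the exact binomial tail probability in Python, using only the standard library (the function name is mine):

```python
from math import comb

def binom_tail(wins, n):
    """P(X >= wins) for X ~ Binomial(n, 0.5), computed exactly.

    With p = 0.5 the PMF is comb(n, k) / 2**n, so we just sum the
    upper tail.
    """
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

# 756 wins out of 1491 decided bets (pushes excluded):
print(round(binom_tail(756, 1491), 3))  # 0.302
```

Exact big-integer arithmetic makes this slower than a normal approximation but removes any worry about approximation error in the tail.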
