We ended Part 3 with the Shyamalan-eque twist that the simple ballast model we’ve been building has been a Bayesian model all along. Time to take a step back and look at the same model through a Bayesian lens.
I mentioned way back in Part 1 that we’d be gradually building this thing up, and that you can stop at any stage and you’d have a usable model. For readers who don’t have at least some basic probability and statistics background (to the point of, say, knowing what a probability distribution is), you will have a bit of a tougher climb from here. You’re welcome to stay with me, but what follows is a not-dumbed-down exploration of some fairly advanced concepts. If the ballast model is as far as you can get, no worries! It will serve you well in a wide variety of real-world modeling problems.
Ready? OK, let’s leave base camp and start the trek to the summit.
The number of made 3-pointers / 2-pointers / free throws allowed by a given team in a given game is a textbook case for the Binomial distribution. Binomial requires two parameters: an “n” which is the number of attempts, and a “p” which is the probability of success. We’re going to take the n’s as given and focus on modeling the p’s, which of course are the same 3P%, 2P% and FT% projected shooting percentages we’ve been dealing with all along.
So, to summarize where we are so far,
Number of 3PTM / 2PTM / FTM allowed by a given team in a given game is binomially distributed with n = 3PTA / 2PTA / FTA allowed which are given, and p = projected 3P% / projected 2P% / projected FT% allowed which must be estimated.
Just like in parts 1-3, we are coming up with our projected 3P% / 2P% / FT% allowed by combining two data sources: The overall averages, and the team’s season-to-date observed stats. Through our work with the ballast model, as well as our own intuition, we concluded that the weight given to the overall average should diminish as the sample size of the season-to-date stats increases. For the first game of the season where there are no season-to-date stats, we rely completely on the overall averages. Switching to Bayesian terminology, the overall averages are our prior estimates, and the projected percentages that we’re coming up with are our posterior estimates.
To simplify the discussion, let’s focus on the 3P% model – the others are functionally identical.
Back to our Miami example, we have:
Prior estimate of 3P% = 0.355
Season to date observed = 234 made in 658 attempts
Posterior estimate of 3P% = some kind of calculation using the 0.355, the 234 and the 658.
Now, we are using a binomial distribution with known n and unknown p. What makes Bayesian modeling work is that we can think of p itself as a random variable having its own distribution. We could get into a lot of philosophical discussion about what that means, but I’ve already done that so I won’t repeat it.
What makes Bayesian modeling in Excel work is a magical thing called “conjugate priors”. When you have a conjugate prior, all the advanced calculus involved in Bayesian probability calculations melts away and leaves you with simple, intuitive expressions for everything.
For a binomial distribution with known n and unknown p, the conjugate distribution that goes with it is called the beta distribution. If you look up the beta distribution you will see that it’s defined in a way that’s complicated and full of Greek letters. Fortunately, that’s the piece that melts away. For now, all you have to know about the beta distribution is that it lives between 0 and 1 (as it should, given that we’re using it for p) and it uses two parameters, alpha and beta. (Yes, beta parameter in a beta distribution. Don’t blame me, I didn’t invent it.)
So let’s recap the “Inception” type scenario we’re dealing with. Made baskets follow a binomial distribution with known n and unknown p. P has its own distribution, called a beta distribution, with its own parameters (“hyperparameters” if you’re fancy) alpha and beta. If you know alpha and beta, you know the distribution of p. If you know n and the distribution of p, you know the distribution of the number of made baskets. If you have to read this paragraph a few times to get it, that’s ok. Seems confusing at first but when it clicks, it’s like a spiritual awakening.
Conjugate distributions work by giving you a simple formula to transform prior hyperparameters + observed data into posterior hyperparameters. For the Binomial-Beta combination, the formulas are as follows:
Posterior alpha = prior alpha + observed number of successes
Posterior beta = prior beta + observed number of failures
Now, the mean of the beta distribution is alpha / (alpha + beta). From above, we know that the prior mean is 0.355. So, when we look at the prior hyperparameters there’s really only one unknown here, because we know that prior beta = 1.817 * prior alpha so that prior alpha / (prior alpha + prior beta) is 1 / 2.817 = 0.355.
Posterior alpha = prior alpha + observed number of successes = prior alpha + 234
Posterior beta = prior beta + observed number of failures = 1.817 * prior alpha + 424
Posterior estimate of p = posterior alpha / (posterior alpha + posterior beta) = (prior alpha + 234) / (2.817 * prior alpha + 658).
If that structure, (something + # of makes) / (something else + # of attempts), looks familiar – it should. Prior alpha is just a re-parameterization of our good old ballast; specifically, prior alpha is the overall average rate of 0.355 times the ballast.
So now all that’s remaining is to fit our prior alpha parameter to the data. We could do this by finding the prior alphas that minimize the squared error; if we do that, we will get a model that is identical to our Level 3 model. However, the upgraded technical sophistication we’ve introduced here allows us to fit our model in a better way using something called “maximum posterior likelihood”. Instead of minimizing the squared error, we maximize the logarithm of the probability of observing exactly what we’ve observed in the data set (the “loglikelihood”). For the Miami example, they gave up 14 3-pointers in 28 attempts. For any given prior alpha, we can calculate the posterior estimate of p using the formulas above, and we can use that to calculate the probability of exactly 14 makes in 28 attempts.
Now here’s the trickiest part of the whole thing. It would seem obvious that we’d use the binomial distribution with n = 28, x = 14 and p = our posterior estimate of p. However, the distribution is only binomial for FIXED p. If p is VARIABLE and beta distributed, the distribution turns into something called “beta-binomial”. The beta-binomial distribution is similar to the binomial distribution, but with higher variance. Why? Because the variance comes from two places: the natural variance inherent in the binomial distribution (“aleatory uncertainty”) and the variance that comes from the fact that p is unknown and must be estimated, potentially with estimation error (“epistemic uncertainty”).
As a mixture of binomial and beta, the beta-binomial distribution takes three parameters: the n from the binomial distribution, and the alpha and beta from the beta distribution.
There is no built-in Excel function for the beta-binomial distribution, so I had to google how to make my own in VBA:
Function beta_binomial_pmf(x As Integer, n As Integer, alpha As Double, beta As Double) As Double
beta_binomial_pmf = WorksheetFunction.Combin(n, x) * Exp(log_beta_function(x + alpha, n – x + beta) – log_beta_function(alpha, beta))
Function log_beta_function(alpha2 As Double, beta2 As Double) As Double
log_beta_function = WorksheetFunction.GammaLn(alpha2) + WorksheetFunction.GammaLn(beta2) – WorksheetFunction.GammaLn(alpha2 + beta2)
Okay, there’s a lot here. So let’s recap, using Miami 3-pointers allowed as an example. We’ll use 500 as our prior_alpha for now:
Season to date observed successes = 234
Season to date observed failures = 424
Prior alpha = 500
Prior beta = 1.817 * 500 = 908.4
Posterior alpha = 500 + 234 = 734
Posterior beta = 908.4 + 424 = 1332.4
Current game n = 28
Current game x = 14
Loglikelihood = ln(beta_binomial_pmf(14, 28, 734, 1332.4) = -3.12. (yes it’s supposed to be negative!)
Now all we have to do is calculate this across all shot types and all games, add up the loglikelihoods and find the prior alphas that maximize the total loglikelihood, and we have our Level 4 model:
Prior alpha for 3p = 2260.8
Prior alpha for 2p = 540.1
Prior alpha for ft = 5752.0
Total loglikelihood = -18133.6
Total squared error:
Note that using squared error, the level 4 model is worse than the level 3 model. That’s because the level 3 and 4 models are mathematically equivalent, the only difference being the way we fit them – in level 3 we minimized the squared error, in level 4 we maximized the loglikelihood.
Similar to level 3, which should not be surprising.
The prior alphas that we fit act kind of like a slider to adjust how much weight is given to the prior and how much is given to the new information. We can interpret it as follows:
Prior alpha will be higher if the prior is stronger AND/OR if the emerging data has a high ratio of noise to signal (like it does for free throws against).
Prior alpha will be lower if the prior is weaker AND/OR if the emerging data has a high ratio of signal to noise (like it does for 2 pointers against).
But how can a prior be strong or weak? We’re just using the overall average, it’s purposely not trying to predict anything so it’s pretty weak by definition…
…but does it have to be?