Introduction to Empirical Bayes using Baseball Statistics

This post draws extensively on material from the book Introduction to Empirical Bayes by David Robinson. I've found the book to be an excellent introduction to Empirical Bayes, and a great introduction to the Lahman Baseball dataset as well. You can find the book here. I strongly suggest it!

Empirical Bayes

Empirical Bayes refers to a "hacker" approach to a specific kind of Bayesian analysis. Empirical Bayes can be successfully employed when the data can be modelled by a Binomial distribution depending on parameters $N_i$ and $p_i$, where $p_i$ is the quantity we want to estimate for each data point. Crucially, it may be the case that each point has a specific $p_i$ that is not, in general, equal to any other $p_i$. In that case, we can think of the set $\{p_i\}$ as being drawn from a distribution that has a well-defined mean and variance. If that distribution is not uniform, then the data points should not be considered independent of each other: each point carries information about the plausible values of the other points. Therefore, we would like an approach that "borrows strength" from all the data points to help us infer the most likely value of $p_k$ for a data point $k$ for which we may otherwise have limited evidence (for example, if we wanted to know the hit rate of a baseball player who has only played five games, instead of 1,000).

Notice that the above scenario is different from what we often hope for in science. In scientific experiments, we will often set up an experimental design to measure a variable $p$, and the experiment will be replicated several times to generate various measurements of this variable. In that case, each replicate is considered to be a binomial draw using the same coin $p$, and the variability is attributed to random sampling. The two scenarios look similar, but they are fundamentally different, and in the data the difference shows up in the variance: tossing many different coins with binomial sampling results in a greater variance than tossing a single coin with binomial sampling.
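To make this concrete, here is a minimal sketch (not from the book; the 300 at-bats and the Beta(81, 219) prior are simply chosen to match the simulation further down) comparing the spread of hit counts in the two scenarios:

# Compare the spread of hit counts when every player shares a single coin
# versus when each player gets their own coin drawn from Beta(81, 219).
n <- 100000
p_mean <- 81 / (81 + 219)                         # mean of the Beta(81, 219) prior

one_coin   <- rbinom(n, 300, p_mean)              # single shared p
many_coins <- rbinom(n, 300, rbeta(n, 81, 219))   # one p per player

sd(one_coin)     # ~ 7.7: binomial sampling noise only
sd(many_coins)   # ~ 10.9: sampling noise plus the spread of the p_i's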

Empirical Bayes is the result of placing a Beta prior on a Binomial likelihood and evaluating the result (since the Beta is a conjugate prior for the Binomial, the result is an "updated" Beta distribution). The Beta distribution depends on two hyperparameters, $\alpha$ and $\beta$. A fully Bayesian treatment calls for priors on these two hyperparameters (and priors on the priors of the priors, etc.), but Empirical Bayes is happy to ignore any further priors and instead estimates $\alpha$ and $\beta$ directly from the data.
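As a minimal sketch of that conjugate update (the numbers here are made up for illustration): a $\mathrm{Beta}(\alpha, \beta)$ prior combined with $H$ hits in $AB$ at-bats yields a $\mathrm{Beta}(\alpha + H, \beta + AB - H)$ posterior.

# Sketch of the Beta-Binomial conjugate update (hypothetical numbers).
alpha <- 81; beta <- 219      # prior hyperparameters
H <- 30; AB <- 100            # evidence: 30 hits in 100 at-bats

post_alpha <- alpha + H       # the posterior is simply another Beta
post_beta  <- beta + AB - H   # with updated parameters

alpha / (alpha + beta)                  # prior mean: 0.27
H / AB                                  # raw average: 0.30
post_alpha / (post_alpha + post_beta)   # posterior mean: ~0.2775, a compromise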

Getting started

To start, I will load some default libraries in R, including the Lahman dataset.

In [17]:
library(dplyr)
library(stats4)
library(ggplot2)
library(tidyr)
library(viridis)
library(Lahman)

## set universal plot size:
options(repr.plot.width=5, repr.plot.height=3)
Loading required package: viridisLite

To motivate the problem, let's start with a simulation. Suppose we simulate 1 million baseball players, each of whom has exactly 300 at-bats (chances to hit the ball; I'm not a baseball player). For each player, we will draw a coin from a beta distribution, $p_i \sim \mathrm{Beta}(81, 219)$, and then we will simulate their hitting career with a Binomial distribution generated using that coin, $\mathrm{Bin}(300, p_i)$.

In [2]:
# Simulate player statistics. For each player, choose a coin from the beta
# distribution, Beta(81, 219). Then, simulate the number of balls they hit
# out of 300 at-bats, given the coin they were assigned.
num_trial <- 1e6   # one million simulated players

sims <- tibble(
    true_average = rbeta(num_trial, 81, 219),
    hits = rbinom(num_trial, 300, true_average)
)

sims %>% head(5)
true_average  hits
0.3137426     81
0.3141298     106
0.2944093     97
0.2468624     78
0.2416608     62

Now that we have a dataset, we can ask a range of questions. For example, what is the range of hit-rates ($p_i$'s) of players who hit exactly 100/300 balls?

In [3]:
# look at only players who hit exactly 100/300 balls
hit_100 <- sims %>%
  filter(hits == 100)

# Make a histogram of the players who hit 100 balls out of 300 attempts
# Also draw a KDE on top
hit_100 %>%
    ggplot(aes(true_average)) +
    geom_histogram(aes(y = ..density..)) +
    geom_density() +
    labs(x = 'True average of players with 100 hits / 300 total at-bats',
         y = 'Density')

# Print the median true average for the players who hit 100/300
median(hit_100$true_average)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
0.301503283873861

The above graph shows us that we would not be surprised if the players had a hit-rate between 0.25 and 0.35. However, if we were told that a player with a hit-rate of 0.4 scored 100 hits out of 300 at-bats, we ought to be very sceptical about the claimed hit-rate for that player.
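We can make "not surprised" slightly more precise by pulling the central 95% range of true averages among these simulated players (a one-liner, reusing the hit_100 subset defined above):

# central 95% range of true averages among players with exactly 100/300 hits
quantile(hit_100$true_average, c(0.025, 0.975))   # roughly 0.27 to 0.34 here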

Analyzing the Lahman dataset

Someone walks up and asks you who the best batter of all time is. Confident you can find the answer, you ask your database for the batter with the highest hit rate, computed as hits / total at-bats. The database returns a list of complete strangers with relatively few total batting attempts, and none of the all-time famous players. Clearly, the players with little evidence are dominating this calculation: they are outliers. But what is a good, principled way to downplay these outliers?

One way to downplay outliers is to shrink players with spectacular hit-rates towards the average rate in a manner that depends directly on the evidence they have for their hit rate. Players with little evidence would be "shrunk" aggressively towards the average, while players with a lot of evidence would barely be shrunk at all. The shrinkage is mediated by the prior distribution, which depends on two hyperparameters, $\alpha$ and $\beta$. The evidence is incorporated in the likelihood, which is a binomial distribution. Combined, they generate a posterior distribution that incorporates our prior beliefs AND the evidence for each player.
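Concretely, with a $\mathrm{Beta}(\alpha, \beta)$ prior and a player with $H$ hits in $AB$ at-bats, the posterior mean (the shrunken estimate we will compute below) is

$$ \hat{p} = \frac{H + \alpha}{AB + \alpha + \beta}, $$

which slides from the prior mean $\alpha/(\alpha + \beta)$ when $AB$ is small towards the raw average $H/AB$ when $AB$ is large.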

The obvious next step is to determine the correct shape of the prior. One way to get a (very good) idea about what the prior should look like is to look at the batting proportions of players who have been playing for long enough to gather credible statistics on their hit rates, and then fit a beta-binomial model to that data. This will get us the hyperparameters $\alpha$ and $\beta$ (but crucially it will not get us anything about the uncertainty in these hyperparameters; for that we would need an MCMC sampler).

Before we can do that, we should get the data, clean it up, and limit ourselves to batters, since pitchers are apparently terrible at batting and could introduce a systematic bias into our prior.

In [18]:
# AB is the total number of at-bats; H is the number of hits.
# Remove pitchers (who are bad at batting) using dplyr's anti_join function.
career <- Batting %>%
    filter(AB > 0) %>%
    anti_join(Pitching, by = 'playerID') %>%
    group_by(playerID) %>%
    summarize(H = sum(H), AB = sum(AB)) %>%
    mutate(average = H / AB)

# include names with playerID
career <- Master %>%
    as_tibble() %>%
    dplyr::select(playerID, nameFirst, nameLast) %>%
    unite(name, nameFirst, nameLast, sep = " ") %>%
    inner_join(career, by = 'playerID') %>%
    dplyr::select(-playerID)

# view the dataframe
career %>% head(5)
name             H     AB     average
Hank Aaron       3771  12364  0.3049984
Tommie Aaron     216   944    0.2288136
Andy Abad        2     21     0.0952381
John Abadie      11    49     0.2244898
Ed Abbaticchio   772   3044   0.2536137

Let's figure out what the prior looks like, if we restrict ourselves to players for whom we have registered more than 500 at-bats.

In [19]:
# plot a histogram of batting averages
# but only for batters with more than 500 at-bats
career %>%
    filter(AB > 500) %>%
    ggplot(aes(average)) +
    geom_histogram(aes(y = ..density..)) +
    geom_density()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

For these players, we can now fit a Beta-Binomial curve using a Maximum Likelihood Estimate method:

In [20]:
# filter for at-bats > 500
career_filtered <- career %>%
    filter(AB > 500)

# negative log-likelihood function
ll <- function(alpha, beta) {
    # Calculates the negative log-likelihood of the career_filtered dataset
    # under a beta-binomial distribution.
    # Args:
    #   alpha, beta - floats. Parameters of the beta-binomial distribution
    # Output:
    #   negative log-likelihood (mle() minimises this)
    x <- career_filtered$H
    total <- career_filtered$AB

    -sum(VGAM::dbetabinom.ab(x, total, alpha, beta, log = TRUE))
}

# calculate the maximum likelihood estimate:
m <- mle(ll, start=list(alpha=1, beta=10), method="L-BFGS-B",
         lower=c(0.0001, .1))

# extract the coefficients
ab <- coef(m)
alpha0 <- ab[1]
beta0 <- ab[2]

# print coefficients
cat(alpha0, beta0)
101.8363 288.9097

We can see that the MLE estimates are $\alpha \approx 102$ and $\beta \approx 289$. This means the average player is expected to have a hit rate of roughly $\alpha/(\alpha + \beta) \approx 0.26$. It also means the prior carries the weight of roughly $\alpha + \beta \approx 390$ pseudo at-bats: players who have batted far fewer than ~400 times are considered to have insufficient evidence to take their raw estimates at face value. On the other hand, this prior states that we ought to (mostly) believe the raw estimates of batters who have played more than 4,000 times. Using these two numbers, we can now shrink the expected hit-rates of each player:

In [57]:
# for each player, calculate the shrunken estimates:
career_eb <- career %>%
    mutate(eb_estimate=(H + alpha0)/(AB + alpha0 + beta0))

# plot the raw estimates vs the shrunken estimates for each player,
# and color each player by the total number of games played
career_eb %>%
    ggplot(aes(x=average, y=eb_estimate)) +
    geom_point(alpha=0.2, aes(colour=log10(AB))) +
    geom_abline(slope=1, color='red') +
    geom_abline(slope=0, intercept=alpha0/(alpha0 + beta0), color='blue') +
    scale_color_viridis() + 
    xlab("Raw Hit Rate") + 
    ylab("Empirical Bayes Estimates")

Notice how the players with tens of thousands of at-bats (yellow) all fall close to the red line that marks $y=x$. These players are barely shrunk at all: their evidence is so strong it overpowers the prior. On the other hand, the dark purple dots tend to lie near the blue line that marks the prior mean, $\alpha/(\alpha + \beta)$. These are players who have barely played any games, and therefore they are aggressively shrunk towards the prior mean regardless of their raw hit rate.
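To see just how aggressive this shrinkage can be, here is a quick sketch with two hypothetical players, both batting a raw 0.400, using the fitted alpha0 and beta0 from above:

# Two hypothetical players with the same raw average but very different
# amounts of evidence. The prior (~390 pseudo at-bats) dominates the first
# estimate and barely moves the second.
(4 + alpha0) / (10 + alpha0 + beta0)        # 4 hits / 10 at-bats:      ~0.264
(2000 + alpha0) / (5000 + alpha0 + beta0)   # 2000 hits / 5000 at-bats: ~0.390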

Credible Intervals

Having established our empirical prior, we can now perform computations on our data rapidly and easily. For example, one thing we can do is compute 95% credible intervals for each player. All we need to do is compute the correct percentiles of each player's specific beta posterior, and we're done! The code below implements that in a couple of lines.

In [10]:
# calculate credible intervals for the players with more than 10,000 at-bats
top_players <- career_eb %>%
    filter(AB > 10000) %>%
    # For each player, calculate the low and high percentiles using
    # their specific beta posterior, which incorporates the prior + evidence
    mutate(low = qbeta(0.025, alpha0 + H, beta0 + AB - H),
           high = qbeta(0.975, alpha0 + H, beta0 + AB - H)) %>%
    mutate(name = reorder(name, eb_estimate))

# plot the top players:
top_players %>%
    ggplot(aes(eb_estimate, name)) +
    geom_point(color='blue') +
    geom_errorbarh(aes(xmin=low, xmax=high)) +
    geom_vline(xintercept = alpha0/(alpha0 + beta0), color='red', lty=2) +
    xlab("Estimated average, with 95% credible interval") +
    ylab("Player")

Hypothesis Testing

Another thing we can do rapidly is test hypotheses. Suppose we wanted to know whether a particular player $i$ has a hit rate $p_i \geq 0.3$. We can quantify the probability that we mistakenly claim the player has a hit-rate of at least 0.3 when in reality their hit-rate is below 0.3. In other words, we can quantify the probability that we made a classification mistake. This probability is the Posterior Error Probability, PEP:

$$ \mathrm{PEP} = \int_0^{0.3} \mathrm{Beta}(p;\ \alpha + H,\ \beta + AB - H)\, \mathrm{d}p $$

The marvelous thing about the PEP is that we can use it to rapidly calculate q-values. The PEP essentially measures how likely it is that the null hypothesis is true ($H_0:~p_i < 0.3$) given the evidence and the prior put together.

Suppose now that we took the $n$ "best" players from the dataset (those with the lowest PEPs) and asked how many mistaken rejections of the null we should expect, given their PEPs. The expected number of mistaken rejections is simply the sum of the PEPs, so the expected rate of mistakes is the mean PEP of the players we selected. This is the false discovery rate (FDR) for that group: if the mean PEP is $x$, then the average number of false positives we expect in that group is $nx$. For example, if we select 100 players whose mean PEP is 0.02, we should expect roughly 2 of them to be false discoveries.

Marvelously, if we allow $n$ to vary from 1 to $K$ (the total number of players in the dataset), and we calculate the mean PEP cumulatively over the ordered set of PEPs, we end up with a list of numbers known as the q-values. The q-value of a player is the false discovery rate we would incur, on average, if we rejected the null for that player and for every player with a smaller PEP. We can therefore use q-values to choose a significance threshold that, on average, controls the number of false positives below a desired rate.
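As a toy illustration (the PEPs below are made up), the q-value of the $i$-th player in the sorted list is just the running mean of the first $i$ PEPs:

# Toy example: made-up PEPs, already sorted from most to least confident
pep <- c(0.001, 0.005, 0.02, 0.10, 0.40)
cummean(pep)   # q-values: 0.0010 0.0030 0.0087 0.0315 0.1052

Rejecting the null for the first three players would then be expected to produce about $3 \times 0.0087 \approx 0.03$ false positives.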

The code below implements the PEP first, then the q-value computation.

In [59]:
# calculate the PEP for all players using the beta CDF
career_eb <- career_eb %>%
    mutate(PEP=pbeta(0.3, H + alpha0, AB - H + beta0))

career_eb %>%
    head(5)
name             H     AB     average    eb_estimate  PEP
Hank Aaron       3771  12364  0.3049984  0.3036388    0.1858700
Tommie Aaron     216   944    0.2288136  0.2381249    0.9999998
Andy Abad        2     21     0.0952381  0.2521853    0.9849976
John Abadie      11    49     0.2244898  0.2565942    0.9789573
Ed Abbaticchio   772   3044   0.2536137  0.2544107    1.0000000
In [67]:
# calculate the q-value of the i'th player (players sorted from lowest to
# highest PEP) as the cumulative mean of the PEPs of players 1..i
career_eb <- career_eb %>%
    arrange(PEP) %>%
    mutate(qvalue = cummean(PEP))

# plot the q-values versus the bayesian estimates
career_eb %>%
    ggplot(aes(x=eb_estimate, y=-log10(qvalue))) + 
    geom_point(alpha=0.8, aes(colour=log10(AB))) + 
    scale_color_viridis() +
    xlab('Empirical Bayes Estimate')
In [63]:
career_eb %>%
    ggplot(aes(x=qvalue)) +
    geom_histogram(aes(y=..density..))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

There! We can now set a q-value threshold and control the FDR at that threshold (on average) for our dataset. Let's ask how many players we can claim have a batting rate greater than 0.3 at an FDR threshold of 0.01 (i.e. we expect about 1% of those claims to be false discoveries):

In [66]:
nrow(career_eb %>%
    filter(qvalue < 0.01))
61

This concludes the first post on Baseball Statistics. I've found David Robinson's book eminently readable and an excellent introduction to the subject. I have a few ideas about places where Empirical Bayes could be extremely powerful, but I also have a few outstanding questions:

  1. Empirical Bayes works well when large amounts of data are available. How large is large? How small is too small? Some soft bounds would be nice as a guide (I will explore this through simulation in another post)
  2. Empirical Bayes seems very powerful, but a major drawback is that we can't easily get an idea about uncertainty in the hyperparameters, and how that uncertainty propagates into/from the parameters. Are there diagnostic tools to tell us when Empirical Bayes may be making a mistake, and we should opt for a full-fledged MCMC approach?

All in all, I have had fun learning about Empirical Bayes. The most important take-home for me is that Empirical Bayes can be a good first approximation to hierarchical models. Whereas a hierarchical Bayesian model treats the hyperparameters as a distribution, Empirical Bayes fixes them at their most likely values inferred from the data. The two approaches will be in good agreement if the hyperparameter distribution is sharply peaked around its most likely value, but they can diverge otherwise. Of course, the reason to use full hierarchical models is often that the distributions are NOT sharply peaked, so the purist will say that a full hierarchical treatment is always required. However, this disregards the main strength of the empirical approach: it can be implemented extremely quickly.