A conceptual introduction to Hamiltonian Monte Carlo

Bayesian workshop - STEP 2023

Scott James Perry

University of Alberta

By the end of this lesson, you will be able to…

  1. Describe how we approximate the posterior distribution using HMC
  2. Modify aspects of the algorithm to solve common problems
  3. Process the fitted model to answer research questions and visualize inferences

Historical background

  • Started with work at Los Alamos, NM during WWII
  • Resulted in Metropolis algorithm
  • Metropolis-Hastings followed

Markov Chain Monte Carlo

  • A Markov chain (or process) transitions between different states
  • Monte Carlo: we run simulations to obtain probabilities (tiny example below)
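
To make the Monte Carlo idea concrete, here is a tiny base-R illustration (nothing Bayesian yet, and the variable name is just for the example): estimate a probability by simulating many draws and counting how often the event happens.

sims <- rnorm(100000)   # simulate draws from a standard normal
mean(sims > 1)          # Monte Carlo estimate of P(X > 1), roughly 0.159
                        # exact value: pnorm(1, lower.tail = FALSE)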

We sample from a distribution proportional to the posterior

What does MCMC have to do with Bayes? Nothing… and everything…

  1. Really simple problems can be solved by hand
  2. Simple models can use the quadratic approximation
  3. Most models in psycholinguistics can't use either… therefore MCMC

The parameter space


  • This could be the parameter space for a simple model
  • More complicated models are harder to visualize
  • The high ground in the center shows the most probable values

The parameter space as an invisible skatepark

Your job:

  • Map the shape of the skate park, but it’s invisible and you can’t touch it

The parameter space as an invisible skatepark

You have:

  • A magic baseball that freezes in place on command
  • A magic marker that can draw on the skate park

Hamiltonian Monte Carlo is complicated, but works really well

  • Hamiltonian Monte Carlo flicks a particle through the ‘parameter space’
  • This particle will spend more time in ‘deeper’ regions
  • This involves predicting where the particle will go (see the sketch below)
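
To make the flicking metaphor a bit more concrete, here is a minimal, purely illustrative sketch of one HMC transition for a one-dimensional standard normal "parameter space", in base R. The function names, step size, and number of leapfrog steps are made up for illustration; Stan's real implementation tunes these automatically and is far more sophisticated.

log_post <- function(q) -0.5 * q^2   # log posterior of a standard normal (up to a constant)
grad_log_post <- function(q) -q      # gradient of the log posterior

hmc_step <- function(q, step_size = 0.1, n_steps = 20) {
  p <- rnorm(1)          # the "flick": give the particle a random momentum
  q_new <- q
  p_new <- p
  # leapfrog integration: predict where the particle will roll
  for (s in seq_len(n_steps)) {
    p_new <- p_new + 0.5 * step_size * grad_log_post(q_new)
    q_new <- q_new + step_size * p_new
    p_new <- p_new + 0.5 * step_size * grad_log_post(q_new)
  }
  # accept or reject, depending on how well the predicted path conserved energy
  energy_old <- -log_post(q)     + 0.5 * p^2
  energy_new <- -log_post(q_new) + 0.5 * p_new^2
  if (log(runif(1)) < energy_old - energy_new) q_new else q
}

# each iteration is one flick; deeper regions of the skatepark collect more samples
draws <- numeric(1000)
q <- 0
for (i in seq_along(draws)) {
  q <- hmc_step(q)
  draws[i] <- q
}
hist(draws)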

A fantastic visualization of HMC

https://chi-feng.github.io/mcmc-demo/app.html

Some HMC arguments in function brm()


  • chains
  • iter (iterations)
  • warmup


model_fit <- brm(model_formula,
                 data = df,
                 family = skew_normal(),
                 prior = my_awesome_priors,
                 cores = 4,                   # run the chains in parallel on 4 cores
                 chains = 4,                  # 4 independent runs of the algorithm
                 iter = 2000,                 # total iterations per chain
                 warmup = 1000,               # iterations discarded as warmup
                 file = "my_great_model.rds") # cache the fitted model on disk

Chains are independent runs of the algorithm

Let’s say you have a difficult math problem:

  • A grad student from the Math dept. can help you
  • They sit down for a few minutes, then give you an answer



How confident are you going to be in that answer?

Chains are independent runs of the algorithm

Same problem:

  • Now you have four graduate students from the Math dept.
  • You consult them all independently, and they all give you the same answer



Would you be more confident in the answer now?

Iterations and the warmup phase

Iterations:

  • The number of times we flick the particle
  • Each one is a single sample from the posterior

Warmup:

  • The algorithm is finding optimal transformations/settings for your model and data
  • These samples are thrown out afterwards
  • The default is half the total value of iter (quick check below)
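
A quick bookkeeping check using the settings from the brm() call shown earlier; this is plain arithmetic, not a brms function:

chains <- 4
iter   <- 2000
warmup <- 1000
chains * (iter - warmup)   # post-warmup draws kept across all chains
[1] 4000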

Some parameter spaces are weird…

But HMC lets you know when things get rough (sometimes)

  1. Rhat
  2. Max_treedepth
  3. Divergent transitions
  4. Low ESS

\(\hat{R}\) complains if different chains are doing different things

We want:

  • All \(\hat{R}\) (Rhat) values to be 1.00 or 1.01
  • Trace plots where the chains overlap like fuzzy caterpillars (see the check below)
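
Two quick ways to check this for a fitted model (model_fit is the object from the brm() call earlier; both are standard brms functions):

summary(model_fit)   # the Rhat column should show 1.00 or 1.01 for every parameter
plot(model_fit)      # trace and density plots: chains should overlap like fuzzy caterpillars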

What are divergent transitions?

  • HMC predicts the particle's path
  • A big difference between the prediction and reality is called a divergent transition
  • Increasing adapt_delta slows down the algorithm but makes it take smaller, more careful steps
  • Persistent divergent transitions indicate problems with the model



model_fit <- brm(...,
                 # higher target acceptance rate -> smaller steps, fewer divergences
                 control = list(adapt_delta = 0.99),
                 ...)

Treedepth and low effective sample size

Exceeding max treedepth:

  • An efficiency concern
  • Can indicate model misspecification

Low ESS:

  • Complicated models sometimes need more iterations (see the sketch below)
  • Happens when samples are highly correlated with previous samples
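
A hedged sketch of the usual knobs to turn, reusing the earlier brm() call; the specific values are only examples, not recommendations for every model:

model_fit <- brm(...,
                 control = list(max_treedepth = 15),   # raise the treedepth limit (Stan's default is 10)
                 iter = 4000,                          # more iterations to boost ESS
                 warmup = 2000,
                 ...)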

Stan (and brms) are pretty good about helping you out


Stan has excellent documentation, and warnings often direct you to helpful pages

So we fit a model

After throwing out the warmup draws, we are left with thousands of samples from the posterior distribution

What do we do with them?

Statistical inference as descriptive statistics

Because we have samples from the posterior, making inferences is like doing descriptive statistics:

  • take the mean/median/mode
  • calculate the % of samples below or above a threshold
  • interpret these descriptions as probabilities

Getting draws from model using as_draws_df()

  • We can get a dataframe of posterior samples (draws) for each parameter with this function
  • The tidybayes package offers tools for processing models in tidyverse-style code
model_draws <- as_draws_df(model_fit)  # one row per posterior draw
head(model_draws)
# A draws_df: 6 iterations, 1 chains, and 5 variables
  b_Intercept b_scaled_Freq sigma lprior  lp__
1         691           -82   235   -6.3 -1367
2         707           -83   220   -6.3 -1367
3         698           -92   263   -6.5 -1372
4         681           -62   207   -6.2 -1369
5         724           -80   211   -6.2 -1370
6         676           -57   243   -6.4 -1369
# ... hidden reserved variables {'.chain', '.iteration', '.draw'}

Getting draws from model using as_draws_df()

For example, the proportion of draws below -50 estimates the posterior probability that the effect of scaled frequency is more negative than -50:
nrow(model_draws[model_draws$b_scaled_Freq < -50,])/nrow(model_draws)
[1] 0.946
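
A couple more descriptive summaries of the same draws, sticking to base R; the 2.5% and 97.5% quantiles give a 95% credible interval:

mean(model_draws$b_scaled_Freq)                               # posterior mean of the slope
quantile(model_draws$b_scaled_Freq, probs = c(0.025, 0.975))  # 95% credible interval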

Plotting using package ggdist


This package adds functionality to ggplot2 for plotting Bayesian models

library(ggplot2)
library(ggdist)

# half-eye plot: posterior density plus a point estimate and intervals
ggplot(model_draws, aes(x = b_scaled_Freq)) +
  stat_halfeye() +
  theme_bw()

Let’s head back into R



We are going to practice handling and visualizing posterior draws in script S2_E2_processing_draws.R