A conceptual introduction to Hamiltonian Monte Carlo

Bayesian workshop - STEP 2023

Scott James Perry

University of Alberta

By the end of this lesson, you will be able to…

  1. Describe how we approximate the posterior distribution using HMC
  2. Modify aspects of the algorithm to solve common problems
  3. Process the fitted model to answer research questions and visualize inferences

Historical background

  • Started with work at Los Alamos, NM during WWII
  • Resulted in Metropolis algorithm
  • Metropolis-Hastings followed

Markov Chain Monte Carlo

  • A Markov chain (or process) transitions between different states
  • Monte Carlo: we run simulations to obtain probabilities (tiny example below)
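
To make the Monte Carlo idea concrete, here is a tiny base-R illustration (nothing Bayesian yet, and the variable name is just for the example): estimate a probability by simulating many draws and counting how often the event happens.

sims <- rnorm(100000)   # simulate draws from a standard normal
mean(sims > 1)          # Monte Carlo estimate of P(X > 1), roughly 0.159
                        # exact value: pnorm(1, lower.tail = FALSE)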

We sample from a distribution proportional to the posterior

What does MCMC have to do with Bayes? Nothing… and everything…

  1. Really simple problems can be solved by hand
  2. Simple models can use the quadratic approximation
  3. Most models in psycholinguistics can't use either… therefore MCMC

The parameter space


  • This could be the parameter space for a simple model
  • More complicated models are harder to visualize
  • The high ground in the center shows the most probable values

The parameter space as an invisible skatepark

Your job:

  • Map the shape of the skate park, but it’s invisible and you can’t touch it

The parameter space as an invisible skatepark

You have:

  • A magic baseball that freezes in place on command
  • A magic marker that can draw on the skate park

Hamiltonian Monte Carlo is complicated, but works really well

  • Hamiltonian Monte Carlo flicks a particle through the ‘parameter space’
  • This particle will spend more time in ‘deeper’ regions
  • This involves predicting where the particle will go (see the sketch below)
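
To make the flicking metaphor a bit more concrete, here is a minimal, purely illustrative sketch of one HMC transition for a one-dimensional standard normal "parameter space", in base R. The function names, step size, and number of leapfrog steps are made up for illustration; Stan's real implementation tunes these automatically and is far more sophisticated.

log_post <- function(q) -0.5 * q^2   # log posterior of a standard normal (up to a constant)
grad_log_post <- function(q) -q      # gradient of the log posterior

hmc_step <- function(q, step_size = 0.1, n_steps = 20) {
  p <- rnorm(1)          # the "flick": give the particle a random momentum
  q_new <- q
  p_new <- p
  # leapfrog integration: predict where the particle will roll
  for (s in seq_len(n_steps)) {
    p_new <- p_new + 0.5 * step_size * grad_log_post(q_new)
    q_new <- q_new + step_size * p_new
    p_new <- p_new + 0.5 * step_size * grad_log_post(q_new)
  }
  # accept or reject, depending on how well the predicted path conserved energy
  energy_old <- -log_post(q)     + 0.5 * p^2
  energy_new <- -log_post(q_new) + 0.5 * p_new^2
  if (log(runif(1)) < energy_old - energy_new) q_new else q
}

# each iteration is one flick; deeper regions of the skatepark collect more samples
draws <- numeric(1000)
q <- 0
for (i in seq_along(draws)) {
  q <- hmc_step(q)
  draws[i] <- q
}
hist(draws)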

A fantastic visualization of HMC

https://chi-feng.github.io/mcmc-demo/app.html

Some HMC arguments in function brm()


  • chains
  • iter (iterations)
  • warmup


model_fit <- brm(model_formula,
                 data = df,
                 family = skew_normal(),
                 prior = my_awesome_priors,
                 cores = 4,                   # run the chains in parallel on 4 cores
                 chains = 4,                  # 4 independent runs of the algorithm
                 iter = 2000,                 # total iterations per chain
                 warmup = 1000,               # iterations discarded as warmup
                 file = "my_great_model.rds") # cache the fitted model on disk

Chains are independent runs of the algorithm

Let’s say you have a difficult math problem:

  • A grad student from the Math dept. can help you
  • They sit down for a few minutes, then give you an answer



How confident are you going to be in that answer?

Chains are independent runs of the algorithm

Same problem:

  • Now you have four graduate students from the Math dept.
  • You consult them all independently, and they all give you the same answer



Would you be more confident in the answer now?

Iterations and the warmup phase

Iterations:

  • The number of times we flick the particle
  • Each one is a single sample from the posterior

Warmup:

  • The algorithm is finding optimal transformations/settings for your model and data
  • These samples are thrown out afterwards
  • The default is half the total value of iter (quick check below)
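
A quick bookkeeping check using the settings from the brm() call shown earlier; this is plain arithmetic, not a brms function:

chains <- 4
iter   <- 2000
warmup <- 1000
chains * (iter - warmup)   # post-warmup draws kept across all chains
[1] 4000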

Some parameter spaces are weird…

But HMC lets you know when things get rough (sometimes)

  1. Rhat
  2. Max_treedepth
  3. Divergent transitions
  4. Low ESS

\(\hat{R}\) complains if different chains are doing different things

We want:

  • All \(\hat{R}\) (Rhat) values to be 1.00 or 1.01
  • Trace plots where the chains overlap like fuzzy caterpillars (see the check below)
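
Two quick ways to check this for a fitted model (model_fit is the object from the brm() call earlier; both are standard brms functions):

summary(model_fit)   # the Rhat column should show 1.00 or 1.01 for every parameter
plot(model_fit)      # trace and density plots: chains should overlap like fuzzy caterpillars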

What are divergent transitions?

  • HMC predicts the particle's path
  • A big difference between the prediction and reality is called a divergent transition
  • Increasing adapt_delta slows down the algorithm but makes it take smaller, more careful steps
  • Persistent divergent transitions indicate problems with the model



model_fit <- brm(...,
                 # higher target acceptance rate -> smaller steps, fewer divergences
                 control = list(adapt_delta = 0.99),
                 ...)

Treedepth and low effective sample size

Exceeding max treedepth:

  • An efficiency concern
  • Can indicate model misspecification

Low ESS:

  • Complicated models sometimes need more iterations (see the sketch below)
  • Happens when samples are highly correlated with previous samples
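
A hedged sketch of the usual knobs to turn, reusing the earlier brm() call; the specific values are only examples, not recommendations for every model:

model_fit <- brm(...,
                 control = list(max_treedepth = 15),   # raise the treedepth limit (Stan's default is 10)
                 iter = 4000,                          # more iterations to boost ESS
                 warmup = 2000,
                 ...)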

Stan (and brms) are pretty good about helping you out


Stan has excellent documentation, and warnings often direct you to helpful pages

So we fit a model

After throwing out the warmup draws, we are left with thousands of samples from the posterior distribution

What do we do with them?

Statistical inference as descriptive statistics

Because we have samples from the posterior, making inferences is like doing descriptive statistics:

  • take the mean/median/mode
  • calculate the % of samples below or above a threshold
  • interpret these descriptions as probabilities

Getting draws from model using as_draws_df()

  • We can get a dataframe of posterior samples (draws) for each parameter with this function
  • The tidybayes package offers tools for processing models in tidyverse-style code
model_draws <- as_draws_df(model_fit)  # one row per posterior draw
head(model_draws)
# A draws_df: 6 iterations, 1 chains, and 5 variables
  b_Intercept b_scaled_Freq sigma lprior  lp__
1         691           -82   235   -6.3 -1367
2         707           -83   220   -6.3 -1367
3         698           -92   263   -6.5 -1372
4         681           -62   207   -6.2 -1369
5         724           -80   211   -6.2 -1370
6         676           -57   243   -6.4 -1369
# ... hidden reserved variables {'.chain', '.iteration', '.draw'}

Getting draws from model using as_draws_df()

For example, the proportion of draws below -50 estimates the posterior probability that the effect of scaled frequency is more negative than -50:
nrow(model_draws[model_draws$b_scaled_Freq < -50,])/nrow(model_draws)
[1] 0.946
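
A couple more descriptive summaries of the same draws, sticking to base R; the 2.5% and 97.5% quantiles give a 95% credible interval:

mean(model_draws$b_scaled_Freq)                               # posterior mean of the slope
quantile(model_draws$b_scaled_Freq, probs = c(0.025, 0.975))  # 95% credible interval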

Plotting using package ggdist


This package adds functionality to ggplot2 for plotting Bayesian models

library(ggplot2)
library(ggdist)

# half-eye plot: posterior density plus a point estimate and intervals
ggplot(model_draws, aes(x = b_scaled_Freq)) +
  stat_halfeye() +
  theme_bw()

Let’s head back into R



We are going to practice handling and visualizing posterior draws in script S2_E2_processing_draws.R