Getting categorical variables into our models: an intro to contrast coding

Bayesian workshop - STEP 2023

Scott James Perry

University of Alberta

By the end of this lesson, you will be able to…

  1. Include categorical variables in linear models in three distinct ways
  2. Correctly interpret these variables by themselves and inside interactions

The problem: categories aren’t numbers and regression needs numbers

We’re going to talk about three ways to do this:

  • dummy coding
  • index variables
  • sum coding

Wait! This doesn’t have anything to do with Bayesian modelling…

In base R, different coding schemes produce equivalent models that communicate their results differently

In Bayesian models, similar priors can interact with the contrast coding to change the resulting posterior distributions

The default in R: dummy coding

One fewer coefficient than the number of factor levels (k - 1):

  • The first level is represented by the Intercept
  • Each other level is represented by its difference from the Intercept
  • E.g., let’s say we have three levels: A, B, C

  \(y \sim Intercept + \beta_1 B + \beta_2 C\)

Row #   B   C
1       0   0
2       0   0
3       1   0
4       1   0
5       0   1
6       0   1
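The indicator columns above can be reproduced in base R. This is a minimal sketch, assuming a hypothetical six-row factor called `group` with levels A, B and C; the name and the data are illustrative and not taken from the workshop script.

```r
# Hypothetical factor: three levels, two rows per level
group <- factor(c("A", "A", "B", "B", "C", "C"))

# R's default for unordered factors is treatment (dummy) coding
contrasts(group)

# Design matrix: an intercept column plus k - 1 = 2 indicator columns
X_dummy <- model.matrix(~ group)
X_dummy
```

The `groupB` and `groupC` columns match the B and C columns in the table; level A has no column of its own because it is absorbed into the intercept.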

Kicking the reference category to the curb: index variables

We can also just estimate the mean value for each category:

  • Separate means estimated for each category
  • Differences between them calculated after model fitting

       \(y \sim \alpha_i\)

Row #   A   B   C
1       1   0   0
2       1   0   0
3       0   1   0
4       0   1   0
5       0   0   1
6       0   0   1
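A short base R sketch of the same idea, assuming a hypothetical factor `group` and made-up outcome values `y` (neither comes from the workshop data). Dropping the intercept with `0 +` gives one indicator column per level, and the fitted coefficients are the category means.

```r
# Hypothetical data: two observations per level, values are made up
group <- factor(c("A", "A", "B", "B", "C", "C"))
y     <- c(1, 3, 4, 6, 8, 10)

# No intercept: one indicator column per level (A, B, C)
X_index <- model.matrix(~ 0 + group)
X_index

# Each coefficient is the mean of its own category
fit_index <- lm(y ~ 0 + group)
means <- coef(fit_index)
means

# Differences are computed after fitting, e.g. B minus A
diff_B_A <- unname(means["groupB"] - means["groupA"])
diff_B_A
```

In a Bayesian fit the same subtraction would be applied to the posterior draws rather than to point estimates.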

Sum-coding sums to zero

  • In sum coding, we code our categories so that the codes add up to zero
  • This means the Intercept is the mean of the dependent variable averaged across all categories (the mean of the category means)

Group_sumcoded   Group
-0.5             L1
-0.5             L1
-0.5             L1
 0.5             L2
 0.5             L2
 0.5             L2
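The table above can be built by hand and checked against the group means. This is a minimal sketch with made-up `rt` values; only the column names `Group` and `Group_sumcoded` come from the slide.

```r
# Hypothetical data: three observations per group, rt values are made up
d <- data.frame(
  Group = factor(rep(c("L1", "L2"), each = 3)),
  rt    = c(400, 420, 440, 500, 520, 540)
)

# Manual sum coding: -0.5 for L1 and +0.5 for L2, so the codes sum to zero
d$Group_sumcoded <- ifelse(d$Group == "L1", -0.5, 0.5)

fit_sum <- lm(rt ~ Group_sumcoded, data = d)
coef(fit_sum)

# Intercept = average of the two group means; slope = L2 mean minus L1 mean
group_means <- tapply(d$rt, d$Group, mean)
grand_mean  <- mean(group_means)
difference  <- unname(group_means["L2"] - group_means["L1"])
```

Note that R's built-in `contr.sum` uses +1 and -1 rather than +0.5 and -0.5, so with it the slope would be half the difference between the group means, with its sign set by the level order.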

Different codings in a model predicting rt by group

Representation wanted:

  1. Dummy coding in R
  2. Index variable in R
  3. Sum coding (also changing contrasts)

Formula we enter:

  1. rt ~ 0 + Intercept + group_factor
  2. rt ~ 0 + group_factor
  3. rt ~ 0 + Intercept + group_sumcoded
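The `0 + Intercept` term in these formulas is not base R syntax. It resembles the brms convention for an explicitly named intercept, but that reading is an assumption, since the slides do not name the package. The sketch below uses base R `lm()` analogues instead, with made-up data, to show that the three codings give different coefficients but the same fitted values.

```r
# Hypothetical data: made-up reaction times, two groups of three
d <- data.frame(
  group_factor = factor(rep(c("L1", "L2"), each = 3)),
  rt           = c(400, 420, 440, 500, 520, 540)
)
d$group_sumcoded <- ifelse(d$group_factor == "L1", -0.5, 0.5)

# Base R analogues of the three formulas (lm() adds the intercept itself)
fit_dummy <- lm(rt ~ group_factor, data = d)      # 1. dummy coding
fit_index <- lm(rt ~ 0 + group_factor, data = d)  # 2. index variable
fit_sum   <- lm(rt ~ group_sumcoded, data = d)    # 3. sum coding

# The coefficients differ across codings...
coef(fit_dummy)
coef(fit_index)
coef(fit_sum)

# ...but the fitted values are the same
same_1_2 <- isTRUE(all.equal(fitted(fit_dummy), fitted(fit_index)))
same_1_3 <- isTRUE(all.equal(fitted(fit_dummy), fitted(fit_sum)))
```

With least squares the choice of coding only changes what each coefficient means. Once priors are placed on those coefficients, the choice can also change the posterior.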

Equations for our different options

This is the dummy coding version:

\(rt \sim Normal(\mu, \sigma)\)
\(\mu = Intercept + \beta_1 group\_factor_{L2}\)

This is the index coding version:

\(rt \sim Normal(\mu, \sigma)\)
\(\mu = group\_factor_i\)

This is the sum coding version:

\(rt \sim Normal(\mu, \sigma)\)
\(\mu = Intercept + \beta_1 group\_sumcoded\)
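To connect the three versions, write \(m_1\) and \(m_2\) for the means of groups L1 and L2 (symbols introduced here for illustration), take L1 as the reference level, and assume the -0.5 / 0.5 sum coding shown earlier. Setting \(\mu\) equal to the group mean in each group gives:

\(\text{Dummy: } Intercept = m_1, \quad \beta_1 = m_2 - m_1\)
\(\text{Index: } group\_factor_{L1} = m_1, \quad group\_factor_{L2} = m_2\)
\(\text{Sum: } Intercept = \frac{m_1 + m_2}{2}, \quad \beta_1 = m_2 - m_1\)

For the sum coding version this follows from \(Intercept - 0.5\,\beta_1 = m_1\) and \(Intercept + 0.5\,\beta_1 = m_2\). The slope is the same in the dummy and sum versions, while the Intercept means something different in each.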

Now we are going to practice fitting and interpreting categorical effects



Let’s open up the script S3_E1_comparing_contrasts.R