Getting categorical variables into our models: an intro to contrast coding

Bayesian workshop - STEP 2023

Scott James Perry

University of Alberta

By the end of this lesson, you will be able to…

Include categorical variables in linear models in three distinct ways
Correctly interpret these variables by themselves and inside interactions

The problem: categories aren’t numbers and regression needs numbers

We’re going to talk about three ways to do this:

dummy coding
index variables
sum coding

Wait! This doesn’t have anything to do with Bayesian modelling…

In base R, different coding schemes produce identical models that communicate their results differently

In Bayesian models, similar priors can interact with the contrast coding and change the resulting posterior distributions

The default in R: dummy coding

One fewer coefficient than the number of factor levels (k-1):

The first level is represented by the Intercept
Each other level is represented as its difference from the Intercept
E.g., let’s say we have three levels: A, B, C

\(y \sim Intercept + \beta_1 B + \beta_2 C\)

Row #	B	C
1	0	0
2	0	0
3	1	0
4	1	0
5	0	1
6	0	1
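
As a minimal sketch of the table above, base R's `model.matrix()` shows the dummy (treatment) coding that R applies to a factor by default. The six-row `group` factor is made up to match the table; the column names `groupB` and `groupC` are simply what R generates from it.

```r
# Hypothetical six-row factor matching the table above (levels A, B, C)
group <- factor(c("A", "A", "B", "B", "C", "C"))

# Default treatment (dummy) contrasts: an intercept plus k - 1 = 2 indicator columns
X_dummy <- model.matrix(~ group)
print(X_dummy)

# The coding attached to the factor itself: A is the reference level (all zeros)
print(contrasts(group))
```

The `groupB` and `groupC` columns reproduce the B and C columns of the table; level A has no column of its own because it is absorbed into the Intercept.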

Kicking the reference category to the curb: index variables

We can also just estimate the mean value for each category:

Separate means estimated for each category
Differences between them calculated after model fitting

\(y \sim \alpha_i\)

Row #	A	B	C
1	1	0	0
2	1	0	0
3	0	1	0
4	0	1	0
5	0	0	1
6	0	0	1
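
A minimal sketch of the index-variable representation in base R, assuming the same made-up six-row factor as the table above. The outcome values in `y` are invented purely so that the coefficients are easy to read as group means.

```r
# Hypothetical six-row factor matching the table above (levels A, B, C)
group <- factor(c("A", "A", "B", "B", "C", "C"))

# Removing the intercept with `0 +` gives one indicator column per level
X_index <- model.matrix(~ 0 + group)
print(X_index)

# Made-up outcome values, only to show that each coefficient is a group mean
y <- c(10, 12, 20, 22, 30, 32)
fit_index <- lm(y ~ 0 + group)
print(coef(fit_index))

# Differences between categories are computed after fitting
diff_B_A <- coef(fit_index)["groupB"] - coef(fit_index)["groupA"]
print(diff_B_A)
```

With no reference level, no category is treated as special; any comparison we care about is built from the estimated means afterwards.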

Sum-coding sums to zero

In sum coding, we code our categories so that the codes add up to zero
This means the Intercept is the mean of the dependent variable averaged across all categories

Group_sumcoded	Group
-0.5	L1
-0.5	L1
-0.5	L1
0.5	L2
0.5	L2
0.5	L2
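
A minimal sketch of the sum coding in the table above, assuming a balanced two-level factor. The reaction times in `rt` are invented for illustration; `group_sumcoded` is built by hand with `ifelse()` rather than through R's contrast functions.

```r
# Hypothetical two-level factor matching the table above
group <- factor(c("L1", "L1", "L1", "L2", "L2", "L2"))

# Manual sum coding: -0.5 for L1, +0.5 for L2
group_sumcoded <- ifelse(group == "L1", -0.5, 0.5)
print(sum(group_sumcoded))  # the codes cancel out in this balanced example

# Made-up reaction times, for illustration only
rt <- c(400, 420, 440, 500, 520, 540)
fit_sum <- lm(rt ~ group_sumcoded)
print(coef(fit_sum))

# Compare the coefficients with the two group means
group_means <- tapply(rt, group, mean)
print(mean(group_means))                      # matches the Intercept
print(group_means["L2"] - group_means["L1"])  # matches the slope
```

Because the codes are half a unit either side of zero, the slope is the full L2 minus L1 difference and the Intercept sits halfway between the two group means.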

Different coding in model predicting `rt` by `group`

Representation wanted	Formula we enter
Dummy coding in R	rt ~ 0 + Intercept + group_factor
Index variable in R	rt ~ 0 + group_factor
Sum coding (also changing contrasts)	rt ~ 0 + Intercept + group_sumcoded
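
The `0 + Intercept` term in these formulas is not base R syntax (it is the form used in brms, which we assume is the package behind them). As a rough base R analogue, the sketch below fits the three versions with `lm()`, where the intercept is implicit. The data are simulated and the group means of 450 and 500 are arbitrary.

```r
set.seed(2023)  # arbitrary seed, for reproducibility only

# Simulated data (made up for illustration): two groups of 50 reaction times
dat <- data.frame(group_factor = factor(rep(c("L1", "L2"), each = 50)))
dat$rt <- rnorm(100, mean = ifelse(dat$group_factor == "L1", 450, 500), sd = 40)
dat$group_sumcoded <- ifelse(dat$group_factor == "L1", -0.5, 0.5)

fit_dummy <- lm(rt ~ group_factor, data = dat)      # Intercept + difference
fit_index <- lm(rt ~ 0 + group_factor, data = dat)  # one mean per group
fit_sum   <- lm(rt ~ group_sumcoded, data = dat)    # grand mean + difference

# Different coefficients...
print(coef(fit_dummy))
print(coef(fit_index))
print(coef(fit_sum))

# ...but the same fitted values, so the same model
print(all.equal(fitted(fit_dummy), fitted(fit_index)))
print(all.equal(fitted(fit_dummy), fitted(fit_sum)))
```

Without priors, the three codings only change which quantities the coefficients report, not the predictions.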

Equations for our different options

This is the dummy coding version:

\(rt \sim Normal(\mu, \sigma)\)
\(\mu = Intercept + \beta_1 group\_factor_{L2}\)

This is the index variable version:

\(rt \sim Normal(\mu, \sigma)\)
\(\mu = group\_factor_i\)

This is the sum coding version:

\(rt \sim Normal(\mu, \sigma)\)
\(\mu = Intercept + \beta_1 group\_sumcoded\)
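
To see why the coding matters once priors are involved, the sketch below simulates from priors and checks what they imply for each group mean under the three sets of equations. The Normal(0, 1) priors are an assumption chosen purely for illustration; they are not the priors used in the workshop.

```r
set.seed(1)  # arbitrary seed, for reproducibility only
n_draws <- 1e5

# Assumed priors, for illustration only: Normal(0, 1) on every coefficient
intercept <- rnorm(n_draws, 0, 1)
beta_1    <- rnorm(n_draws, 0, 1)

# Dummy coding: mu_L1 = Intercept, mu_L2 = Intercept + beta_1
sd_dummy <- c(L1 = sd(intercept), L2 = sd(intercept + beta_1))

# Index variables: each group mean gets its own Normal(0, 1) prior
sd_index <- c(L1 = sd(rnorm(n_draws, 0, 1)), L2 = sd(rnorm(n_draws, 0, 1)))

# Sum coding (-0.5 / +0.5): mu = Intercept -/+ 0.5 * beta_1
sd_sum <- c(L1 = sd(intercept - 0.5 * beta_1), L2 = sd(intercept + 0.5 * beta_1))

# Implied prior standard deviation of each group mean, by coding scheme
print(round(rbind(dummy = sd_dummy, index = sd_index, sum = sd_sum), 2))
```

Under dummy coding, the same priors leave L2 more uncertain a priori than L1 (a prior standard deviation of about 1.41 against 1), whereas index variables and sum coding treat the two groups symmetrically.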

Now we are going to practice fitting and interpreting categorical effects

Let’s open up the script S3_E1_comparing_contrasts.R
