Getting categorical variables into our models: an intro to contrast coding
Bayesian workshop - STEP 2023
Scott James Perry
University of Alberta
By the end of this lesson, you will be able to…
- Include categorical variables in linear models in three distinct ways
- Correctly interpret these variables by themselves and inside interactions
The problem: categories aren’t numbers and regression needs numbers
We’re going to talk about three ways to do this:
- dummy coding
- index variables
- sum coding
Wait! This doesn’t have anything to do with Bayesian modelling…
In base R, different coding schemes produce identical model fits; only what the coefficients communicate differs
In Bayesian models, similar priors can interact with the contrast coding to change the resulting posterior distributions
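A minimal sketch of the first point, using base R `lm()` on made-up data (the data frame `d`, its group means, and the object names are assumptions for illustration, not part of the workshop materials). The coefficients change with the coding, but the fitted values do not:

```r
# Simulated data: three groups with different true means (values are invented)
set.seed(1)
d <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 20)),
  y = rnorm(60, mean = rep(c(10, 12, 15), each = 20), sd = 1)
)

# Fit 1: dummy (treatment) coding, R's default for unordered factors
fit_dummy <- lm(y ~ group, data = d)

# Fit 2: the same model, but with sum-to-zero contrasts
fit_sum <- lm(y ~ group, data = d, contrasts = list(group = "contr.sum"))

# The coefficients mean different things under each coding...
coef(fit_dummy)
coef(fit_sum)

# ...but both fits make the same predictions
same_fit <- all.equal(unname(fitted(fit_dummy)), unname(fitted(fit_sum)))
same_fit
```

Once priors are placed on those coefficients, the two codings are no longer guaranteed to be equivalent, because the priors apply to parameters with different meanings.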
The default in R: dummy coding
One fewer coefficient than the number of factor levels (k-1):
- The first level is represented by the intercept
- Each other level is coded as its difference from the intercept
- E.g., let’s say we have three levels: A, B, C
\(y \sim Intercept + \beta_1 B + \beta_2 C\)
| Row | B | C |
|---|---|---|
| 1 | 0 | 0 |
| 2 | 0 | 0 |
| 3 | 1 | 0 |
| 4 | 1 | 0 |
| 5 | 0 | 1 |
| 6 | 0 | 1 |
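The dummy coding above can be reproduced in base R with `model.matrix()`. This is a sketch with a hypothetical six-row factor `f` (two observations per level); the contrast is set explicitly so the result does not depend on session options:

```r
# Hypothetical factor with three levels, two observations each
f <- factor(c("A", "A", "B", "B", "C", "C"))

# Design matrix under dummy (treatment) coding:
# an intercept column plus one 0/1 column for each non-reference level
X_dummy <- model.matrix(~ f, contrasts.arg = list(f = "contr.treatment"))
X_dummy
```

Level A gets no column of its own: rows where both `fB` and `fC` are 0 are described by the intercept alone.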
Kicking the reference category to the curb: index variables
We can also just estimate the mean value for each category:
- Separate means estimated for each category
- Differences between them calculated after model fitting
\(y \sim \alpha_i\)
| Row | A | B | C |
|---|---|---|---|
| 1 | 1 | 0 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 |
| 5 | 0 | 0 | 1 |
| 6 | 0 | 0 | 1 |
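A sketch of the index-variable approach in base R: removing the intercept with `0 +` gives every level its own indicator column, so each coefficient is a category mean. The factor `f` and the outcome values `y` are invented for illustration:

```r
# Hypothetical factor and outcome (two observations per level)
f <- factor(c("A", "A", "B", "B", "C", "C"))
y <- c(1, 3, 4, 6, 9, 11)

# No intercept: one 0/1 indicator column per level
X_index <- model.matrix(~ 0 + f)
X_index

# Each coefficient is the mean of its own category (here 2, 5 and 10)
fit_index <- lm(y ~ 0 + f)
coef(fit_index)

# Differences are computed after fitting, e.g. B minus A
diff_B_A <- unname(coef(fit_index)["fB"] - coef(fit_index)["fA"])
diff_B_A
```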
Sum-coding sums to zero
- We code our categories in sum coding so that the codes add up to zero
- This means the Intercept is the mean of the dependent variable at the average of all categories
| Sum code | Level |
|---|---|
| -0.5 | L1 |
| -0.5 | L1 |
| -0.5 | L1 |
| 0.5 | L2 |
| 0.5 | L2 |
| 0.5 | L2 |
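Two ways to get this coding in base R, sketched with a hypothetical two-level factor `group`: build a numeric column by hand, or change the factor's `contrasts`. The object names are assumptions for illustration:

```r
# Hypothetical two-level factor, three observations per level
group <- factor(c("L1", "L1", "L1", "L2", "L2", "L2"))

# Option 1: a numeric predictor coded -0.5 / 0.5
group_sumcoded <- ifelse(group == "L1", -0.5, 0.5)
sum(unique(group_sumcoded))  # the two codes add up to zero

# Option 2: keep the factor, but replace its contrasts
group_c <- group
contrasts(group_c) <- c(-0.5, 0.5)
contrasts(group_c)

# Both options give the same predictor column in the design matrix
X_sum <- model.matrix(~ group_c)
X_sum
```

Note that R's built-in `contr.sum` uses codes of 1 and -1 rather than 0.5 and -0.5, so with that contrast the coefficient is half the difference between the two levels rather than the full difference.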
Different coding in model predicting rt by group
| Representation wanted | Formula we enter |
|---|---|
| Dummy coding in R | rt ~ 0 + Intercept + group_factor |
| Index variable in R | rt ~ 0 + group_factor |
| Sum coding (also changing contrasts) | rt ~ 0 + Intercept + group_sumcoded |
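The `0 + Intercept` term in these formulas appears to be brms syntax. As a rough base R analogue (an assumption, not the workshop script), the sketch below fits the three representations with `lm()`, where the intercept is written as `1`. The simulated `rt` values and group means are invented:

```r
# Simulated reaction times for two groups (values are made up)
set.seed(2023)
d <- data.frame(group_factor = factor(rep(c("L1", "L2"), each = 50)))
d$rt <- rnorm(100, mean = ifelse(d$group_factor == "L1", 500, 550), sd = 40)
d$group_sumcoded <- ifelse(d$group_factor == "L1", -0.5, 0.5)

fit_dummy <- lm(rt ~ 1 + group_factor, data = d)    # dummy coding
fit_index <- lm(rt ~ 0 + group_factor, data = d)    # index variable
fit_sum   <- lm(rt ~ 1 + group_sumcoded, data = d)  # sum coding

# Recover the two group means from each set of coefficients
means_dummy <- c(coef(fit_dummy)[1], sum(coef(fit_dummy)))
means_index <- coef(fit_index)
means_sum   <- coef(fit_sum)[1] + c(-0.5, 0.5) * coef(fit_sum)[2]

rbind(dummy = unname(means_dummy),
      index = unname(means_index),
      sum   = unname(means_sum))
```

All three rows should agree: the representations describe the same group means, just split across the coefficients differently.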
Equations for our different options
This is the dummy coding version:
\(rt \sim Normal(\mu, \sigma)\)
\(\mu = Intercept + \beta_1 group\_factor_{L2}\)
Equations for our different options
This is the index variable version:
\(rt \sim Normal(\mu, \sigma)\)
\(\mu = group\_factor_i\)
Equations for our different options
This is the sum coding version:
\(rt \sim Normal(\mu, \sigma)\)
\(\mu = Intercept + \beta_1 group\_sumcoded\)
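To compare the three versions, it can help to write out what each one implies for the two group means. This is a worked sketch that assumes the codes from the sum-coding table (L1 = -0.5, L2 = 0.5); \(\mu_{L1}\) and \(\mu_{L2}\) are shorthand introduced here for the expected rt in each group:
\(\text{Dummy: } \mu_{L1} = Intercept, \quad \mu_{L2} = Intercept + \beta_1\)
\(\text{Index: } \mu_{L1} = group\_factor_{L1}, \quad \mu_{L2} = group\_factor_{L2}\)
\(\text{Sum: } \mu_{L1} = Intercept - 0.5\beta_1, \quad \mu_{L2} = Intercept + 0.5\beta_1\)
Under sum coding it follows that \(Intercept = (\mu_{L1} + \mu_{L2})/2\) and \(\beta_1 = \mu_{L2} - \mu_{L1}\), so \(\beta_1\) is the same group difference as in the dummy-coded model, while the Intercept moves from the L1 mean to the average of the two group means.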
Now we are going to practice fitting and interpreting categorical effects
Let’s open up the script S3_E1_comparing_contrasts.R