Introduction to linear mixed models?

Harnessing random effects

Author

Affiliation

Julien Martin

BIO 8940 - Lecture 6

Published

September 19, 2024

1 When you are lacking independence …

1.1 Mixed models to the rescue!

Almost all biological data is inherently grouped
- Lakes
- Incubators
- Individual animals/plants
- Sites

Mixed models should probably be the rule, not the exception

Give examples and discuss

1.2 Advantages of mixed models

Allows us to account for repeated measurements on the same group
Allows for missing data. Does not drop entire subject if missing one observation for that subject.
Helps limit problem of over-fitting many different parameters
Avoids the need to average within a group which underestimates the true variation

Impact of factor with 50 levels on df vs 1 random effect with 50 level
Underestimation true variation = overconfidence in results and lower p-values

1.3 Linear model so far

\[ Y_i = \underbrace{ a + \sum_k^n b_k X_{k_i}}_\text{fixed} + \underbrace{\epsilon_i}_\text{random} \]

\[\epsilon_i \sim N(0, {\sigma_{\epsilon}}^2)\]

Fixed part:
- describe deterministic processes.
Random part:
- describe stochastic processes.

Problems when

Observations are not independent (Pseudo-replication)
Heterogeneous variance

1.4 Extending the linear model

\[ Y_{ij} = \underbrace{a + \sum_k^n b_k X_{k_{ij}}}_\text{fixed} + \underbrace{\epsilon_{ij} + a_j}_\text{random} \]

\[\epsilon_{ij} \sim N(0, {\sigma_{\epsilon}}^2)\]

\[ a_j \sim N(0, {\sigma_{a}}^2) \]

Fixed part:
- Same as before.
Random part:
- Include grouping variables structuring the data

2 Example

2.1 The data

Marine benthic data from nine inter-tidal areas along the Dutch coast collected by the Dutch institute RIKZ in the summer of 2002. In each inter-tidal area (denoted by ‘beach’), five samples were taken, and the macro-fauna and abiotic variables were measured.

Sample: sample number
Richness: species richness
Exposure: index composed of the surf zone, slope, grain size, and depth of anaerobic layer
NAP: height of sampling station compared to mean tidal level
Beach: beach identifier

  Sample Richness Exposure    NAP Beach
1      1       11       10  0.045     1
2      2       10       10 -1.036     1
3      3       13       10 -1.336     1
4      4       11       10  0.616     1
5      5       10       10 -0.684     1
6      6        8        8  1.190     2

2.2 Data structure

2.3 Traditional linear model

\[Richness_{i} = a + b_1\ NAP_{i} + b_2\ Exposure_{i} + \epsilon_{i}\]

What are the problems?

Pseudo-replication * Exposure varies only across beaches not within * Sites within a beach are not independent

2.4 2 processes at 2 different levels

Within a beach effect: NAP

Between beaches effect: Exposure

2.5 Within a beach

Equation for each beach

\[Richness_{i1} = a_1 + b_{1_1}\ NAP_{i1} + \epsilon_{i1}\] \[Richness_{i2} = a_2 + b_{1_2}\ NAP_{i2} + \epsilon_{i2}\]

…

\[Richness_{i9} = a_9 + b_{1_9}\ NAP_{i9} + \epsilon_{i9}\]

9 equations and 27 parameters Not great

Overall, \[Richness_{ij} = a_j + b_{1_j}\ NAP_{ij} + \epsilon_{ij}\]

2.6 Between beaches

Cannot simply do \(Richness \sim Exposure\)

In \[ Richness_{ij} = a_j + b_{1_j}\ NAP_{ij} + \epsilon_{ij} \]

What captures the effect of Exposure

Can we do ?

\[ a_{j} = a + b_{exp}\ exposure_{j} + \epsilon_{j} \]

2.7 Within and between beaches effects

within effects -> 27 parameters

Between effects -> 3 parameters

Total 30 parameters

😱 😭

(with 3 being stats-on stats)

2.8 Estimating both simultaneously

\[ Richness_{ij} = (a + a_j) + b_1 NAP_{ij} + b_2 Exposure_j + \epsilon_{ij} \]

a is the general intercept
a_j are the deviation from \(a\) for each beach
b₁ and b₂ are the intercept and slope for the average beach

\[ Richness_{ij} = \underbrace{a + b_1{NAP_{ij}} + b_2{Exposure_j}}_\text{fixed} + \underbrace{a_j + \epsilon_{ij}}_\text{random} \]

\[a_j \sim N(0, {\sigma_{a}}^2)\] \[\epsilon_{ij} \sim N(0, {\sigma_{\epsilon}}^2)\]

😃 5 parameters estimated 😃

assuming same NAP effect on each beach
assuming same residual variance for each beach

2.9 Fixed and random effects

Fixed

a, b₁ and b₂ estimated with a given error

Random

a_j not estimated directly but assumed to come from a Gaussian distribution with a mena of 0 and an estimated Variance

\[ a_j \sim N(0, {\sigma^2_{a_j}}) \]

2.10 Fixed part

\[y_{ij} = \underbrace{\color{red}{a + b_1{NAP_{ij}}} + b_2{Exposure_j}}_\text{fixed} + \underbrace{a_j + \epsilon_{ij}}_\text{random}\]

2.11 Random part

\[ Richness_{ij} = \underbrace{\color{red}{a + b_1{NAP_{ij}}} + b_2{Exposure_j}}_\text{fixed} + \color{purple}{\underbrace{a_j + \epsilon_{ij}}_\text{random}} \]

2.12 R model output

summary(mod.mix)

Linear mixed model fit by REML ['lmerMod']
Formula: Richness ~ NAP + scale(Exposure, scale = F) + (1 | Beach)

REML criterion at convergence: 225.5

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.3462 -0.5025 -0.2306  0.2150  4.2746 

Random effects:
 Groups   Name        Variance Std.Dev.
 Beach    (Intercept) 0.3378   0.5812  
 Residual             9.3735   3.0616  
Number of obs: 45, groups:  Beach, 9

Fixed effects:
                           Estimate Std. Error t value
(Intercept)                  6.6256     0.5221  12.691
NAP                         -2.6941     0.4700  -5.732
scale(Exposure, scale = F)  -3.0005     0.5417  -5.539

Correlation of Fixed Effects:
            (Intr) NAP   
NAP         -0.313       
scl(Ex,s=F)  0.015 -0.047

2.13 Estimated equation

\[ Richness_{ij} = \underbrace{6.63 -2.69{NAP_{ij}} -3{Exposure_j}}_\text{fixed} + \underbrace{a_j + \epsilon_{ij}}_\text{random} \]

\[ a_j \sim N(0, 0.34) \]

\[ \epsilon_{ij} \sim N(0, 9.37) \]

2.14 BLUPs (\(a_j\)) are shrunk

Estimates for each groups are constrained (shrunk) too avoid extreme values.

3 Variance components

3.1 Modelling the variance

We estimated the variance explained by the model

\({\sigma_{\hat{y}}}^2\).

We decomposed teh residual variance in 2 :
- \({\sigma_a}^2\) : across beach variance
- \({\sigma_{\epsilon}}^2\) Variation not associated to beaches (i.e. within-beach variance)

3.2 Estimating repeatability

Repeatability Proportion of (residual) variance associated to among levels of a random effect.
Here variation associated to among beach differences

\[ Richness_{ij} = \underbrace{a + b_1{NAP_{ij}} + b_2{Exposure_j}}_\text{fixed} + \underbrace{a_j + \epsilon_{ij}}_\text{random} \]

\[ a_j \sim N(0, {\sigma_a}^2) \]

\[\epsilon \sim N(0, {\sigma_{\epsilon}}^2)\]

\[ r = \frac{{\sigma_a}^2}{{\sigma_a}^2 + {\sigma_{\epsilon}}^2}\]

3.3 With our model

\[ Richness_{ij} = \underbrace{37.3 -2.69{NAP_{ij}} -3{Exposure_j}}_\text{fixed} + \underbrace{a_j + \epsilon_{ij}}_\text{random} \]

\[a_j \sim N(0, 0.34)\]

\[\epsilon_{ij} \sim N(0, 9.37)\]

\[ r^2 = \frac{0.34}{0.34 + 9.37} = \frac{0.34}{9.71} = 0.03 \]

3.4 Repeatability is conditioned on fixed effects

Differences among groups can be explained by fixed effects and thus influence \({\sigma_a}^2\)

We can estimate repeatability with 2 models for example:

With exposure \[ Richness_{ij} = \underbrace{a + b_1{NAP_{ij}} + b_2{Exposure_j}}_\text{fixed} + \underbrace{a_j + \epsilon_{ij}}_\text{random} \]

or without

\[ Richness_{ij} = \underbrace{a + b_1{NAP_{ij}}}_\text{fixed} + \underbrace{a_j + \epsilon_{ij}}_\text{random} \]

3.5 Repeatability With exposure effects

\[ Richness_{ij} = \underbrace{37.3 -2.69{NAP_{ij}} -3{Exposure_j}}_\text{fixed} + \underbrace{a_j + \epsilon_{ij}}_\text{random} \]

\[ a_j \sim N(0, 0.338) \]

\[ \epsilon_{ij} \sim N(0, 9.374) \]

\[ r\ (with\ exposure) = \frac{0.338}{0.338 + 9.37} = 0.03 \]

3.6 Repeatability without exposure

\[ Richness_{ij} = \underbrace{6.58 -2.57{NAP_{ij}}}_\text{fixed} + \underbrace{a_j + \epsilon_{ij}}_\text{random} \]

\[ a_j \sim N(0, 8.668) \]

\[ \epsilon_{ij} \sim N(0, 9.362) \]

\[ r\ (without\ exposure) = \frac{8.668}{8.668 + 9.36} = 0.48 \]

4 Models with multiple random effects

4.1 When tricky becomes trickier

Complex model use only when necessary
Easy to build a nonsensical models which is over-parametrized …

Key to understand differences between random effect types

4.2 Nested effects

B is nested within A if each level of B is present in only 1 level of B

4.3 Crossed effects

A and B are crossed .. when they are not nested 😜

A and B are fully crossed when all levels of B are present in all levels of A

4.4 Random regression

Previous model assumed that NAP effect was the same for all beaches

This might not be true

Beaches might vary in both their mean Richness and their response to NAP

\[ Richness_{ij} = \underbrace{a + b_1{NAP_{ij}} + b_2{Exposure_j}}_\text{fixed} + \underbrace{a_j + \color{#DD3333}{b_{1j}NAP_{ij}} + \epsilon_{ij}}_\text{random} \]

\[\epsilon_{ij} \sim N(0, \sigma^2_{\epsilon})\]

\[\begin{bmatrix} a_j \\ \color{#DD3333}{b_{1j}} \end{bmatrix} \sim N\left(0, \begin{matrix} \sigma_a^2 & \color{blue}{cov(a_j, b_{1j})} \\ \color{blue}{cov(a_j, b_{1j})} & \color{#DD3333}{\sigma_{b_{1j}}^2} \end{matrix} \right) \]

we are thus adding 2 extra parameters:

random slope
a covariance between the intercept and slope

4.5 Random regression

\[ Richness_{ij} = \underbrace{a + b_1{NAP_{ij}} + b_2{Exposure_j}}_\text{fixed} + \underbrace{\color{#DD3333}{a_j + b_{1j}NAP_{ij}} + \epsilon_{ij}}_\text{random} \]

5 Building a mixed model

5.1 Choosing effects

Fixed effects

based on question
based on confounding effects (think and justify based on causality)

Random effects

based on questions
based on data structure

5.2 Testing for effects

Fixed effects

Degrees of freedom cannot be estimated exactly

Use approximation (Satterwaite, Kenward-Rogers, …)

Random effects

Likelihood Ratio test

Fit model with and without a random effect using REML and then compare

Happy modelling