Introduction to linear mixed models?

Harnessing random effects

Author
Affiliation

Julien Martin

BIO 8940 - Lecture 6

Published

September 19, 2024

1 When you are lacking independence …

1.1 Mixed models to the rescue!

  • Almost all biological data is inherently grouped

    • Lakes

    • Incubators

    • Individual animals/plants

    • Sites

Mixed models should probably be the rule, not the exception

Give examples and discuss

1.2 Advantages of mixed models

  • Allows us to account for repeated measurements on the same group

  • Allows for missing data. Does not drop entire subject if missing one observation for that subject.

  • Helps limit problem of over-fitting many different parameters

  • Avoids the need to average within a group which underestimates the true variation

  • Impact of factor with 50 levels on df vs 1 random effect with 50 level
  • Underestimation true variation = overconfidence in results and lower p-values

1.3 Linear model so far

\[ Y_i = \underbrace{ a + \sum_k^n b_k X_{k_i}}_\text{fixed} + \underbrace{\epsilon_i}_\text{random} \]

\[\epsilon_i \sim N(0, {\sigma_{\epsilon}}^2)\]

  • Fixed part:
    • describe deterministic processes.
  • Random part:
    • describe stochastic processes.

Problems when

  1. Observations are not independent (Pseudo-replication)
  2. Heterogeneous variance

1.4 Extending the linear model

\[ Y_{ij} = \underbrace{a + \sum_k^n b_k X_{k_{ij}}}_\text{fixed} + \underbrace{\epsilon_{ij} + a_j}_\text{random} \]

\[\epsilon_{ij} \sim N(0, {\sigma_{\epsilon}}^2)\]

\[ a_j \sim N(0, {\sigma_{a}}^2) \]

  • Fixed part:
    • Same as before.
  • Random part:
    • Include grouping variables structuring the data

2 Example

2.1 The data

Marine benthic data from nine inter-tidal areas along the Dutch coast collected by the Dutch institute RIKZ in the summer of 2002. In each inter-tidal area (denoted by ‘beach’), five samples were taken, and the macro-fauna and abiotic variables were measured.

  • Sample: sample number

  • Richness: species richness

  • Exposure: index composed of the surf zone, slope, grain size, and depth of anaerobic layer

  • NAP: height of sampling station compared to mean tidal level

  • Beach: beach identifier

  Sample Richness Exposure    NAP Beach
1      1       11       10  0.045     1
2      2       10       10 -1.036     1
3      3       13       10 -1.336     1
4      4       11       10  0.616     1
5      5       10       10 -0.684     1
6      6        8        8  1.190     2

2.2 Data structure

G a Study b1 Beach 1 a--b1 b2 Beach 2 a--b2 b3 Beach 3 a--b3 s1 Site 1 b1--s1 s2 Site 2 b1--s2 s3 Site 3 b1--s3 s4 Site 4 b1--s4 s5 Site 5 b1--s5 s6 Site 6 b2--s6 s7 Site 7 b2--s7 s8 Site 8 b2--s8 s9 Site 9 b2--s9 s10 Site 10 b2--s10 s11 Site 11 b3--s11 s12 Site 12 b3--s12 s13 Site 13 b3--s13 s14 Site 14 b3--s14 s15 Site 15 b3--s15

2.3 Traditional linear model


\[Richness_{i} = a + b_1\ NAP_{i} + b_2\ Exposure_{i} + \epsilon_{i}\]


What are the problems?

Pseudo-replication * Exposure varies only across beaches not within * Sites within a beach are not independent

2.4 2 processes at 2 different levels

  1. Within a beach effect: NAP


  1. Between beaches effect: Exposure

2.5 Within a beach

  • Equation for each beach

\[Richness_{i1} = a_1 + b_{1_1}\ NAP_{i1} + \epsilon_{i1}\] \[Richness_{i2} = a_2 + b_{1_2}\ NAP_{i2} + \epsilon_{i2}\]

\[Richness_{i9} = a_9 + b_{1_9}\ NAP_{i9} + \epsilon_{i9}\]

9 equations and 27 parameters Not great

Overall, \[Richness_{ij} = a_j + b_{1_j}\ NAP_{ij} + \epsilon_{ij}\]

2.6 Between beaches

  • Cannot simply do \(Richness \sim Exposure\)
  • In \[ Richness_{ij} = a_j + b_{1_j}\ NAP_{ij} + \epsilon_{ij} \]

What captures the effect of Exposure

  • Can we do ?

\[ a_{j} = a + b_{exp}\ exposure_{j} + \epsilon_{j} \]

2.7 Within and between beaches effects


  • within effects -> 27 parameters


  • Between effects -> 3 parameters


Total 30 parameters

😱 😭

(with 3 being stats-on stats)

2.8 Estimating both simultaneously

\[ Richness_{ij} = (a + a_j) + b_1 NAP_{ij} + b_2 Exposure_j + \epsilon_{ij} \]

  • a is the general intercept
  • aj are the deviation from \(a\) for each beach
  • b1 and b2 are the intercept and slope for the average beach

\[ Richness_{ij} = \underbrace{a + b_1{NAP_{ij}} + b_2{Exposure_j}}_\text{fixed} + \underbrace{a_j + \epsilon_{ij}}_\text{random} \]

\[a_j \sim N(0, {\sigma_{a}}^2)\] \[\epsilon_{ij} \sim N(0, {\sigma_{\epsilon}}^2)\]

😃 5 parameters estimated 😃

  • assuming same NAP effect on each beach
  • assuming same residual variance for each beach

2.9 Fixed and random effects

Fixed

  • a, b1 and b2 estimated with a given error


Random

  • aj not estimated directly but assumed to come from a Gaussian distribution with a mena of 0 and an estimated Variance

\[ a_j \sim N(0, {\sigma^2_{a_j}}) \]

2.10 Fixed part

\[y_{ij} = \underbrace{\color{red}{a + b_1{NAP_{ij}}} + b_2{Exposure_j}}_\text{fixed} + \underbrace{a_j + \epsilon_{ij}}_\text{random}\]

2.11 Random part

\[ Richness_{ij} = \underbrace{\color{red}{a + b_1{NAP_{ij}}} + b_2{Exposure_j}}_\text{fixed} + \color{purple}{\underbrace{a_j + \epsilon_{ij}}_\text{random}} \]

2.12 R model output

summary(mod.mix)
Linear mixed model fit by REML ['lmerMod']
Formula: Richness ~ NAP + scale(Exposure, scale = F) + (1 | Beach)

REML criterion at convergence: 225.5

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.3462 -0.5025 -0.2306  0.2150  4.2746 

Random effects:
 Groups   Name        Variance Std.Dev.
 Beach    (Intercept) 0.3378   0.5812  
 Residual             9.3735   3.0616  
Number of obs: 45, groups:  Beach, 9

Fixed effects:
                           Estimate Std. Error t value
(Intercept)                  6.6256     0.5221  12.691
NAP                         -2.6941     0.4700  -5.732
scale(Exposure, scale = F)  -3.0005     0.5417  -5.539

Correlation of Fixed Effects:
            (Intr) NAP   
NAP         -0.313       
scl(Ex,s=F)  0.015 -0.047

2.13 Estimated equation

\[ Richness_{ij} = \underbrace{6.63 -2.69{NAP_{ij}} -3{Exposure_j}}_\text{fixed} + \underbrace{a_j + \epsilon_{ij}}_\text{random} \]

\[ a_j \sim N(0, 0.34) \]

\[ \epsilon_{ij} \sim N(0, 9.37) \]

2.14 BLUPs (\(a_j\)) are shrunk

  • Estimates for each groups are constrained (shrunk) too avoid extreme values.

3 Variance components

3.1 Modelling the variance

  • We estimated the variance explained by the model

\({\sigma_{\hat{y}}}^2\).

  • We decomposed teh residual variance in 2 :
    • \({\sigma_a}^2\) : across beach variance
    • \({\sigma_{\epsilon}}^2\) Variation not associated to beaches (i.e. within-beach variance)

3.2 Estimating repeatability

  • Repeatability Proportion of (residual) variance associated to among levels of a random effect.

  • Here variation associated to among beach differences

\[ Richness_{ij} = \underbrace{a + b_1{NAP_{ij}} + b_2{Exposure_j}}_\text{fixed} + \underbrace{a_j + \epsilon_{ij}}_\text{random} \]

\[ a_j \sim N(0, {\sigma_a}^2) \]

\[\epsilon \sim N(0, {\sigma_{\epsilon}}^2)\]

\[ r = \frac{{\sigma_a}^2}{{\sigma_a}^2 + {\sigma_{\epsilon}}^2}\]

3.3 With our model

\[ Richness_{ij} = \underbrace{37.3 -2.69{NAP_{ij}} -3{Exposure_j}}_\text{fixed} + \underbrace{a_j + \epsilon_{ij}}_\text{random} \]

\[a_j \sim N(0, 0.34)\]

\[\epsilon_{ij} \sim N(0, 9.37)\]


\[ r^2 = \frac{0.34}{0.34 + 9.37} = \frac{0.34}{9.71} = 0.03 \]

3.4 Repeatability is conditioned on fixed effects

Differences among groups can be explained by fixed effects and thus influence \({\sigma_a}^2\)

We can estimate repeatability with 2 models for example:

  • With exposure \[ Richness_{ij} = \underbrace{a + b_1{NAP_{ij}} + b_2{Exposure_j}}_\text{fixed} + \underbrace{a_j + \epsilon_{ij}}_\text{random} \]
  • or without

\[ Richness_{ij} = \underbrace{a + b_1{NAP_{ij}}}_\text{fixed} + \underbrace{a_j + \epsilon_{ij}}_\text{random} \]

3.5 Repeatability With exposure effects

\[ Richness_{ij} = \underbrace{37.3 -2.69{NAP_{ij}} -3{Exposure_j}}_\text{fixed} + \underbrace{a_j + \epsilon_{ij}}_\text{random} \]

\[ a_j \sim N(0, 0.338) \]

\[ \epsilon_{ij} \sim N(0, 9.374) \]

\[ r\ (with\ exposure) = \frac{0.338}{0.338 + 9.37} = 0.03 \]

3.6 Repeatability without exposure

\[ Richness_{ij} = \underbrace{6.58 -2.57{NAP_{ij}}}_\text{fixed} + \underbrace{a_j + \epsilon_{ij}}_\text{random} \]

\[ a_j \sim N(0, 8.668) \]

\[ \epsilon_{ij} \sim N(0, 9.362) \]

\[ r\ (without\ exposure) = \frac{8.668}{8.668 + 9.36} = 0.48 \]

4 Models with multiple random effects

4.1 When tricky becomes trickier

  • Complex model use only when necessary

  • Easy to build a nonsensical models which is over-parametrized …

Key to understand differences between random effect types

4.2 Nested effects

B is nested within A if each level of B is present in only 1 level of B

4.3 Crossed effects

A and B are crossed .. when they are not nested 😜

A and B are fully crossed when all levels of B are present in all levels of A

G a Study b1 A 1 a--b1 b2 A 2 a--b2 b3 A 3 a--b3 s1 B 1 b1--s1 s2 B 2 b1--s2 s3 B 3 b1--s3 b2--s1 b2--s2 b2--s3 b3--s1 b3--s2 b3--s3

4.4 Random regression

  • Previous model assumed that NAP effect was the same for all beaches

This might not be true

  • Beaches might vary in both their mean Richness and their response to NAP

\[ Richness_{ij} = \underbrace{a + b_1{NAP_{ij}} + b_2{Exposure_j}}_\text{fixed} + \underbrace{a_j + \color{#DD3333}{b_{1j}NAP_{ij}} + \epsilon_{ij}}_\text{random} \]

\[\epsilon_{ij} \sim N(0, \sigma^2_{\epsilon})\]

\[\begin{bmatrix} a_j \\ \color{#DD3333}{b_{1j}} \end{bmatrix} \sim N\left(0, \begin{matrix} \sigma_a^2 & \color{blue}{cov(a_j, b_{1j})} \\ \color{blue}{cov(a_j, b_{1j})} & \color{#DD3333}{\sigma_{b_{1j}}^2} \end{matrix} \right) \]

we are thus adding 2 extra parameters:

  • random slope

  • a covariance between the intercept and slope

4.5 Random regression

\[ Richness_{ij} = \underbrace{a + b_1{NAP_{ij}} + b_2{Exposure_j}}_\text{fixed} + \underbrace{\color{#DD3333}{a_j + b_{1j}NAP_{ij}} + \epsilon_{ij}}_\text{random} \]

5 Building a mixed model

5.1 Choosing effects

Fixed effects

  1. based on question
  2. based on confounding effects (think and justify based on causality)


Random effects

  1. based on questions
  2. based on data structure

5.2 Testing for effects

Fixed effects

Degrees of freedom cannot be estimated exactly

Use approximation (Satterwaite, Kenward-Rogers, …)


Random effects

Likelihood Ratio test

Fit model with and without a random effect using REML and then compare

Happy modelling