Generalized Additive Models

An introduction

Author

Affiliation

Julien Martin

BIO 8940

Published

March 23, 2026

0.1 Required packages and datasets

For this class, we will be working with the following dataset:

ISIT.csv

You should also make sure you have downloaded, installed, and loaded the following packages:

install.packages(c('ggplot2', 'itsadug', 'mgcv'))

0.2 Class overview

The linear model… and where it fails
Introduction to GAMs
How GAMs work
GAMs with multiple smooth terms
GAMs with interaction terms

GAM validation
Choosing another distribution
Changing the basis function
Quick introduction to GAMMs (Mixed effect GAMs)

0.3 Learning objectives

Use the mgcv package to fit non-linear relationships,
Understand the output of a Generalized Additive Model (GAM) to help you understand your data,
Use tests to determine if a non-linear model fits better than a linear one,
Include smooth interactions between variables,
Understand the idea of a basis function, and why it makes GAMs so powerful,
Account for dependence in data (autocorrelation, hierarchical structure) using GAMMs.

1 The linear model

… and where it fails

1.1 Linear regression

Linear model with multiple predictors:

\[y_i = \beta_0 + \beta_1x_{1,i}+\beta_2x_{2,i}+\beta_3x_{3,i}+...+\beta_kx_{k,i} + \epsilon_i\]

. . .

As we saw in previous class, a linear model is based on four major assumptions:

Linear relationship between response and predictor variables
Normally distributed error: \[\epsilon_i \sim \mathcal{N}(0,\,\sigma^2)\]
Homogeneity of the errors’ variance
Independence of the errors (homoscedasticity)

A linear model can easily accommodate certain types of non-linear responses (e.g. \(x^2\)) but this approach strongly relies on (arbitrary or well-informed) choices, and is less flexible than using an additive model.

1.2 Linear regression

Linear models work very well in certain specific cases where all these criteria are met:

1.3 Linear regression

In reality, we often cannot meet these criteria. In many cases, linear models are inappropriate:

1.4 Linear regression

What’s the problem and do we fix it?

A linear model tries to fit the best straight line that passes through the data, so it doesn’t work well for all datasets.

In contrast, a GAM can capture complexe relationships by fitting a non-linear smooth function through the data, while controlling how wiggly the smooth can get (more on this later).

2 Introduction to GAMs

2.1 Generalized Additive Models (GAM)

Let us use an example to demonstrate the difference between a linear regression and an additive model.

First, we will load the ISIT dataset.

isit <- read.csv("data/ISIT.csv")
head(isit)

This dataset is comprised of bioluminescence levels (Sources) in relation to depth, seasons and different stations.

For now, we will be focusing on Season 2.

isit2 <- subset(isit, Season == 2)

Let’s begin by trying to fit a linear regression model to the relationship between Sources and SampleDepth.

linear_model <- gam(Sources ~ SampleDepth, data = isit2)

2.2 Generalized Additive Models (GAM)

summary(linear_model)


Family: gaussian 
Link function: identity 

Formula:
Sources ~ SampleDepth

Parametric coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 30.9021874  0.7963891   38.80   <2e-16 ***
SampleDepth -0.0083450  0.0003283  -25.42   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


R-sq.(adj) =  0.588   Deviance explained = 58.9%
GCV =  60.19  Scale est. = 59.924    n = 453

The linear model is explaining quite a bit of variance in our dataset ( \(R_{adj}\) = 0.588), which means it’s a pretty good model, right?

2.3 Generalized Additive Models (GAM)

Well, let’s take a look at how our model fits the data:

data_plot <- ggplot(data = isit2, aes(y = Sources, x = SampleDepth)) +
  geom_point() +
  geom_line(aes(y = fitted(linear_model)), colour = "red", size = 1.2) +
  theme_bw()
data_plot

Are the assumptions of linear regression met in this case? As you may have noticed, we are violating the assumptions of the linear model:

There is a strong non-linear relationship between Sources and SampleDepth.
The error is not normally distributed.
The variance of the error is not homoscedastic.
The errors are not independent of each other.

2.4 Generalized Additive Models (GAM)

Relationship between the response variable and predictor variables

One predictor variable: \[y_i = \beta_0 + f(x_i) + \epsilon\]

Multiple predictor variables: \[y_i = \beta_0 + f_1(x_{1,i}) + f_2(x_{2,i}) + ... + \epsilon\]

One big advantage of using a GAM over a manual specification of the model is that the optimal shape, i.e. the degree of smoothness of \(f(x)\), is determined automatically depending on the fitting method (usually maximum likelihood).

Strictly speaking, the equations are for a Gaussian GAM with identity link, which is also called “additive model” (without “generalized”).

Importantly, given that the smooth function \(f(x_i)\) is non-linear and local, the magnitude of the effect of the explanatory variable can vary over its range, depending on the relationship between the variable and the response.

That is, as opposed to one fixed coefficient \(\beta\), the function \(f\) can continually change over the range of \(x_i\). The degree of smoothness (or wiggliness) of \(f\) is controlled using penalized regression determined automatically in mgcv using a generalized cross-validation (GCV) routine.

2.5 Generalized Additive Models (GAM)

Let’s try to fit a model to the data using a smooth function s() with mgcv::gam()

gam_model <- gam(Sources ~ s(SampleDepth), data = isit2)

. . .


Family: gaussian 
Link function: identity 

Formula:
Sources ~ s(SampleDepth)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  12.8937     0.2471   52.17   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                 edf Ref.df     F p-value    
s(SampleDepth) 8.908  8.998 214.1  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =   0.81   Deviance explained = 81.4%
GCV = 28.287  Scale est. = 27.669    n = 453

We can try to build a more appropriate model by fitting the data with a smoothed (non-linear) term. In mgcv::gam(), smooth terms are specified by expressions of the form s(x), where \(x\) is the non-linear predictor variable we want to smooth. In this case, we want to apply a smooth function to SampleDepth.

The variance explained by our model has increased by more than 20% (\(R_{adj}\) = 0.81)!

2.6 Generalized Additive Models (GAM)

Note: as opposed to one fixed coefficient, \(\beta\) in linear regression, the smooth function can continually change over the range of the predictor \(x\).

When we compare the fit of the linear (red) and non-linear (blue) models, it is clear that the latter is more appropriate for our dataset.

2.7 Generalized Additive Models (GAM)

The mgcv package also includes a default plot() function to look at the smooths:

plot(gam_model)

2.8 Test for linearity using GAM

How do we test whether the non-linear model offers a significant improvement over the linear model?

We can use gam() and AIC() to compare the performance of a linear model containing x as a linear predictor to the performance of a non-linear model containing s(x) as a smooth predictor.

linear_model <- gam(Sources ~ SampleDepth, data = isit2)
smooth_model <- gam(Sources ~ s(SampleDepth), data = isit2)
AIC(linear_model, smooth_model)

                   df      AIC
linear_model  3.00000 3143.720
smooth_model 10.90825 2801.451

Here, the AIC of the smooth GAM is lower, which indicates that adding a smoothing function improves model performance. Linearity is therefore not supported by our data.

In other words, we ask whether adding a smooth function to the linear model improves the fit of the model to our dataset.

As a brief explanation, the Akaike Information Criterion (AIC) is a comparative metric of model performance, where lower scores indicate that a model is performing “better” compared to other considered models.

2.9 Test for linearity using GAM

We can use gam() and anova() to test whether an assumption of linearity is justified. To do so, we must simply set our smoothed model so that it is nested in our linear model.

linear_model <- gam(y_obs ~ x) # fit a regular linear model using gam()
nested_gam_model <- gam(y_obs ~ s(x) + x)
anova(linear_model, nested_gam_model, test = "Chisq")

Note that the model y_obs~s(x) gives exactly the same results as y_obs~s(x)+x. We used the s(x)+x to illustrate the nestedness of the model, but the +x can be omitted.

3 Exercise 1

3.1 Exercise 1

We will now try to determine whether this the data recorded in the first season should be modelled with a linear regression, or with an additive model.

Let’s repeat the comparison test with gam() and AIC() using the data recorded in the first season only:

isit1 <- subset(isit, Season == 1)

Fit a linear and smoothed GAM model to the relation between SampleDepth and Sources.
Determine if linearity is justified for this data.
How many effective degrees of freedom does the smoothed term have?

We have not discussed effective degrees of freedom (EDF) yet, but these are a key tool to help us interpret the fit of a GAM. Keep this term in mind. More on this in the next sections!

3.2 Exercise 1 - Solution

1. Fit a linear and smoothed GAM model to the relation between SampleDepth and Sources.

linear_model_s1 <- gam(Sources ~ SampleDepth, data = isit1)
smooth_model_s1 <- gam(Sources ~ s(SampleDepth), data = isit1)

3.3 Exercise 1 - Solution

2. Determine whether a linear model is appropriate for this data.

As before, visualizing the model fit on our dataset is an excellent first step to determine whether our model is performing well.

3.4 Exercise 1 - Solution

We can supplement this with a quantitative comparison of model performance using AIC() or a \(\chi^2\).

AIC(linear_model_s1, smooth_model_s1)

                      df      AIC
linear_model_s1 3.000000 2324.905
smooth_model_s1 9.644938 2121.249

The lower AIC score indicates that smooth model is performing better than the linear model, which confirms that linearity is not appropriate for our dataset.

anova(linear_model_s1, smooth_model_s1, type = "chisq")

3.5 Exercise 1 - Solution

3. How many effective degrees of freedom does the smoothed term have?

To get the effective degrees of freedom, we can simply print the model object:

smooth_model_s1


Family: gaussian 
Link function: identity 

Formula:
Sources ~ s(SampleDepth)

Estimated degrees of freedom:
7.64  total = 8.64 

GCV score: 32.13946

The effective degrees of freedom (EDF) are >> 1.

Keep this in mind, because we will be coming back to EDF soon!

4 3. How GAMs work

4.1 How GAMs work

We will now take a few minutes to look at what GAMs are doing behind the scenes. Let us first consider a simple model containing one smooth function \(f\) of one covariate, \(x\):

\[y_i = f(x_i) + \epsilon_i\]

To estimate the smooth function \(f\), we need to represented the above equation in such a way that it becomes a linear model. This can be done by defining basis functions, \(b_j(x)\), of which \(f\) is composed:

\[f(x) = \sum_{j=1}^q b_j(x) \times \beta_j\]

4.2 Example: a polynomial basis

Suppose that \(f\) is believed to be a 4th order polynomial, so that the space of polynomials of order 4 and below contains \(f\). A basis for this space would then be:

\[b_0(x)=1 \ , \quad b_1(x)=x \ , \quad b_2(x)=x^2 \ , \quad b_3(x)=x^3 \ , \quad b_4(x)=x^4\]

so that \(f(x)\) becomes:

\[f(x) = \beta_0 + x\beta_1 + x^2\beta_2 + x^3\beta_3 + x^4\beta_4\]

and the full model now becomes:

\[y_i = \beta_0 + x_i\beta_1 + x^2_i\beta_2 + x^3_i\beta_3 + x^4_i\beta_4 + \epsilon_i\]

4.3 Example: a polynomial basis

The basis functions are each multiplied by a real valued parameter, \(\beta_j\), and are then summed to give the final curve \(f(x)\). Here is an example of a polynomial basis of order 4.

4.4 Example: a cubic spline basis

A cubic spline is a curve constructed from sections of a cubic polynomial joined together so that they are continuous in value. Each section of cubic has different coefficients.

4.5 Example: a cubic spline basis

Here is a representation of a smooth function using a rank 5 cubic spline basis with knot locations at increments of 0.2:

Here, the knots are evenly spaced through the range of observed x values. However, the choice of the degree of model smoothness is controlled by the the number of knots, which was arbitrary.

Is there a better way to select the knot locations?

4.6 Controlling the degree of smoothing

Instead of controlling smoothness by altering the number of knots, we keep that fixed to size a little larger than reasonably necessary, and control the model’s smoothness by adding a “wiggleness” penalty.

So, rather than fitting the model by minimizing (as with least squares regression):

\[||y - XB||^{2}\]

it can be fit by minimizing:

\[||y - XB||^{2} + \lambda \int_0^1[f^{''}(x)]^2dx\]

As \(\lambda\) goes to infinity, the model becomes linear.

4.7 Controlling the degree of smoothing

If \(\lambda\) is too high then the data will be over smoothed, and if it is too low then the data will be under smoothed.

Ideally, it is best to choose \(\lambda\) so that the predicted \(\hat{f}\) is as close as possible to \(f\). A suitable criterion might be to choose \(\lambda\) to minimize:

\[M = 1/n \times \sum_{i=1}^n (\hat{f_i} - f_i)^2\]

Since \(f\) is unknown, \(M\) must be estimated.

The recommend methods for this are maximum likelihood (ML) or restricted maximum likelihood estimation (REML). Generalized cross validation (GCV) is another possibility.

4.8 The principle behind cross validation

fits many of the data poorly and does no better with the missing point.
fits the underlying signal quite well, smoothing through the noise and the missing datum is reasonably well predicted.
fits the noise as well as the signal and the extra variability induced causes it to predict the missing datum rather poorly.

5 4. GAM with multiple smooth terms

5.1 GAM with linear and smooth terms

GAMs make it easy to include both smooth and linear terms, multiple smoothed terms, and smoothed interactions.

For this section, we will use the ISIT data again. We will try to model the response Sources using the predictors Season and SampleDepth simultaneously.

First, we need to convert our categorical predictor (Season) into a factor variable.

head(isit)
isit$Season <- as.factor(isit$Season)

Remember this dataset from previous sections? The ISIT dataset is comprised of bioluminescence levels (Sources) in relation to depth, seasons and different stations.

5.2 GAM with linear and smooth terms

Let us start with a basic model, with one smoothed term (SampleDepth) and one categorical predictor (Season, which has 2 levels).

basic_model <- gam(
  Sources ~ Season + s(SampleDepth),
  data = isit,
  method = "REML"
)
summary(basic_model)


Family: gaussian 
Link function: identity 

Formula:
Sources ~ Season + s(SampleDepth)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7.2533     0.3613   20.08   <2e-16 ***
Season2       6.1561     0.4825   12.76   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                 edf Ref.df     F p-value    
s(SampleDepth) 8.706  8.975 184.4  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.692   Deviance explained = 69.5%
-REML = 2615.8  Scale est. = 42.413    n = 789

5.3 GAM with linear and smooth terms

What do these plots tell us about the relationship between bioluminescence, sample depth, and seasons?

par(mfrow = c(1, 2))
plot(basic_model, all.terms = TRUE)

Bioluminescence varies non-linearly across the SampleDepth gradient, with highest levels of bioluminescence at the surface, followed by a second but smaller peak just above a depth of 1500, with declining levels at deeper depths.

There is also a pronounced difference in bioluminescence between the seasons, with high levels during Season 2, compared to Season 1.

5.4 Effective degrees of freedom (EDF)

summary(basic_model)$s.table

                    edf   Ref.df        F p-value
s(SampleDepth) 8.706426 8.975172 184.3583       0

The edf shown in the s.table are the effective degrees of freedom (EDF) of the the smooth term s(SampleDepth).

Essentially, more EDF imply more complex, wiggly splines:

A value close to 1 tends to be close to a linear term.
A higher value means that the spline is more wiggly (non-linear).

. . .

In our basic model the EDF of the smooth function s(SampleDepth) are ~9, which suggests a highly non-linear curve.

5.5 Effective degrees of freedom (EDF)

Let us come back to the concept of effective degrees of freedom (EDF).

Effective degrees of freedom give us a lot of information about the relationship between model predictors and response variables.

. . .

You might recognize the term “degrees of freedom” from previous classs about linear models, but be careful!

The effective degrees of freedom of a GAM are estimated differently from the degrees of freedom in a linear regression, and are interpreted differently.

. . .

In linear regression, the model degrees of freedom are equivalent to the number of non-redundant free parameters \(p\) in the model, and the residual degrees of freedom are given by \(n-p\).

Because the number of free parameters in GAMs is difficult to define, the EDF are instead related to the smoothing parameter \(\lambda\), such that the greater the penalty, the smaller the EDF.

Let us come back to the concept of effective degrees of freedom (EDF).

5.6 Effective degrees of freedom (EDF) and \(k\)

An upper bound on the EDF is determined by the basis dimension \(k\) for each smooth function (the EDF cannot exceed \(k-1\))

In practice, the exact choice of \(k\) is arbitrary, but it should be large enough to accommodate a sufficiently complex smooth function.

We will talk about choosing \(k\) in upcoming sections.

5.7 GAM with linear and smooth terms

We can add a second term, RelativeDepth, but specify a linear relationship with Sources

two_term_model <- gam(
  Sources ~ Season + s(SampleDepth) + RelativeDepth,
  data = isit,
  method = "REML"
)
summary(two_term_model)


Family: gaussian 
Link function: identity 

Formula:
Sources ~ Season + s(SampleDepth) + RelativeDepth

Parametric coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    9.8083055  0.6478742  15.139  < 2e-16 ***
Season2        6.0419306  0.4767978  12.672  < 2e-16 ***
RelativeDepth -0.0014019  0.0002968  -4.723 2.76e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                 edf Ref.df     F p-value    
s(SampleDepth) 8.699  8.974 132.5  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =    0.7   Deviance explained = 70.4%
-REML =   2612  Scale est. = 41.303    n = 789

The regression coefficient which is estimated for this new linear term, RelativeDepth, will appear in the p.table. Remember, the p.table shows information on the parametric effects (linear terms).

In the s.table, we will once again find the non-linear smoother, s(SampleDepth), and its wiggleness parameter (edf). Remember, the s.table shows information on the additive effects (non-linear terms).

5.8 GAM with linear and smooth terms

par(mfrow = c(2, 2))
plot(two_term_model, all.terms = TRUE)

5.9 GAM with multiple smooth terms

We can model RelativeDepth as a smooth term instead.

two_smooth_model <- gam(
  Sources ~ Season + s(SampleDepth) + s(RelativeDepth),
  data = isit,
  method = "REML"
)
summary(two_smooth_model)


Family: gaussian 
Link function: identity 

Formula:
Sources ~ Season + s(SampleDepth) + s(RelativeDepth)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7.9378     0.3453   22.99   <2e-16 ***
Season2       4.9640     0.4782   10.38   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                   edf Ref.df      F p-value    
s(SampleDepth)   8.752  8.973 150.37  <2e-16 ***
s(RelativeDepth) 8.044  8.750  19.98  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.748   Deviance explained = 75.4%
-REML = 2550.1  Scale est. = 34.589    n = 789

The regression coefficient which is estimated for our only linear term, Season, will appear in the p.table.

In the s.table, we will now find two non-linear smoothers, s(SampleDepth) and s(RelativeDepth), and their wiggleness parameters (edf).

5.10 GAM with multiple smooth terms

Let us take a look at the relationships between the linear and non-linear predictors and our response variable.

par(mfrow = c(2, 2))
plot(two_smooth_model, all.terms = TRUE)

Do you think that the additional non-linear term improves the performance of our model representing the relationship between bioluminescence and relative depth?

5.11 GAM with multiple smooth terms

As before, we can compare our models with AIC to test whether the smoothed term improves our model’s performance.

AIC(basic_model, two_term_model, two_smooth_model)

                       df      AIC
basic_model      11.83374 5208.713
two_term_model   12.82932 5188.780
two_smooth_model 20.46960 5056.841

We can see that two_smooth_model has the lowest AIC value.

The best fit model includes both smooth terms for SampleDepth and RelativeDepth, and a linear term for Season.

6 Exercise 2

6.1 Exercise 2

For our second Exercise, we will be building onto our model by adding variables which we think might be ecologically significant predictors to explain bioluminescence.

Create two new models: Add Latitude to two_smooth_model, first as a linear term, then as a smooth term.
Is Latitude an important term to include? Does Latitude have a linear or additive effect?

Use plots, coefficient tables, and the AIC() function to help you answer this question.

6.2 Exercise 2 - Solution

1. Create two new models: Add Latitude to two_smooth_model, first as a linear term, then as a smooth term.

# Add Latitude as a linear term
three_term_model <- gam(
  Sources ~
    Season + s(SampleDepth) + s(RelativeDepth) + Latitude, #<<
  data = isit,
  method = "REML"
)
# Add Latitude as a smooth term
three_smooth_model <- gam(
  Sources ~
    Season + s(SampleDepth) + s(RelativeDepth) + s(Latitude), #<<
  data = isit,
  method = "REML"
)

6.3 Exercise 2 - Solution

2. Is Latitude an important term to include? Does Latitude have a linear or additive effect?

Let us begin by plotting the the 4 effects that are now included in each model.

6.4 Exercise 2 - Solution

We should also look at our coefficient tables. What can the EDF tell us about the wiggliness of our predictors’ effects?

summary(three_smooth_model)$s.table

                      edf   Ref.df         F p-value
s(SampleDepth)   8.766891 8.975682 68.950905       0
s(RelativeDepth) 8.007411 8.730625 17.639321       0
s(Latitude)      7.431116 8.296838  8.954349       0

. . .

The EDF are all quite high for all of our smooth variables, including Latitude.

This tells us that Latitude is quite wiggly, and probably should not be included as a linear term.

6.5 Exercise 2 - Solution

Before deciding which model is “best”, we should test whether Latitude is best included as a linear or as a smooth term, using AIC():

AIC(three_smooth_model, three_term_model)

                         df      AIC
three_smooth_model 28.20032 4990.546
three_term_model   21.47683 5051.415

. . .

Our model including Latitude as a smooth term has a lower AIC score, meaning it performs better than our model including Latitude as a linear term.

. . .

But, does adding Latitude as a smooth predictor actually improve on our last “best” model (two_smooth_model)?

6.6 Exercise 2 - Solution

AIC(two_smooth_model, three_smooth_model)

                         df      AIC
two_smooth_model   20.46960 5056.841
three_smooth_model 28.20032 4990.546

Our three_smooth_model has a lower AIC score than our previous best model (two_smooth_model), which did not include Latitude.

. . .

This implies that Latitude is indeed an informative non-linear predictor of bioluminescence.

7 4. Interactions

7.1 GAM with interaction terms

There are two ways to include interactions between variables:

for two smoothed variables : s(x1, x2)
for one smoothed variable and one linear variable (either factor or continuous): use the by argument s(x1, by = x2)
When x2 is a factor, you have a smooth term that vary between different levels of x2
When x2 is continuous, the linear effect of x2 varies smoothly with x1
When x2 is a factor, the factor needs to be added as a main effect in the model

7.2 Interaction smoothed & factor variables

We will examine interaction effects to determine whether the non-linear smoother s(SampleDepth) varies across different levels of Season.

factor_interact <- gam(
  Sources ~ Season +
    s(SampleDepth, by = Season) +
    s(RelativeDepth),
  data = isit,
  method = "REML"
)

summary(factor_interact)$s.table

                            edf   Ref.df          F p-value
s(SampleDepth):Season1 6.839386 7.552045  95.119422       0
s(SampleDepth):Season2 8.744574 8.966290 154.474325       0
s(RelativeDepth)       6.987223 8.055898   6.821074       0

7.3 Interaction smoothed & factor variables

The first two panels show the interaction effect of the SampleDepth smooth and each level of our factor variable, Season.

A good question to ask participants: Do you see a difference between the two smooth curves?

The plots show some differences between the shape of the smooth terms among the two levels of Season. The most notable difference is the peak in the second panel, which tells us that there is an effect of SampleDepth between 1000 and 2000 that is important in Season 2, but does not occur in Season 1.

This hints that the interaction effect could be important to include in our model.

7.4 Interaction smoothed & factor variables

We can also visualise our model in 3D using vis.gam().

vis.gam(factor_interact, theta = 120, n.grid = 50, lwd = .4)

We can change the rotation of this plot using the theta argument.

When we plot the interaction terms, we saw differences in the shape of the SampleDepth smooth among the two levels of Season. This hints that the interaction effect could be important to include in our model.

7.5 Interaction smoothed & factor variables

These plots hint that the interaction effect could be important to include in our model.

We will perform a model comparison using AIC to determine whether the interaction term improves our model’s performance.

AIC(two_smooth_model, factor_interact)

                       df      AIC
two_smooth_model 20.46960 5056.841
factor_interact  26.99693 4878.631

The AIC of our model with a factor interaction between the SampleDepth smooth and Season has a lower AIC score.

. . .

Together, the AIC and the plots confirm that including this interaction improves our model’s performance.

7.6 Interaction: two smoothed variables

Next, we’ll look at the interactions between two smoothed terms, SampleDepth and RelativeDepth.

smooth_interact <- gam(
  Sources ~ Season + s(SampleDepth, RelativeDepth),
  data = isit,
  method = "REML"
)
summary(smooth_interact)$s.table

                                  edf Ref.df        F p-value
s(SampleDepth,RelativeDepth) 27.12521  28.77 93.91722       0

7.7 Interaction: two smoothed variables

7.8 Interaction: two smoothed variables

vis.gam(
  smooth_interact,
  view = c("SampleDepth", "RelativeDepth"),
  theta = 50,
  n.grid = 50,
  lwd = .4
)

We can see a non-linear interaction clearly, where Sources is lowest at high values of SampleDepth and RelativeDepth, but increases with RelativeDepth while SampleDepth is low.

Remember, this plot can be rotated by changing the value of the theta argument. You can change the colour of the 3D plot using the color argument. Try specifying color = "cm" in vis.gam() above!

7.9 Interaction: two smoothed variables

So, we can see an interaction effect between these smooth terms in these plots.

Does including the interaction between s(SampleDepth) and s(RelativeDepth) improve our two_smooth_model model?

AIC(two_smooth_model, smooth_interact)

                       df      AIC
two_smooth_model 20.46960 5056.841
smooth_interact  30.33625 4943.890

. . .

The model with the interaction between s(SampleDepth) and s(RelativeDepth) has a lower AIC.

This means that the non-linear interaction improves our model’s performance, and our ability to understand the drivers of bioluminescence!

8 5. Generalization of the additive model

8.1 Generalization of the additive model

The basic additive model can be extended in the following ways:

Using other distributions for the response variable with the family argument (just as in a GLM),
Using different kinds of basis functions,
Using different kinds of random effects to fit mixed effect models.

We will now go over these aspects.

8.2 Generalized additive models

We have so far worked with simple (Gaussian) additive models, the non-linear equivalent to a linear model.

. . .

But, what can we do if: - the observations of the response variable do not follow a normal distribution? - the variance is not constant? (heteroscedasticity)

. . .

These cases are very common!

Just like generalized linear models (GLM), we can formulate generalized additive models to deal with these issues.

8.3 Generalized additive models

Let us return to our smooth interaction model for the bioluminescence data:

smooth_interact <- gam(
  Sources ~ Season + s(SampleDepth, RelativeDepth),
  data = isit,
  method = "REML"
)

summary(smooth_interact)$p.table

            Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 8.077356  0.4235432 19.070912 1.475953e-66
Season2     4.720806  0.6559436  7.196969 1.480113e-12

summary(smooth_interact)$s.table

                                  edf Ref.df        F p-value
s(SampleDepth,RelativeDepth) 27.12521  28.77 93.91722       0

9 6. GAM validation

9.1 GAM validation

As with a GLM, it is essential to check whether the model is correctly specified, especially in regards to the distribution of the response variable.

We need to verify:

The choice of basis dimension k.
The residuals plots (just as for a GLM).
The concurvity (equivalent of collinearity for smooth terms)

. . .

Useful functions included in mgcv:

k.check() performs a basis dimension check.
gam.check() produces residual plots (and also calls k.check()).
concurvity() produces summary measures of concurvity between gam components.

9.2 Selecting \(k\) basis dimensions

Remember the smoothing parameter \(\lambda\), which constrains the wiggliness of our smoothing functions?

This wiggliness is further controlled by the basis dimension \(k\), which sets the number of basis functions used to create a smooth function.

The more basis functions used to build a smooth function, the more wiggly the smooth:

In previous sections, we discussed the role of the smoothing parameter \(\lambda\) in controlling the wiggliness of our smoothing functions. This wiggliness is further controlled by the basis dimension \(k\), which sets the number of basis functions used to create a smooth function.

Each smooth in a GAM essentially the weighted sum of many smaller functions, called basis functions. The more basis functions used to build a smooth function, the more wiggly the smooth. As you can see below, a smooth with a small \(k\) basis dimension will be less wiggly than a smooth with a high \(k\) basis dimension.

9.3 Selecting \(k\) basis dimensions

The key to getting a good model fit is therefore to balance the trade-off between two things:

The smoothing parameter \(\lambda\), which penalizes wiggliness;
The basis dimension \(k\), which allows the model to wiggle according to our data.

Have we optimized the tradeoff between smoothness ( \(\lambda\) ) and wiggliness ( \(k\) ) in our model?

In other words, have we chosen a k that is large enough?

k.check(smooth_interact)

                             k'      edf   k-index p-value
s(SampleDepth,RelativeDepth) 29 27.12521 0.9448883    0.08

. . .

The EDF are very close to k, which means the wiggliness of the model is being overly constrained. .alert[The tradeoff between smoothness and wiggliness is not balanced].

Here we essentially ask: is our model wiggly enough?

The EDF are very close to k. This means the wiggliness of the model is being overly constrained by the default k, and could fit the data better with greater wiggliness. In other words, the tradeoff between smoothness and wiggliness is not balanced.

9.4 Selecting \(k\) basis dimensions

We can refit the model with a larger k:

            Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 8.129512  0.4177659 19.459491 1.911267e-68
Season2     4.629964  0.6512205  7.109672 2.741865e-12

                                  edf   Ref.df        F p-value
s(SampleDepth,RelativeDepth) 46.03868 54.21371 55.19817       0

. . .

Is k large enough this time?

k.check(smooth_interact_k60)

                             k'      edf  k-index p-value
s(SampleDepth,RelativeDepth) 59 46.03868 1.048626  0.8825

. . .

The EDF are much smaller than k. We can replace our previous model with this wigglier version:

smooth_interact <- smooth_interact_k60

The EDF are much smaller than k, which means this model fits the data better with additional wiggliness. We can replace our previous model with this wigglier version:

9.5 Choosing a distribution

As with any Normal model, we must check some model assumptions before continuing.

We can look at the residual plots with gam.check():

par(mfrow = c(2, 2))
gam.check(smooth_interact)

In addition to the plots, gam.check() also provides the output of k.check().

Recall: We can evaluate the distribution of the model residuals to verify these assumptions, just as we would do for a GLM (see class 6).

9.6 Choosing a distribution

. . .

Pronounced heteroscedasticity and some strong outliers in the residuals

These plots are a little different than those produced by plot for a linear model (e.g. no leverage plot)
Participants should already be familiar with residual plots (they are explaned more in detail in class 4 and class 6.

These residual plots highlight some problems: - Panel 2: The variance of the error is not constant (heteroscedasticity). - Panels 1 and 4: There are a few strong outlier patterns in this dataset.

9.7 Choosing a distribution

For our interaction model, we need a probability distribution that allows the variance to increase with the mean.

. . .

One family of distributions that has this property and that works well in a GAM is the Tweedie family.

A common link function for Tweedie distributions is the \(log\).

. . .

As in a GLM, we can use the family = argument in gam() to fit models with other distributions (including distributions such as binomial, poisson, gamma etc.).

To get an overview of families available in mgcv:

?family.mgcv

10 Exercise 3

10.1 Exercise 3

Fit a new model smooth_interact_tw with the same formula as the smooth_interact model but with a distribution from the Tweedie family (instead of the normal distribution) and log link function. You can do so by using family = tw(link = "log") inside gam().
Check the choice of k and the residual plots for the new model.
Compare smooth_interact_tw with smooth_interact. Which one would you choose?

. . .

Hint:

# Here is how we would write the model to specify the Normal distribution:
smooth_interact <- gam(
  Sources ~ Season + s(SampleDepth, RelativeDepth, k = 60),
  family = gaussian(link = "identity"),
  data = isit,
  method = "REML"
)

10.2 Exercise 3 - Solution

1. First, let us fit a new model with the Tweedie distribution and a log link function.

smooth_interact_tw <- gam(
  Sources ~ Season + s(SampleDepth, RelativeDepth, k = 60),
  family = tw(link = "log"),
  data = isit,
  method = "REML"
)
summary(smooth_interact_tw)$p.table

             Estimate Std. Error  t value      Pr(>|t|)
(Intercept) 1.3126641 0.03400390 38.60333 8.446478e-180
Season2     0.5350529 0.04837342 11.06089  1.961733e-26

summary(smooth_interact_tw)$s.table

                                  edf   Ref.df        F p-value
s(SampleDepth,RelativeDepth) 43.23949 51.57139 116.9236       0

10.3 Exercise 3 - Solution

2. Check the choice of k and the residual plots for the new model.

Next, we should check the basis dimension:

k.check(smooth_interact_tw)

                             k'      edf  k-index p-value
s(SampleDepth,RelativeDepth) 59 43.23949 1.015062  0.8125

The trade-off between \(k\) and EDF is well balanced. Great!

10.4 Exercise 3 - Solution

We should also verify the residual plots, to validate the Tweedie distribution:

par(mfrow = c(2, 2))
gam.check(smooth_interact_tw)

The residual plots do look much better, but it is clear that something is missing from the model. This could be a spatial affect (longtitude and latitude), or a random effect (e.g. based on Station).

10.5 Exercise 3 - Solution

3. Compare smooth_interact_tw with smooth_interact. Which one would you choose?

AIC(smooth_interact, smooth_interact_tw)

                         df      AIC
smooth_interact    49.47221 4900.567
smooth_interact_tw 47.86913 3498.490

AIC allows us to compare models that are based on different distributions!

. . .

Using a Tweedie instead of a Normal distribution greatly improves our model!

The AIC score for smooth_interact_tw is much smaller than the AIC score for the smooth_interact. Using a Tweedie instead of a Normal distribution greatly improves our model!

11 6. Changing the basis

11.1 Other smooth functions

To model a non-linear smooth variable or surface, smooth functions can be built in different ways:

s() → for modelling a 1-dimensional smooth or for modeling interactions among variables measured using the same unit and the same scale

te() → for modelling 2- or n-dimensional interaction surfaces of variables that are not on the same scale. Includes main effects.

ti() → for modelling 2- or n-dimensional interaction surfaces that do not include the main effects.

11.2 Parameters of smooth functions

The smooth functions have several parameters that can be set to change their behaviour. The most common parameters are :

k → basis dimension - determines the upper bound of the number of base functions used to build the curve. - constrains the wigglyness of a smooth. - k should be < the number of unique data points. - the complexity (i.e. non-linearity) of a smooth function in a fitted model is reflected by its effective degrees of freedom (EDF).

11.3 Parameters of smooth functions

The smooth functions have several parameters that can be set to change their behaviour. The most often-used parameters are:

d → specifies that predictors in the interaction are on the same scale or dimension (only used in te() and ti()). - For example, in te(Time, width, height, d=c(1,2)), indicates that width and height are one the same scale, but not Time.

bs → specifies the type of basis functions. - the default for s() is tp (thin plate regression spline) and for te() and ti() is cr (cubic regression spline).

11.4 Example: Cyclical data

Let’s use a time series of climate data, with monthly measurements, to find a temporal trend in yearly temperature. We’ll use the Nottingham temperature time series for this, which is included in R:

data(nottem)

Let’s plot the monthly temperature fluctuations for every year in the nottem dataset:

# the number of years of data (20 years)
n_years <- length(nottem) / 12

# categorical variable coding for the 12 months of the year, for every
# year sampled (so, a sequence 1 to 12 repeated for 20 years).
nottem_month <- rep(1:12, times = n_years)

# the year corresponding to each month in nottem_month
nottem_year <- rep(1920:(1920 + n_years - 1), each = 12)

See ?nottem for a more complete description of this dataset.

Using the nottem data, we have created three new vectors: - n_years corresponds to the number of years of data (20 years) - nottem_month is a categorical variable coding for the 12 months of the year, for every year sampled (so, a sequence 1 to 12 repeated for 20 years). - nottem_year is a variable containing the year corresponding to each month in nottem_month.

11.5 Example: Cyclical data

11.6 Example: Cyclical data

We can model both the cyclic change of temperature across months and the non-linear trend through years, using a cyclical cubic spline, or cc, for the month variable and a regular smooth for the year variable.

year_gam <- gam(
  nottem ~ s(nottem_year) + s(nottem_month, bs = "cc"),
  method = "REML"
)
summary(year_gam)$s.table

                     edf   Ref.df          F    p-value
s(nottem_year)  1.621375 2.011475   2.850888 0.06141004
s(nottem_month) 6.855132 8.000000 393.119285 0.00000000

11.7 Example: Cyclical data

plot(year_gam, page = 1, scale = 0)

There is about 1-1.5 degree rise in temperature over the period, but within a given year there is about 20 degrees variation in temperature, on average. The actual data vary around these values and that is the unexplained variance.

Here we can see one of the very interesting bonuses of using GAMs. We can either plot the response surface (fitted values) or the terms (contribution of each covariate) as shown here. You can imagine these as plots of the changing regression coefficients, and how their contribution (or effect size) varies over time. In the first plot, we see that positive contributions of temperature occurred post-1930.

12 7. Quick introduction to GAMMs

12.1 Dealing with non-independence

When observations are not independent, GAMs can be used to either incorporate:

a correlation structure to model autocorrelated residuals, such as:
the autoregressive (AR) model
the moving average model (MA); or,
a combination of both models (ARMA).
random effects that model independence between observations at the same site.- random effects that model independence among observations from the same site.

12.2 Autocorrelation of residuals

Autocorrelation of residuals refers to the degree of correlation between the residuals (the differences between actual and predicted values) in a time series model.

In other words, if there is autocorrelation of residuals in a time series model, it means that there is a pattern or relationship between the residuals at one time and the residuals at other times.

. . .

Autocorrelation of residuals is usually measured using the ACF (autocorrelation function) and pACF (partial autocorrelation function) graphs, which show the correlation between residuals at different lags.

If the ACF or pACF graphs show significant correlations at non-zero lags, there is evidence of autocorrelation in the residuals and the model may need to be modified or improved to better capture the underlying patterns in the data.

. . .

Let’s see how this works with our year_gam model!

12.3 Model with correlated errors

Let’s have a look at a model with temporal autocorrelation in the residuals. We will revisit the Nottingham temperature model and test for correlated errors using the (partial) autocorrelation function.

par(mfrow = c(1, 2))

acf(resid(year_gam), lag.max = 36, main = "ACF")

pacf(resid(year_gam), lag.max = 36, main = "pACF")

12.4 Model with correlated errors

ACF evaluates the cross correlation and pACF the partial correlation of a time series with itself at different time lags (i.e. similarity between observations at increasingly large time lags).

The ACF and pACF plots are thus used to identify the time steps are needed before observations are no longer autocorrelated.

. . .

The ACF plot of our model residuals suggests a significant lag of 1, and perhaps a lag of 2. Therefore, a low-order AR model is likely needed.

The autocorrelation function (ACF; first panel above) evaluates the cross correlation of a time series with itself at different time lags (i.e. similarity between observations at increasingly large time lags).

In contrast, the partial autocorrelation function (PACF: second panel above) gives the partial correlation of a time series with its own lagged values, after controlling for the values of the time series at all shorter lags.

In the ACF graph, the blue shaded region represents the confidence interval and the red dashed lines represent the limits of statistical significance.

12.5 Model with correlated errors

We can add AR structures to the model:

AR(1): correlation at 1 time step, or
AR(2) correlation at 2 time steps.

. . .

df <- data.frame(nottem, nottem_year, nottem_month)

year_gam <- gamm(
  nottem ~ s(nottem_year) + s(nottem_month, bs = "cc"),
  data = df
)

year_gam_AR1 <- gamm(
  nottem ~ s(nottem_year) + s(nottem_month, bs = "cc"),
  correlation = corARMA(form = ~ 1 | nottem_year, p = 1),
  data = df
)

year_gam_AR2 <- gamm(
  nottem ~ s(nottem_year) + s(nottem_month, bs = "cc"),
  correlation = corARMA(form = ~ 1 | nottem_year, p = 2),
  data = df
)

12.6 Model with correlated errors

Which of these models performs the best?

AIC(year_gam$lme, year_gam_AR1$lme, year_gam_AR2$lme)

                 df      AIC
year_gam$lme      5 1109.908
year_gam_AR1$lme  6 1101.218
year_gam_AR2$lme  7 1101.598

The AR(1) provides a significant increase in fit over the naive model (year_gam), but there is very little improvement in moving to the AR(2).

So, it is best to include only the AR(1) structure in our model.

12.7 Random effects

As we saw in the section about changing the basis, bs specifies the type of underlying base function. For random intercepts and linear random slopes we use bs = "re", but for random smooths we use bs = "fs".

. . .

3 different types of random effects in GAMMs (fac → factor coding for the random effect; x0 → continuous fixed effect):

random intercepts adjust the height of other model terms with a constant value: s(fac, bs = "re")
random slopes adjust the slope of the trend of a numeric predictor: s(fac, x0, bs = "re")
random smooths adjust the trend of a numeric predictor in a nonlinear way: s(x0, fac, bs = "fs", m = 1), where the argument m=1 sets a heavier penalty for the smooth moving away from 0, causing shrinkage to the mean.

12.8 GAMM with a random intercept

We will use the gamSim() function to generate a dataset with a random effect, then run a model with a random intercept using fac as the random factor.

gam_data2 <- gamSim(eg = 6)

4 term additive + random effectGu & Wahba 4 term additive model

# run random intercept model
gamm_intercept <- gam(
  y ~ s(x0) + s(fac, bs = "re"),
  data = gam_data2,
  method = "REML"
)

# examine model output
summary(gamm_intercept)$s.table

            edf   Ref.df         F   p-value
s(x0)  2.517203 3.141975  1.985496 0.1103459
s(fac) 2.969516 3.000000 97.518258 0.0000000

12.9 GAMM with a random intercept

plot(gamm_intercept, select = 2)

12.10 GAMM with a random intercept

We can plot the summed effects for the x0 without random effects, and then plot the predictions of all 4 levels of the random fac effect:

fac1 fac2 fac3 fac4

12.11 GAMM with a random slope

Next, we will run and plot a model with a random slope:

gamm_slope <- gam(
  y ~ s(x0) + s(x0, fac, bs = "re"),
  data = gam_data2,
  method = "REML"
)

summary(gamm_slope)$s.table

               edf   Ref.df         F   p-value
s(x0)     2.668737 3.329694  1.730252 0.1379086
s(x0,fac) 2.951988 3.000000 62.210584 0.0000000

12.12 GAMM with a random slope

12.13 GAMM with a random intercept and slope

gamm_int_slope <- gam(
  y ~ s(x0) + s(fac, bs = "re") + s(fac, x0, bs = "re"),
  data = gam_data2,
  method = "REML"
)

summary(gamm_int_slope)$s.table

                   edf   Ref.df            F   p-value
s(x0)     2.5172027233 3.141973 1.985541e+00 0.1103390
s(fac)    2.9694888330 3.000000 9.760135e+01 0.0000000
s(fac,x0) 0.0009910603 3.000000 3.353325e-04 0.4169633

12.14 GAMM with a random intercept and slope

12.15 GAMM with a random smooth

Lastly, we will examine a model with a random smooth.

gamm_smooth <- gam(
  y ~ s(x0) + s(x0, fac, bs = "fs", m = 1),
  data = gam_data2,
  method = "REML"
)

summary(gamm_smooth)$s.table

               edf    Ref.df        F   p-value
s(x0)     2.462632  3.051219 1.794069 0.1451071
s(x0,fac) 4.709845 35.000000 8.460955 0.0000000

12.16 GAMM with a random smooth

plot(gamm_smooth, select = 1)

select = 1 because the smooth slope appears as the first entry in your summary table.

12.17 GAMM with a random smooth

Here, if the random slope varied along x0, we would see different curves for each level.

12.18 GAMM

All of the mixed models from this section can be compared using AIC() to determine the best fit model

AIC(gamm_intercept, gamm_slope, gamm_int_slope, gamm_smooth)

                      df      AIC
gamm_intercept  8.141635 2205.121
gamm_slope      8.328850 2271.726
gamm_int_slope  8.143560 2205.123
gamm_smooth    11.150854 2207.386

The best model among those we have built here would be a GAMM with a random effect on the intercept.

13 Final info

13.1 Resources

This class was a brief introduction to basic concepts, and popular packages to help you estimate, evaluate, and visualise GAMs in R, but there is so much more to explore!

* The book Generalized Additive Models: An Introduction with R by Simon Wood (the author of the mgcv package). * Simon Wood’s website, maths.ed.ac.uk/~swood34/. * Gavin Simpson’s blog, From the bottom of the heap. * Gavin Simpson’s package gratia for GAM visualisation in ggplot2. * Generalized Additive Models: An Introduction with R course by Noam Ross. * Overview GAMM analysis of time series data tutorial by Jacolien van Rij. * Hierarchical generalized additive models in ecology: an introduction with mgcv by Pedersen et al. (2019).

Finally, the help pages, available through ?gam in R, are an excellent resource.

Those slides were adapted from the QCBS workshop series https://r.qcbs.ca/workshops/

Happy modelling