Generalized Additive Models
An introduction
0.1 Required packages and datasets
For this class, we will be working with the following dataset:
You should also make sure you have downloaded, installed, and loaded the following packages:
install.packages(c('ggplot2', 'itsadug', 'mgcv'))0.2 Class overview
- The linear model… and where it fails
- Introduction to GAMs
- How GAMs work
- GAMs with multiple smooth terms
- GAMs with interaction terms
- GAM validation
- Choosing another distribution
- Changing the basis function
- Quick introduction to GAMMs (Mixed effect GAMs)
0.3 Learning objectives
- Use the
mgcvpackage to fit non-linear relationships, - Understand the output of a Generalized Additive Model (GAM) to help you understand your data,
- Use tests to determine if a non-linear model fits better than a linear one,
- Include smooth interactions between variables,
- Understand the idea of a basis function, and why it makes GAMs so powerful,
- Account for dependence in data (autocorrelation, hierarchical structure) using GAMMs.
1 The linear model
… and where it fails
1.1 Linear regression
Linear model with multiple predictors:
\[y_i = \beta_0 + \beta_1x_{1,i}+\beta_2x_{2,i}+\beta_3x_{3,i}+...+\beta_kx_{k,i} + \epsilon_i\]
. . .
As we saw in previous class, a linear model is based on four major assumptions:
- Linear relationship between response and predictor variables
- Normally distributed error: \[\epsilon_i \sim \mathcal{N}(0,\,\sigma^2)\]
- Homogeneity of the errors’ variance
- Independence of the errors (homoscedasticity)
A linear model can easily accommodate certain types of non-linear responses (e.g. \(x^2\)) but this approach strongly relies on (arbitrary or well-informed) choices, and is less flexible than using an additive model.
1.2 Linear regression
Linear models work very well in certain specific cases where all these criteria are met:
1.3 Linear regression
In reality, we often cannot meet these criteria. In many cases, linear models are inappropriate:
1.4 Linear regression
What’s the problem and do we fix it?
A linear model tries to fit the best straight line that passes through the data, so it doesn’t work well for all datasets.
In contrast, a GAM can capture complexe relationships by fitting a non-linear smooth function through the data, while controlling how wiggly the smooth can get (more on this later).
2 Introduction to GAMs
2.1 Generalized Additive Models (GAM)
Let us use an example to demonstrate the difference between a linear regression and an additive model.
First, we will load the ISIT dataset.
This dataset is comprised of bioluminescence levels (
Sources) in relation to depth, seasons and different stations.
For now, we will be focusing on Season 2.
isit2 <- subset(isit, Season == 2)Let’s begin by trying to fit a linear regression model to the relationship between Sources and SampleDepth.
linear_model <- gam(Sources ~ SampleDepth, data = isit2)2.2 Generalized Additive Models (GAM)
summary(linear_model)
Family: gaussian
Link function: identity
Formula:
Sources ~ SampleDepth
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.9021874 0.7963891 38.80 <2e-16 ***
SampleDepth -0.0083450 0.0003283 -25.42 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.588 Deviance explained = 58.9%
GCV = 60.19 Scale est. = 59.924 n = 453
The linear model is explaining quite a bit of variance in our dataset ( \(R_{adj}\) = 0.588), which means it’s a pretty good model, right?
2.3 Generalized Additive Models (GAM)
Well, let’s take a look at how our model fits the data:
data_plot <- ggplot(data = isit2, aes(y = Sources, x = SampleDepth)) +
geom_point() +
geom_line(aes(y = fitted(linear_model)), colour = "red", size = 1.2) +
theme_bw()
data_plotAre the assumptions of linear regression met in this case? As you may have noticed, we are violating the assumptions of the linear model:
- There is a strong non-linear relationship between
SourcesandSampleDepth. - The error is not normally distributed.
- The variance of the error is not homoscedastic.
- The errors are not independent of each other.
2.4 Generalized Additive Models (GAM)
Relationship between the response variable and predictor variables
One predictor variable: \[y_i = \beta_0 + f(x_i) + \epsilon\]
Multiple predictor variables: \[y_i = \beta_0 + f_1(x_{1,i}) + f_2(x_{2,i}) + ... + \epsilon\]
One big advantage of using a GAM over a manual specification of the model is that the optimal shape, i.e. the degree of smoothness of \(f(x)\), is determined automatically depending on the fitting method (usually maximum likelihood).
Strictly speaking, the equations are for a Gaussian GAM with identity link, which is also called “additive model” (without “generalized”).
Importantly, given that the smooth function \(f(x_i)\) is non-linear and local, the magnitude of the effect of the explanatory variable can vary over its range, depending on the relationship between the variable and the response.
That is, as opposed to one fixed coefficient \(\beta\), the function \(f\) can continually change over the range of \(x_i\). The degree of smoothness (or wiggliness) of \(f\) is controlled using penalized regression determined automatically in mgcv using a generalized cross-validation (GCV) routine.
2.5 Generalized Additive Models (GAM)
Let’s try to fit a model to the data using a smooth function s() with mgcv::gam()
gam_model <- gam(Sources ~ s(SampleDepth), data = isit2). . .
Family: gaussian
Link function: identity
Formula:
Sources ~ s(SampleDepth)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.8937 0.2471 52.17 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(SampleDepth) 8.908 8.998 214.1 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.81 Deviance explained = 81.4%
GCV = 28.287 Scale est. = 27.669 n = 453
We can try to build a more appropriate model by fitting the data with a smoothed (non-linear) term. In mgcv::gam(), smooth terms are specified by expressions of the form s(x), where \(x\) is the non-linear predictor variable we want to smooth. In this case, we want to apply a smooth function to SampleDepth.
The variance explained by our model has increased by more than 20% (\(R_{adj}\) = 0.81)!
2.6 Generalized Additive Models (GAM)
Note: as opposed to one fixed coefficient, \(\beta\) in linear regression, the smooth function can continually change over the range of the predictor \(x\).
When we compare the fit of the linear (red) and non-linear (blue) models, it is clear that the latter is more appropriate for our dataset.
2.7 Generalized Additive Models (GAM)
The mgcv package also includes a default plot() function to look at the smooths:
plot(gam_model)2.8 Test for linearity using GAM
How do we test whether the non-linear model offers a significant improvement over the linear model?
We can use gam() and AIC() to compare the performance of a linear model containing x as a linear predictor to the performance of a non-linear model containing s(x) as a smooth predictor.
linear_model <- gam(Sources ~ SampleDepth, data = isit2)
smooth_model <- gam(Sources ~ s(SampleDepth), data = isit2)
AIC(linear_model, smooth_model) df AIC
linear_model 3.00000 3143.720
smooth_model 10.90825 2801.451
Here, the AIC of the smooth GAM is lower, which indicates that adding a smoothing function improves model performance. Linearity is therefore not supported by our data.
In other words, we ask whether adding a smooth function to the linear model improves the fit of the model to our dataset.
As a brief explanation, the Akaike Information Criterion (AIC) is a comparative metric of model performance, where lower scores indicate that a model is performing “better” compared to other considered models.
2.9 Test for linearity using GAM
We can use gam() and anova() to test whether an assumption of linearity is justified. To do so, we must simply set our smoothed model so that it is nested in our linear model.
linear_model <- gam(y_obs ~ x) # fit a regular linear model using gam()
nested_gam_model <- gam(y_obs ~ s(x) + x)
anova(linear_model, nested_gam_model, test = "Chisq")3 Exercise 1
3.1 Exercise 1
We will now try to determine whether this the data recorded in the first season should be modelled with a linear regression, or with an additive model.
Let’s repeat the comparison test with gam() and AIC() using the data recorded in the first season only:
isit1 <- subset(isit, Season == 1)- Fit a linear and smoothed GAM model to the relation between
SampleDepthandSources. - Determine if linearity is justified for this data.
- How many effective degrees of freedom does the smoothed term have?
We have not discussed effective degrees of freedom (EDF) yet, but these are a key tool to help us interpret the fit of a GAM. Keep this term in mind. More on this in the next sections!
3.2 Exercise 1 - Solution
1. Fit a linear and smoothed GAM model to the relation between SampleDepth and Sources.
linear_model_s1 <- gam(Sources ~ SampleDepth, data = isit1)
smooth_model_s1 <- gam(Sources ~ s(SampleDepth), data = isit1)3.3 Exercise 1 - Solution
2. Determine whether a linear model is appropriate for this data.
As before, visualizing the model fit on our dataset is an excellent first step to determine whether our model is performing well.
3.4 Exercise 1 - Solution
We can supplement this with a quantitative comparison of model performance using AIC() or a \(\chi^2\).
AIC(linear_model_s1, smooth_model_s1) df AIC
linear_model_s1 3.000000 2324.905
smooth_model_s1 9.644938 2121.249
The lower AIC score indicates that smooth model is performing better than the linear model, which confirms that linearity is not appropriate for our dataset.
anova(linear_model_s1, smooth_model_s1, type = "chisq")3.5 Exercise 1 - Solution
3. How many effective degrees of freedom does the smoothed term have?
To get the effective degrees of freedom, we can simply print the model object:
smooth_model_s1
Family: gaussian
Link function: identity
Formula:
Sources ~ s(SampleDepth)
Estimated degrees of freedom:
7.64 total = 8.64
GCV score: 32.13946
The effective degrees of freedom (EDF) are >> 1.
Keep this in mind, because we will be coming back to EDF soon!
4 3. How GAMs work
4.1 How GAMs work
We will now take a few minutes to look at what GAMs are doing behind the scenes. Let us first consider a simple model containing one smooth function \(f\) of one covariate, \(x\):
\[y_i = f(x_i) + \epsilon_i\]
To estimate the smooth function \(f\), we need to represented the above equation in such a way that it becomes a linear model. This can be done by defining basis functions, \(b_j(x)\), of which \(f\) is composed:
\[f(x) = \sum_{j=1}^q b_j(x) \times \beta_j\]
4.2 Example: a polynomial basis
Suppose that \(f\) is believed to be a 4th order polynomial, so that the space of polynomials of order 4 and below contains \(f\). A basis for this space would then be:
\[b_0(x)=1 \ , \quad b_1(x)=x \ , \quad b_2(x)=x^2 \ , \quad b_3(x)=x^3 \ , \quad b_4(x)=x^4\]
so that \(f(x)\) becomes:
\[f(x) = \beta_0 + x\beta_1 + x^2\beta_2 + x^3\beta_3 + x^4\beta_4\]
and the full model now becomes:
\[y_i = \beta_0 + x_i\beta_1 + x^2_i\beta_2 + x^3_i\beta_3 + x^4_i\beta_4 + \epsilon_i\]
4.3 Example: a polynomial basis
The basis functions are each multiplied by a real valued parameter, \(\beta_j\), and are then summed to give the final curve \(f(x)\). Here is an example of a polynomial basis of order 4.
4.4 Example: a cubic spline basis
A cubic spline is a curve constructed from sections of a cubic polynomial joined together so that they are continuous in value. Each section of cubic has different coefficients.
4.5 Example: a cubic spline basis
Here is a representation of a smooth function using a rank 5 cubic spline basis with knot locations at increments of 0.2:
Here, the knots are evenly spaced through the range of observed x values. However, the choice of the degree of model smoothness is controlled by the the number of knots, which was arbitrary.
Is there a better way to select the knot locations?
4.6 Controlling the degree of smoothing
Instead of controlling smoothness by altering the number of knots, we keep that fixed to size a little larger than reasonably necessary, and control the model’s smoothness by adding a “wiggleness” penalty.
So, rather than fitting the model by minimizing (as with least squares regression):
\[||y - XB||^{2}\]
it can be fit by minimizing:
\[||y - XB||^{2} + \lambda \int_0^1[f^{''}(x)]^2dx\]
As \(\lambda\) goes to infinity, the model becomes linear.
4.7 Controlling the degree of smoothing
If \(\lambda\) is too high then the data will be over smoothed, and if it is too low then the data will be under smoothed.
Ideally, it is best to choose \(\lambda\) so that the predicted \(\hat{f}\) is as close as possible to \(f\). A suitable criterion might be to choose \(\lambda\) to minimize:
\[M = 1/n \times \sum_{i=1}^n (\hat{f_i} - f_i)^2\]
Since \(f\) is unknown, \(M\) must be estimated.
The recommend methods for this are maximum likelihood (ML) or restricted maximum likelihood estimation (REML). Generalized cross validation (GCV) is another possibility.
4.8 The principle behind cross validation
fits many of the data poorly and does no better with the missing point.
fits the underlying signal quite well, smoothing through the noise and the missing datum is reasonably well predicted.
fits the noise as well as the signal and the extra variability induced causes it to predict the missing datum rather poorly.
5 4. GAM with multiple smooth terms
5.1 GAM with linear and smooth terms
GAMs make it easy to include both smooth and linear terms, multiple smoothed terms, and smoothed interactions.
For this section, we will use the ISIT data again. We will try to model the response Sources using the predictors Season and SampleDepth simultaneously.
First, we need to convert our categorical predictor (Season) into a factor variable.
Remember this dataset from previous sections? The ISIT dataset is comprised of bioluminescence levels (Sources) in relation to depth, seasons and different stations.
5.2 GAM with linear and smooth terms
Let us start with a basic model, with one smoothed term (SampleDepth) and one categorical predictor (Season, which has 2 levels).
basic_model <- gam(
Sources ~ Season + s(SampleDepth),
data = isit,
method = "REML"
)
summary(basic_model)
Family: gaussian
Link function: identity
Formula:
Sources ~ Season + s(SampleDepth)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.2533 0.3613 20.08 <2e-16 ***
Season2 6.1561 0.4825 12.76 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(SampleDepth) 8.706 8.975 184.4 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.692 Deviance explained = 69.5%
-REML = 2615.8 Scale est. = 42.413 n = 789
5.3 GAM with linear and smooth terms
What do these plots tell us about the relationship between bioluminescence, sample depth, and seasons?
Bioluminescence varies non-linearly across the SampleDepth gradient, with highest levels of bioluminescence at the surface, followed by a second but smaller peak just above a depth of 1500, with declining levels at deeper depths.
There is also a pronounced difference in bioluminescence between the seasons, with high levels during Season 2, compared to Season 1.
5.4 Effective degrees of freedom (EDF)
summary(basic_model)$s.table edf Ref.df F p-value
s(SampleDepth) 8.706426 8.975172 184.3583 0
The edf shown in the s.table are the effective degrees of freedom (EDF) of the the smooth term s(SampleDepth).
Essentially, more EDF imply more complex, wiggly splines:
A value close to 1 tends to be close to a linear term.
A higher value means that the spline is more wiggly (non-linear).
. . .
In our basic model the EDF of the smooth function s(SampleDepth) are ~9, which suggests a highly non-linear curve.
5.5 Effective degrees of freedom (EDF)
Let us come back to the concept of effective degrees of freedom (EDF).
Effective degrees of freedom give us a lot of information about the relationship between model predictors and response variables.
. . .
You might recognize the term “degrees of freedom” from previous classs about linear models, but be careful!
The effective degrees of freedom of a GAM are estimated differently from the degrees of freedom in a linear regression, and are interpreted differently.
. . .
In linear regression, the model degrees of freedom are equivalent to the number of non-redundant free parameters \(p\) in the model, and the residual degrees of freedom are given by \(n-p\).
Because the number of free parameters in GAMs is difficult to define, the EDF are instead related to the smoothing parameter \(\lambda\), such that the greater the penalty, the smaller the EDF.
Let us come back to the concept of effective degrees of freedom (EDF).
5.6 Effective degrees of freedom (EDF) and \(k\)
An upper bound on the EDF is determined by the basis dimension \(k\) for each smooth function (the EDF cannot exceed \(k-1\))
In practice, the exact choice of \(k\) is arbitrary, but it should be large enough to accommodate a sufficiently complex smooth function.
We will talk about choosing \(k\) in upcoming sections.
5.7 GAM with linear and smooth terms
We can add a second term, RelativeDepth, but specify a linear relationship with Sources
two_term_model <- gam(
Sources ~ Season + s(SampleDepth) + RelativeDepth,
data = isit,
method = "REML"
)
summary(two_term_model)
Family: gaussian
Link function: identity
Formula:
Sources ~ Season + s(SampleDepth) + RelativeDepth
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.8083055 0.6478742 15.139 < 2e-16 ***
Season2 6.0419306 0.4767978 12.672 < 2e-16 ***
RelativeDepth -0.0014019 0.0002968 -4.723 2.76e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(SampleDepth) 8.699 8.974 132.5 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.7 Deviance explained = 70.4%
-REML = 2612 Scale est. = 41.303 n = 789
The regression coefficient which is estimated for this new linear term, RelativeDepth, will appear in the p.table. Remember, the p.table shows information on the parametric effects (linear terms).
In the s.table, we will once again find the non-linear smoother, s(SampleDepth), and its wiggleness parameter (edf). Remember, the s.table shows information on the additive effects (non-linear terms).
5.8 GAM with linear and smooth terms
5.9 GAM with multiple smooth terms
We can model RelativeDepth as a smooth term instead.
two_smooth_model <- gam(
Sources ~ Season + s(SampleDepth) + s(RelativeDepth),
data = isit,
method = "REML"
)
summary(two_smooth_model)
Family: gaussian
Link function: identity
Formula:
Sources ~ Season + s(SampleDepth) + s(RelativeDepth)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.9378 0.3453 22.99 <2e-16 ***
Season2 4.9640 0.4782 10.38 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(SampleDepth) 8.752 8.973 150.37 <2e-16 ***
s(RelativeDepth) 8.044 8.750 19.98 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.748 Deviance explained = 75.4%
-REML = 2550.1 Scale est. = 34.589 n = 789
The regression coefficient which is estimated for our only linear term, Season, will appear in the p.table.
In the s.table, we will now find two non-linear smoothers, s(SampleDepth) and s(RelativeDepth), and their wiggleness parameters (edf).
5.10 GAM with multiple smooth terms
Let us take a look at the relationships between the linear and non-linear predictors and our response variable.
Do you think that the additional non-linear term improves the performance of our model representing the relationship between bioluminescence and relative depth?
5.11 GAM with multiple smooth terms
As before, we can compare our models with AIC to test whether the smoothed term improves our model’s performance.
AIC(basic_model, two_term_model, two_smooth_model) df AIC
basic_model 11.83374 5208.713
two_term_model 12.82932 5188.780
two_smooth_model 20.46960 5056.841
We can see that
two_smooth_modelhas the lowest AIC value.
The best fit model includes both smooth terms for SampleDepth and RelativeDepth, and a linear term for Season.
6 Exercise 2
6.1 Exercise 2
For our second Exercise, we will be building onto our model by adding variables which we think might be ecologically significant predictors to explain bioluminescence.
- Create two new models: Add
Latitudetotwo_smooth_model, first as a linear term, then as a smooth term. - Is
Latitudean important term to include? DoesLatitudehave a linear or additive effect?
Use plots, coefficient tables, and the
AIC()function to help you answer this question.
6.2 Exercise 2 - Solution
1. Create two new models: Add Latitude to two_smooth_model, first as a linear term, then as a smooth term.
# Add Latitude as a linear term
three_term_model <- gam(
Sources ~
Season + s(SampleDepth) + s(RelativeDepth) + Latitude, #<<
data = isit,
method = "REML"
)
# Add Latitude as a smooth term
three_smooth_model <- gam(
Sources ~
Season + s(SampleDepth) + s(RelativeDepth) + s(Latitude), #<<
data = isit,
method = "REML"
)6.3 Exercise 2 - Solution
2. Is Latitude an important term to include? Does Latitude have a linear or additive effect?
Let us begin by plotting the the 4 effects that are now included in each model.
6.4 Exercise 2 - Solution
We should also look at our coefficient tables. What can the EDF tell us about the wiggliness of our predictors’ effects?
summary(three_smooth_model)$s.table edf Ref.df F p-value
s(SampleDepth) 8.766891 8.975682 68.950905 0
s(RelativeDepth) 8.007411 8.730625 17.639321 0
s(Latitude) 7.431116 8.296838 8.954349 0
. . .
The EDF are all quite high for all of our smooth variables, including Latitude.
This tells us that Latitude is quite wiggly, and probably should not be included as a linear term.
6.5 Exercise 2 - Solution
Before deciding which model is “best”, we should test whether Latitude is best included as a linear or as a smooth term, using AIC():
AIC(three_smooth_model, three_term_model) df AIC
three_smooth_model 28.20032 4990.546
three_term_model 21.47683 5051.415
. . .
Our model including Latitude as a smooth term has a lower AIC score, meaning it performs better than our model including Latitude as a linear term.
. . .
But, does adding Latitude as a smooth predictor actually improve on our last “best” model (two_smooth_model)?
6.6 Exercise 2 - Solution
AIC(two_smooth_model, three_smooth_model) df AIC
two_smooth_model 20.46960 5056.841
three_smooth_model 28.20032 4990.546
Our three_smooth_model has a lower AIC score than our previous best model (two_smooth_model), which did not include Latitude.
. . .
This implies that Latitude is indeed an informative non-linear predictor of bioluminescence.
7 4. Interactions
7.1 GAM with interaction terms
There are two ways to include interactions between variables:
for two smoothed variables :
s(x1, x2)for one smoothed variable and one linear variable (either factor or continuous): use the
byarguments(x1, by = x2)When
x2is a factor, you have a smooth term that vary between different levels ofx2When
x2is continuous, the linear effect ofx2varies smoothly withx1When
x2is a factor, the factor needs to be added as a main effect in the model
7.2 Interaction smoothed & factor variables
We will examine interaction effects to determine whether the non-linear smoother s(SampleDepth) varies across different levels of Season.
factor_interact <- gam(
Sources ~ Season +
s(SampleDepth, by = Season) +
s(RelativeDepth),
data = isit,
method = "REML"
)
summary(factor_interact)$s.table edf Ref.df F p-value
s(SampleDepth):Season1 6.839386 7.552045 95.119422 0
s(SampleDepth):Season2 8.744574 8.966290 154.474325 0
s(RelativeDepth) 6.987223 8.055898 6.821074 0
7.3 Interaction smoothed & factor variables
The first two panels show the interaction effect of the SampleDepth smooth and each level of our factor variable, Season.
A good question to ask participants: Do you see a difference between the two smooth curves?
The plots show some differences between the shape of the smooth terms among the two levels of Season. The most notable difference is the peak in the second panel, which tells us that there is an effect of SampleDepth between 1000 and 2000 that is important in Season 2, but does not occur in Season 1.
This hints that the interaction effect could be important to include in our model.
7.4 Interaction smoothed & factor variables
We can also visualise our model in 3D using vis.gam().
vis.gam(factor_interact, theta = 120, n.grid = 50, lwd = .4)We can change the rotation of this plot using the theta argument.
When we plot the interaction terms, we saw differences in the shape of the SampleDepth smooth among the two levels of Season. This hints that the interaction effect could be important to include in our model.
7.5 Interaction smoothed & factor variables
These plots hint that the interaction effect could be important to include in our model.
We will perform a model comparison using AIC to determine whether the interaction term improves our model’s performance.
AIC(two_smooth_model, factor_interact) df AIC
two_smooth_model 20.46960 5056.841
factor_interact 26.99693 4878.631
The AIC of our model with a factor interaction between the SampleDepth smooth and Season has a lower AIC score.
. . .
Together, the AIC and the plots confirm that including this interaction improves our model’s performance.
7.6 Interaction: two smoothed variables
Next, we’ll look at the interactions between two smoothed terms, SampleDepth and RelativeDepth.
smooth_interact <- gam(
Sources ~ Season + s(SampleDepth, RelativeDepth),
data = isit,
method = "REML"
)
summary(smooth_interact)$s.table edf Ref.df F p-value
s(SampleDepth,RelativeDepth) 27.12521 28.77 93.91722 0
7.7 Interaction: two smoothed variables
7.8 Interaction: two smoothed variables
vis.gam(
smooth_interact,
view = c("SampleDepth", "RelativeDepth"),
theta = 50,
n.grid = 50,
lwd = .4
)We can see a non-linear interaction clearly, where Sources is lowest at high values of SampleDepth and RelativeDepth, but increases with RelativeDepth while SampleDepth is low.
Remember, this plot can be rotated by changing the value of the theta argument. You can change the colour of the 3D plot using the color argument. Try specifying color = "cm" in vis.gam() above!
7.9 Interaction: two smoothed variables
So, we can see an interaction effect between these smooth terms in these plots.
Does including the interaction between s(SampleDepth) and s(RelativeDepth) improve our two_smooth_model model?
AIC(two_smooth_model, smooth_interact) df AIC
two_smooth_model 20.46960 5056.841
smooth_interact 30.33625 4943.890
. . .
The model with the interaction between s(SampleDepth) and s(RelativeDepth) has a lower AIC.
This means that the non-linear interaction improves our model’s performance, and our ability to understand the drivers of bioluminescence!
8 5. Generalization of the additive model
8.1 Generalization of the additive model
The basic additive model can be extended in the following ways:
- Using other distributions for the response variable with the
familyargument (just as in a GLM), - Using different kinds of basis functions,
- Using different kinds of random effects to fit mixed effect models.
We will now go over these aspects.
8.2 Generalized additive models
We have so far worked with simple (Gaussian) additive models, the non-linear equivalent to a linear model.
. . .
But, what can we do if: - the observations of the response variable do not follow a normal distribution? - the variance is not constant? (heteroscedasticity)
. . .
These cases are very common!
Just like generalized linear models (GLM), we can formulate generalized additive models to deal with these issues.
8.3 Generalized additive models
Let us return to our smooth interaction model for the bioluminescence data:
smooth_interact <- gam(
Sources ~ Season + s(SampleDepth, RelativeDepth),
data = isit,
method = "REML"
)
summary(smooth_interact)$p.table Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.077356 0.4235432 19.070912 1.475953e-66
Season2 4.720806 0.6559436 7.196969 1.480113e-12
summary(smooth_interact)$s.table edf Ref.df F p-value
s(SampleDepth,RelativeDepth) 27.12521 28.77 93.91722 0
9 6. GAM validation
9.1 GAM validation
As with a GLM, it is essential to check whether the model is correctly specified, especially in regards to the distribution of the response variable.
We need to verify:
- The choice of basis dimension
k. - The residuals plots (just as for a GLM).
- The concurvity (equivalent of collinearity for smooth terms)
. . .
Useful functions included in mgcv:
-
k.check()performs a basis dimension check. -
gam.check()produces residual plots (and also callsk.check()). -
concurvity()produces summary measures of concurvity between gam components.
9.2 Selecting \(k\) basis dimensions
Remember the smoothing parameter \(\lambda\), which constrains the wiggliness of our smoothing functions?
This wiggliness is further controlled by the basis dimension \(k\), which sets the number of basis functions used to create a smooth function.
The more basis functions used to build a smooth function, the more wiggly the smooth:
In previous sections, we discussed the role of the smoothing parameter \(\lambda\) in controlling the wiggliness of our smoothing functions. This wiggliness is further controlled by the basis dimension \(k\), which sets the number of basis functions used to create a smooth function.
Each smooth in a GAM essentially the weighted sum of many smaller functions, called basis functions. The more basis functions used to build a smooth function, the more wiggly the smooth. As you can see below, a smooth with a small \(k\) basis dimension will be less wiggly than a smooth with a high \(k\) basis dimension.
9.3 Selecting \(k\) basis dimensions
The key to getting a good model fit is therefore to balance the trade-off between two things:
- The smoothing parameter \(\lambda\), which penalizes wiggliness;
- The basis dimension \(k\), which allows the model to wiggle according to our data.
Have we optimized the tradeoff between smoothness ( \(\lambda\) ) and wiggliness ( \(k\) ) in our model?
In other words, have we chosen a k that is large enough?
k.check(smooth_interact) k' edf k-index p-value
s(SampleDepth,RelativeDepth) 29 27.12521 0.9448883 0.08
. . .
The EDF are very close to k, which means the wiggliness of the model is being overly constrained. .alert[The tradeoff between smoothness and wiggliness is not balanced].
Here we essentially ask: is our model wiggly enough?
The EDF are very close to k. This means the wiggliness of the model is being overly constrained by the default k, and could fit the data better with greater wiggliness. In other words, the tradeoff between smoothness and wiggliness is not balanced.
9.4 Selecting \(k\) basis dimensions
We can refit the model with a larger k:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.129512 0.4177659 19.459491 1.911267e-68
Season2 4.629964 0.6512205 7.109672 2.741865e-12
edf Ref.df F p-value
s(SampleDepth,RelativeDepth) 46.03868 54.21371 55.19817 0
. . .
Is k large enough this time?
k.check(smooth_interact_k60) k' edf k-index p-value
s(SampleDepth,RelativeDepth) 59 46.03868 1.048626 0.8825
. . .
The EDF are much smaller than k. We can replace our previous model with this wigglier version:
smooth_interact <- smooth_interact_k60The EDF are much smaller than k, which means this model fits the data better with additional wiggliness. We can replace our previous model with this wigglier version:
9.5 Choosing a distribution
As with any Normal model, we must check some model assumptions before continuing.
We can look at the residual plots with gam.check():
In addition to the plots, gam.check() also provides the output of k.check().
Recall: We can evaluate the distribution of the model residuals to verify these assumptions, just as we would do for a GLM (see class 6).
9.6 Choosing a distribution
. . .
Pronounced heteroscedasticity and some strong outliers in the residuals
These plots are a little different than those produced by
plotfor a linear model (e.g. no leverage plot)Participants should already be familiar with residual plots (they are explaned more in detail in class 4 and class 6.
These residual plots highlight some problems: - Panel 2: The variance of the error is not constant (heteroscedasticity). - Panels 1 and 4: There are a few strong outlier patterns in this dataset.
9.7 Choosing a distribution
For our interaction model, we need a probability distribution that allows the variance to increase with the mean.
. . .
One family of distributions that has this property and that works well in a GAM is the Tweedie family.
A common link function for Tweedie distributions is the \(log\).
. . .
As in a GLM, we can use the family = argument in gam() to fit models with other distributions (including distributions such as binomial, poisson, gamma etc.).
To get an overview of families available in mgcv:
?family.mgcv10 Exercise 3
10.1 Exercise 3
- Fit a new model
smooth_interact_twwith the same formula as thesmooth_interactmodel but with a distribution from the Tweedie family (instead of the normal distribution) andloglink function. You can do so by usingfamily = tw(link = "log")insidegam(). - Check the choice of
kand the residual plots for the new model. - Compare
smooth_interact_twwithsmooth_interact. Which one would you choose?
. . .
Hint:
# Here is how we would write the model to specify the Normal distribution:
smooth_interact <- gam(
Sources ~ Season + s(SampleDepth, RelativeDepth, k = 60),
family = gaussian(link = "identity"),
data = isit,
method = "REML"
)10.2 Exercise 3 - Solution
1. First, let us fit a new model with the Tweedie distribution and a log link function.
smooth_interact_tw <- gam(
Sources ~ Season + s(SampleDepth, RelativeDepth, k = 60),
family = tw(link = "log"),
data = isit,
method = "REML"
)
summary(smooth_interact_tw)$p.table Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.3126641 0.03400390 38.60333 8.446478e-180
Season2 0.5350529 0.04837342 11.06089 1.961733e-26
summary(smooth_interact_tw)$s.table edf Ref.df F p-value
s(SampleDepth,RelativeDepth) 43.23949 51.57139 116.9236 0
10.3 Exercise 3 - Solution
2. Check the choice of k and the residual plots for the new model.
Next, we should check the basis dimension:
k.check(smooth_interact_tw) k' edf k-index p-value
s(SampleDepth,RelativeDepth) 59 43.23949 1.015062 0.8125
The trade-off between \(k\) and EDF is well balanced. Great!
10.4 Exercise 3 - Solution
We should also verify the residual plots, to validate the Tweedie distribution:
The residual plots do look much better, but it is clear that something is missing from the model. This could be a spatial affect (longtitude and latitude), or a random effect (e.g. based on Station).
10.5 Exercise 3 - Solution
3. Compare smooth_interact_tw with smooth_interact. Which one would you choose?
AIC(smooth_interact, smooth_interact_tw) df AIC
smooth_interact 49.47221 4900.567
smooth_interact_tw 47.86913 3498.490
AIC allows us to compare models that are based on different distributions!
. . .
Using a Tweedie instead of a Normal distribution greatly improves our model!
The AIC score for smooth_interact_tw is much smaller than the AIC score for the smooth_interact. Using a Tweedie instead of a Normal distribution greatly improves our model!
11 6. Changing the basis
11.1 Other smooth functions
To model a non-linear smooth variable or surface, smooth functions can be built in different ways:
s() → for modelling a 1-dimensional smooth or for modeling interactions among variables measured using the same unit and the same scale
te() → for modelling 2- or n-dimensional interaction surfaces of variables that are not on the same scale. Includes main effects.
ti() → for modelling 2- or n-dimensional interaction surfaces that do not include the main effects.
11.2 Parameters of smooth functions
The smooth functions have several parameters that can be set to change their behaviour. The most common parameters are :
k → basis dimension - determines the upper bound of the number of base functions used to build the curve. - constrains the wigglyness of a smooth. - k should be < the number of unique data points. - the complexity (i.e. non-linearity) of a smooth function in a fitted model is reflected by its effective degrees of freedom (EDF).
11.3 Parameters of smooth functions
The smooth functions have several parameters that can be set to change their behaviour. The most often-used parameters are:
d → specifies that predictors in the interaction are on the same scale or dimension (only used in te() and ti()). - For example, in te(Time, width, height, d=c(1,2)), indicates that width and height are one the same scale, but not Time.
bs → specifies the type of basis functions. - the default for s() is tp (thin plate regression spline) and for te() and ti() is cr (cubic regression spline).
11.4 Example: Cyclical data
Let’s use a time series of climate data, with monthly measurements, to find a temporal trend in yearly temperature. We’ll use the Nottingham temperature time series for this, which is included in R:
data(nottem)Let’s plot the monthly temperature fluctuations for every year in the nottem dataset:
# the number of years of data (20 years)
n_years <- length(nottem) / 12
# categorical variable coding for the 12 months of the year, for every
# year sampled (so, a sequence 1 to 12 repeated for 20 years).
nottem_month <- rep(1:12, times = n_years)
# the year corresponding to each month in nottem_month
nottem_year <- rep(1920:(1920 + n_years - 1), each = 12)See ?nottem for a more complete description of this dataset.
Using the nottem data, we have created three new vectors: - n_years corresponds to the number of years of data (20 years) - nottem_month is a categorical variable coding for the 12 months of the year, for every year sampled (so, a sequence 1 to 12 repeated for 20 years). - nottem_year is a variable containing the year corresponding to each month in nottem_month.
11.5 Example: Cyclical data
Using the nottem data, we have created three new vectors: - n_years corresponds to the number of years of data (20 years) - nottem_month is a categorical variable coding for the 12 months of the year, for every year sampled (so, a sequence 1 to 12 repeated for 20 years). - nottem_year is a variable containing the year corresponding to each month in nottem_month.
11.6 Example: Cyclical data
We can model both the cyclic change of temperature across months and the non-linear trend through years, using a cyclical cubic spline, or cc, for the month variable and a regular smooth for the year variable.
year_gam <- gam(
nottem ~ s(nottem_year) + s(nottem_month, bs = "cc"),
method = "REML"
)
summary(year_gam)$s.table edf Ref.df F p-value
s(nottem_year) 1.621375 2.011475 2.850888 0.06141004
s(nottem_month) 6.855132 8.000000 393.119285 0.00000000
11.7 Example: Cyclical data
plot(year_gam, page = 1, scale = 0)There is about 1-1.5 degree rise in temperature over the period, but within a given year there is about 20 degrees variation in temperature, on average. The actual data vary around these values and that is the unexplained variance.
Here we can see one of the very interesting bonuses of using GAMs. We can either plot the response surface (fitted values) or the terms (contribution of each covariate) as shown here. You can imagine these as plots of the changing regression coefficients, and how their contribution (or effect size) varies over time. In the first plot, we see that positive contributions of temperature occurred post-1930.
12 7. Quick introduction to GAMMs
12.1 Dealing with non-independence
When observations are not independent, GAMs can be used to either incorporate:
- a correlation structure to model autocorrelated residuals, such as:
- the autoregressive (AR) model
- the moving average model (MA); or,
- a combination of both models (ARMA).
- random effects that model independence between observations at the same site.- random effects that model independence among observations from the same site.
12.2 Autocorrelation of residuals
Autocorrelation of residuals refers to the degree of correlation between the residuals (the differences between actual and predicted values) in a time series model.
In other words, if there is autocorrelation of residuals in a time series model, it means that there is a pattern or relationship between the residuals at one time and the residuals at other times.
. . .
Autocorrelation of residuals is usually measured using the ACF (autocorrelation function) and pACF (partial autocorrelation function) graphs, which show the correlation between residuals at different lags.
If the ACF or pACF graphs show significant correlations at non-zero lags, there is evidence of autocorrelation in the residuals and the model may need to be modified or improved to better capture the underlying patterns in the data.
. . .
Let’s see how this works with our year_gam model!
12.7 Random effects
As we saw in the section about changing the basis, bs specifies the type of underlying base function. For random intercepts and linear random slopes we use bs = "re", but for random smooths we use bs = "fs".
. . .
3 different types of random effects in GAMMs (fac → factor coding for the random effect; x0 → continuous fixed effect):
-
random intercepts adjust the height of other model terms with a constant value:
s(fac, bs = "re") -
random slopes adjust the slope of the trend of a numeric predictor:
s(fac, x0, bs = "re") -
random smooths adjust the trend of a numeric predictor in a nonlinear way:
s(x0, fac, bs = "fs", m = 1), where the argument m=1 sets a heavier penalty for the smooth moving away from 0, causing shrinkage to the mean.
12.8 GAMM with a random intercept
We will use the gamSim() function to generate a dataset with a random effect, then run a model with a random intercept using fac as the random factor.
gam_data2 <- gamSim(eg = 6)4 term additive + random effectGu & Wahba 4 term additive model
# run random intercept model
gamm_intercept <- gam(
y ~ s(x0) + s(fac, bs = "re"),
data = gam_data2,
method = "REML"
)
# examine model output
summary(gamm_intercept)$s.table edf Ref.df F p-value
s(x0) 2.517203 3.141975 1.985496 0.1103459
s(fac) 2.969516 3.000000 97.518258 0.0000000
12.9 GAMM with a random intercept
plot(gamm_intercept, select = 2)12.10 GAMM with a random intercept
We can plot the summed effects for the x0 without random effects, and then plot the predictions of all 4 levels of the random fac effect:
fac1 fac2 fac3 fac4
12.11 GAMM with a random slope
Next, we will run and plot a model with a random slope:
gamm_slope <- gam(
y ~ s(x0) + s(x0, fac, bs = "re"),
data = gam_data2,
method = "REML"
)summary(gamm_slope)$s.table edf Ref.df F p-value
s(x0) 2.668737 3.329694 1.730252 0.1379086
s(x0,fac) 2.951988 3.000000 62.210584 0.0000000
12.12 GAMM with a random slope
12.13 GAMM with a random intercept and slope
gamm_int_slope <- gam(
y ~ s(x0) + s(fac, bs = "re") + s(fac, x0, bs = "re"),
data = gam_data2,
method = "REML"
)summary(gamm_int_slope)$s.table edf Ref.df F p-value
s(x0) 2.5172027233 3.141973 1.985541e+00 0.1103390
s(fac) 2.9694888330 3.000000 9.760135e+01 0.0000000
s(fac,x0) 0.0009910603 3.000000 3.353325e-04 0.4169633
12.14 GAMM with a random intercept and slope
12.15 GAMM with a random smooth
Lastly, we will examine a model with a random smooth.
gamm_smooth <- gam(
y ~ s(x0) + s(x0, fac, bs = "fs", m = 1),
data = gam_data2,
method = "REML"
)
summary(gamm_smooth)$s.table edf Ref.df F p-value
s(x0) 2.462632 3.051219 1.794069 0.1451071
s(x0,fac) 4.709845 35.000000 8.460955 0.0000000
12.16 GAMM with a random smooth
plot(gamm_smooth, select = 1)select = 1 because the smooth slope appears as the first entry in your summary table.
12.17 GAMM with a random smooth
Here, if the random slope varied along x0, we would see different curves for each level.
12.18 GAMM
All of the mixed models from this section can be compared using AIC() to determine the best fit model
AIC(gamm_intercept, gamm_slope, gamm_int_slope, gamm_smooth) df AIC
gamm_intercept 8.141635 2205.121
gamm_slope 8.328850 2271.726
gamm_int_slope 8.143560 2205.123
gamm_smooth 11.150854 2207.386
The best model among those we have built here would be a GAMM with a random effect on the intercept.
13 Final info
13.1 Resources
This class was a brief introduction to basic concepts, and popular packages to help you estimate, evaluate, and visualise GAMs in R, but there is so much more to explore!
* The book Generalized Additive Models: An Introduction with R by Simon Wood (the author of the mgcv package). * Simon Wood’s website, maths.ed.ac.uk/~swood34/. * Gavin Simpson’s blog, From the bottom of the heap. * Gavin Simpson’s package gratia for GAM visualisation in ggplot2. * Generalized Additive Models: An Introduction with R course by Noam Ross. * Overview GAMM analysis of time series data tutorial by Jacolien van Rij. * Hierarchical generalized additive models in ecology: an introduction with mgcv by Pedersen et al. (2019).
Finally, the help pages, available through ?gam in R, are an excellent resource.
Those slides were adapted from the QCBS workshop series https://r.qcbs.ca/workshops/
Note that the model
y_obs~s(x)gives exactly the same results asy_obs~s(x)+x. We used the s(x)+x to illustrate the nestedness of the model, but the +x can be omitted.