Some assumptions are made on the residuals and others on the independent variables. None are made on the (unconditioned) dependent variable.
Residuals are assumed to:
have a mean of zero
be independent
be normally distributed
be homoscedastic
Independent variables are assumed to:
have a linear relation with Y
be measured without error
be independent from each other
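These assumptions can be checked on a fitted model. The following is a minimal sketch on simulated data; the variable names (`x1`, `x2`, `y`, `fit`) and the simulated coefficients are illustrative assumptions, not taken from the course data.

```r
# Simulated data that satisfies the assumptions by construction (hypothetical example)
set.seed(1)
n  <- 200
x1 <- runif(n, 0, 10)
x2 <- rnorm(n)
y  <- 2 + 0.5 * x1 - 1 * x2 + rnorm(n, sd = 1)  # linear relation, Gaussian errors

fit <- lm(y ~ x1 + x2)
res <- residuals(fit)

# Residuals: mean of zero (holds by construction when an intercept is fitted)
mean_res <- mean(res)

# Residuals: normality (Shapiro-Wilk test)
normality <- shapiro.test(res)

# Residuals: homoscedasticity, i.e. no trend of residual spread with fitted values
spread_trend <- cor.test(abs(res), fitted(fit))

# Independent variables: not strongly correlated with each other
predictor_cor <- cor(x1, x2)
```

In practice, `plot(fit)` gives the usual visual diagnostics (residuals vs fitted values, QQ plot), which are often more informative than the formal tests.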
Maximum likelihood
Maximum likelihood estimation (MLE) is a technique for estimating the parameters of a given distribution from observed data.
For example:
If a population is known to follow a normal distribution but its mean and variance are unknown, MLE can be used to estimate them from a limited sample of the population.
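As a rough illustration of this example, the sketch below estimates the mean and variance of a simulated normal sample by numerically maximizing the log-likelihood. The sample, the starting values and the function name `neg_loglik` are assumptions made for the illustration.

```r
# Hypothetical sample from a normal population with unknown mean and variance
set.seed(42)
sample_data <- rnorm(50, mean = 10, sd = 2)

# Negative log-likelihood; sigma is optimised on the log scale so it stays positive
neg_loglik <- function(par, x) {
  mu    <- par[1]
  sigma <- exp(par[2])
  -sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
}

# Minimising the negative log-likelihood is the same as maximising the likelihood
fit <- optim(par = c(median(sample_data), 0), fn = neg_loglik, x = sample_data)

mu_hat     <- fit$par[1]
sigma2_hat <- exp(fit$par[2])^2

# Closed-form MLEs for comparison: sample mean and variance with denominator n
n <- length(sample_data)
closed_form_mean <- mean(sample_data)
closed_form_var  <- var(sample_data) * (n - 1) / n
```

The numerical estimates should agree closely with the closed-form values. Note that the MLE of the variance divides by n rather than n - 1, so it is slightly smaller than `var(sample_data)`.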
Likelihood vs probability
We maximize the likelihood and make inferences on the probability
Likelihood
\[
L(\text{parameters} \mid \text{data})
\]
How likely it is to get those parameters given the data.
Probability
\[
P(\text{data} \mid \text{null parameters})
\]
The probability of obtaining the data given the null parameters, i.e. how probable the data are under the null model.
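The distinction can be made concrete with a small binomial example. This is a sketch under assumed numbers (7 successes in 10 trials, null value p = 0.5), none of which come from the source.

```r
# Hypothetical data: 7 successes out of 10 trials
successes <- 7
trials    <- 10

# Likelihood: the data are fixed, the parameter p varies
p_grid     <- seq(0.01, 0.99, by = 0.01)
likelihood <- dbinom(successes, trials, prob = p_grid)
p_hat      <- p_grid[which.max(likelihood)]  # value of p that maximises the likelihood

# Probability: the parameter is fixed at its null value, the data vary
prob_data_null <- dbinom(successes, trials, prob = 0.5)

# Inference: probability of data at least as extreme as observed under the null
p_value <- binom.test(successes, trials, p = 0.5)$p.value
```

Here the likelihood is maximized at the observed proportion, while the p-value is computed entirely under the null parameter value.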
A GLM expresses the transformed conditional expectation of the dependent variable Y as a linear combination of the regression variables X.
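In symbols, this statement is commonly written as follows, where the link function \(g\) is the transformation and \(\beta_0, \dots, \beta_p\) are the regression coefficients (this notation is an assumption, it does not appear in the source):

\[
g\big(E[Y \mid X]\big) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p
\]

For instance, R's `glm` uses the logit link by default for the binomial family and the log link for the Poisson family.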
The model has 3 components:
a dependent variable Y with a response distribution to model it: Gaussian, binomial, Bernoulli, Poisson, negative binomial, zero-inflated …, zero-truncated …, …
m1 <- glm(reproduction ~ age, data = mouflon, family = binomial)
summary(m1)
Call:
glm(formula = reproduction ~ age, family = binomial, data = mouflon)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.19921    0.25417   12.59   <2e-16 ***
age         -0.36685    0.03287  -11.16   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 928.86 on 715 degrees of freedom
Residual deviance: 767.51 on 714 degrees of freedom
(4 observations deleted due to missingness)
AIC: 771.51
Number of Fisher Scoring iterations: 4
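The coefficients in the `summary(m1)` output are on the logit scale. The sketch below back-transforms them and derives a likelihood-ratio test from the printed deviances. The numbers are copied from the output; the ages used for prediction are hypothetical, and the unit of `age` (presumably years) is an assumption.

```r
# Estimates copied from the summary(m1) output
b0    <- 3.19921
b_age <- -0.36685

# Odds ratio: multiplicative change in the odds of reproduction per unit of age
odds_ratio <- exp(b_age)

# Predicted probability of reproduction at a few hypothetical ages (inverse logit)
ages    <- c(2, 5, 10)
p_repro <- plogis(b0 + b_age * ages)

# Likelihood-ratio test of the age effect, from the null and residual deviances
lr_stat <- 928.86 - 767.51
lr_df   <- 715 - 714
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
```

With the fitted model object available, `predict(m1, newdata = data.frame(age = ages), type = "response")` and `anova(m1, test = "Chisq")` should give the same quantities directly.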
gala <- read.csv("data/gala.csv")
plot(Species ~ Area, gala)
plot(Species ~ log(Area), gala)
hist(gala$Species)
modpl <- glm(Species ~ Area + Elevation + Nearest, family = poisson, data = gala)
summary(modpl)
Call:
glm(formula = Species ~ Area + Elevation + Nearest, family = poisson,
data = gala)
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.548e+00  3.933e-02  90.211  < 2e-16 ***
Area        -5.529e-05  1.890e-05  -2.925  0.00344 **
Elevation    1.588e-03  5.040e-05  31.502  < 2e-16 ***
Nearest      5.921e-03  1.466e-03   4.039 5.38e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 3510.7 on 29 degrees of freedom
Residual deviance: 1797.8 on 26 degrees of freedom
AIC: 1966.7
Number of Fisher Scoring iterations: 5
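A quick first check on a Poisson fit is to compare the residual deviance with its degrees of freedom: for a well-fitting Poisson model the ratio should be close to 1. The sketch below uses the numbers printed in the `summary(modpl)` output; the 100-unit elevation step is a hypothetical choice for illustration.

```r
# Values copied from the summary(modpl) output
resid_dev <- 1797.8
resid_df  <- 26

# Rough dispersion indicator: values far above 1 suggest overdispersion
dispersion_ratio <- resid_dev / resid_df

# Deviance goodness-of-fit test (a small p-value indicates lack of fit)
gof_p <- pchisq(resid_dev, df = resid_df, lower.tail = FALSE)

# Coefficients are on the log scale: multiplicative effect on the expected
# species count of 100 extra units of Elevation (step size chosen for illustration)
elev_effect <- exp(1.588e-03 * 100)
```

A ratio this far above 1 is a warning that the standard errors and p-values reported for the Poisson model are probably too optimistic.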
library(DHARMa)
res <- simulateResiduals(modpl)
testDispersion(res)
testZeroInflation(res)
DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated
data: simulationOutput
dispersion = 110.32, p-value < 2.2e-16
alternative hypothesis: two.sided
DHARMa zero-inflation test via comparison to expected zeros with
simulation under H0 = fitted model
data: simulationOutput
ratioObsSim = NaN, p-value = 1
alternative hypothesis: two.sided
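To see what these two checks look for, the sketch below fits a Poisson GLM to simulated overdispersed counts and computes base-R analogues: a Pearson dispersion statistic and a comparison of observed with expected zeros. This is not DHARMa's simulation-based procedure, only an approximation of the idea, and the simulated data and all variable names are assumptions.

```r
# Hypothetical overdispersed counts drawn from a negative binomial distribution
set.seed(123)
n  <- 200
x  <- runif(n)
mu <- exp(1 + 1.5 * x)
y  <- rnbinom(n, mu = mu, size = 0.5)

# Poisson GLM, which wrongly assumes variance equal to the mean
fit_pois <- glm(y ~ x, family = poisson)

# Pearson dispersion statistic: about 1 if the Poisson assumption holds
pearson_disp <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)

# Zero inflation: observed zeros vs zeros expected under the fitted Poisson model
obs_zeros <- sum(y == 0)
exp_zeros <- sum(dpois(0, lambda = fitted(fit_pois)))

# One possible remedy: quasi-Poisson keeps the same coefficients but scales
# the standard errors by the estimated dispersion
fit_quasi <- glm(y ~ x, family = quasipoisson)
quasi_disp <- summary(fit_quasi)$dispersion
```

A ratio of observed to expected zeros is undefined (0/0) when the data contain no zeros and the model predicts essentially none, which may be why the zero-inflation test above reports `NaN`. The negative binomial family listed among the response distributions is another common option for overdispersed counts.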