```r
library(tidyverse)
library(carData)
theme_set(theme_bw())
```
Extended Statistics In R
RAdelaide 2025
Beyond Linear Regression
- Linear Regression relies on Normally Distributed residuals
  - We estimate points, and the data will be normally distributed around those estimates
  - This relies on the mean (\(\mu\)) and standard deviation (\(\sigma\)) being independent
  - And on the data being continuous
- Some data is not normally distributed
  - Discrete data
  - Probabilities and proportions
Generalised Linear Models
- The underlying theory is well beyond the scope of this session
- We often use a generalisation of linear modelling to fit these models
  - Formally known as Generalised Linear Models (GLMs)
- GLMs cannot be fit using Least Squares
  - Parameters are instead estimated by maximising likelihood functions
  - Iterative solutions are used until the estimates converge
- GLMs always have an underlying distribution used for parameter estimation
  - Poisson Distribution: Log-linear Model
  - Binomial Distribution: Logistic Regression
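As a sketch of how this looks with `glm()`, where the `family` argument selects the underlying distribution (the data frame `df` and columns `y` and `x` are hypothetical):

```r
## Hypothetical data frame `df` with response y and predictor x
glm(y ~ x, data = df, family = poisson)  # log-linear model for counts
glm(y ~ x, data = df, family = binomial) # logistic regression for binary outcomes
```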
Logistic Regression
- Logistic regression estimates the probability (\(\pi\)) of an event occurring
  - A binary outcome \(\implies\) Success Vs Failure
  - Can also be considered a classification problem
  - Heavily used in machine learning
- Probabilities are bounded at [0, 1]
  - This makes fitting challenging near the boundary points
- Probabilities are therefore transformed using the logit transformation
  - Transforms values on [0, 1] to \([-\infty, \infty]\)
The Logit Transformation
\[ \text{logit}(\pi) = \log \frac{\pi}{1-\pi} \]
- \(\pi = 0.5 \rightarrow \text{logit}(\pi) = \log \frac{0.5}{0.5} = 0\)
- \(\pi = 0 \rightarrow \text{logit}(\pi) = \log \frac{0}{1} = -\infty\)
- \(\pi = 1 \rightarrow \text{logit}(\pi) = \log \frac{1}{0} = \infty\)
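These values are easily checked in R, where `qlogis()` implements the logit transformation and `plogis()` its inverse:

```r
qlogis(0.5) # logit(0.5) = 0
qlogis(0)   # -Inf
qlogis(1)   # Inf
plogis(0)   # the inverse transformation: back to 0.5
```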
An Example Of Logistic Regression
- A perfect example for logistic regression \(\implies\) survivors of the Titanic!
- The data is available in the package `carData`
  - If you don’t have this installed: `install.packages(c("car", "carData"))`
  - The package `car` is the Companion to Applied Regression
- Start a new R session and a new R script
The Titanic Survivors
- Let’s tidy up the data first
```r
titanic_tbl <- TitanicSurvival |>
  as_tibble(rownames = "name") |> # Tibbles are nice. Keep the passenger names
  dplyr::filter(if_all(everything(), \(x) !is.na(x))) # Discard incomplete records
```
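The following slides use a fitted model object `titanic_glm`; it was presumably created along the lines of the base model shown later, i.e.:

```r
## Logistic regression: survival as a function of sex, age and passenger class
titanic_glm <- glm(
  survived ~ sex + age + passengerClass, data = titanic_tbl, family = binomial
)
```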
Making Predictions
- We could obtain a probability for each passenger
- Might be considered to be training a model
```r
titanic_tbl |>
  ## Setting type = "response" will transform back to [0, 1]
  mutate(prob_survival = predict(titanic_glm, type = "response")) |>
  arrange(desc(prob_survival))
```
```
# A tibble: 1,046 × 6
   name                        survived sex   age passengerClass prob_survival
   <chr>                       <fct>    <fct> <dbl> <fct>                <dbl>
 1 Allison, Miss. Helen Lorai… no       fema…     2 1st                  0.969
 2 Carter, Miss. Lucile Polk   yes      fema…    14 1st                  0.954
 3 Madill, Miss. Georgette Al… yes      fema…    15 1st                  0.953
 4 Hippach, Miss. Jean Gertru… yes      fema…    16 1st                  0.951
 5 Lines, Miss. Mary Conover   yes      fema…    16 1st                  0.951
 6 Maioni, Miss. Roberta       yes      fema…    16 1st                  0.951
 7 Dick, Mrs. Albert Adrian (… yes      fema…    17 1st                  0.950
 8 Penasco y Castellana, Mrs.… yes      fema…    17 1st                  0.950
 9 Astor, Mrs. John Jacob (Ma… yes      fema…    18 1st                  0.948
10 Marvin, Mrs. Daniel Warner… yes      fema…    18 1st                  0.948
# ℹ 1,036 more rows
```
Checking Predictions
- If building a predictive model we often use an ROC curve to check accuracy
- Order by probability then plot the True/False Positive Rates
```r
titanic_tbl |>
  mutate(prob_survival = predict(titanic_glm, type = "response")) |>
  arrange(desc(prob_survival)) |>
  mutate(
    ## Calculate the True Positive Rate as we step through the predictions
    TPR = cumsum(survived == "yes") / sum(survived == "yes"),
    ## And the False Positive Rate as we step through in order
    FPR = cumsum(survived == "no") / sum(survived == "no")
  ) |>
  ggplot(aes(FPR, TPR)) +
  geom_line() +
  geom_abline(slope = 1)
```
Testing Multiple Models
```r
list(
  base_model = glm(
    survived ~ sex + age + passengerClass, data = titanic_tbl, family = binomial
  ),
  interactions = glm(
    survived ~ (sex + age + passengerClass)^2, data = titanic_tbl, family = binomial
  )
) |>
  lapply(\(x) mutate(titanic_tbl, pi = predict(x, type = "response"))) |>
  bind_rows(.id = "model") |>
  arrange(model, desc(pi)) |>
  mutate(
    TPR = cumsum(survived == "yes") / sum(survived == "yes"),
    FPR = cumsum(survived == "no") / sum(survived == "no"),
    .by = model
  ) |>
  ggplot(aes(FPR, TPR, colour = model)) +
  geom_line() +
  geom_abline(slope = 1)
```
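To put a number on the comparison, the area under each curve (AUC) can be approximated with the trapezoidal rule; a minimal sketch, assuming the TPR/FPR data frame built above (everything before the `ggplot()` call) has been saved as `roc_tbl`:

```r
roc_tbl |>
  summarise(
    ## Trapezoidal rule: interval widths times the average of the two heights
    auc = sum(diff(FPR) * (head(TPR, -1) + tail(TPR, -1)) / 2),
    .by = model
  )
```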
A Brief Comment
- If using logistic regression in this context:
  - Usually break the data into a training & test set
  - Assess the performance on the test set after training (a minimal sketch follows below)
- This forms the basis of some Machine Learning techniques
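A minimal sketch of that workflow, assuming a random 80/20 split of `titanic_tbl`:

```r
set.seed(101)
## Randomly allocate 80% of passengers to the training set
n <- nrow(titanic_tbl)
train_idx <- sample(n, size = round(0.8 * n))
train <- titanic_tbl[train_idx, ]
test <- titanic_tbl[-train_idx, ]
## Fit on the training data only, then predict on the unseen test data
fit <- glm(survived ~ sex + age + passengerClass, data = train, family = binomial)
test |> mutate(pi = predict(fit, newdata = test, type = "response"))
```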
Poisson Regression
- Another common type of GLM is Poisson regression
  - Fits the rate of an event occurring (\(\lambda\))
  - e.g. cars at an intersection per hour
- Rates are always > 0 \(\implies\) estimated on the log scale
- Poisson distributions have a variance that strictly equals the rate
- Overdispersed distributions allow for an inflated variance
  - Simply add an overdispersion term to allow for this
  - Very common in bioinformatics, e.g. RNA-Seq
- Quasipoisson is another parameterisation for overdispersed data
  - The Negative Binomial has a strict quadratic relationship between the mean and the variance
  - Not so strict for Quasipoisson
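A sketch of the syntax, with a hypothetical data frame `traffic` holding hourly car counts at several intersections:

```r
## Poisson model: variance is assumed to equal the rate
glm(cars ~ intersection, data = traffic, family = poisson)
## Quasipoisson adds an overdispersion term, relaxing that assumption
glm(cars ~ intersection, data = traffic, family = quasipoisson)
```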
A Brief Comment
- Early RNA-Seq analysis used Poisson models \(\implies\) counts / gene
- May also be appropriate for matching sequence motifs across a set of DNA sequences
- If the units of counting are of unequal size \(\implies\) use an `offset`
  - Counting trees in unevenly sized forests?
  - Number of crimes in cities with different sized populations
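A sketch of how an offset enters the model, with a hypothetical data frame `forests` holding tree counts from forests of unequal `area`:

```r
## log(area) enters as an offset, so the model estimates trees per unit area
glm(trees ~ region, data = forests, family = poisson, offset = log(area))
```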
Mixed-Effects Models
- Mixed-effects models are very common within ecological research
- Need to understand the difference between fixed and random effects
- Linear Regression fits fixed effects
  - Assumes the estimated effect is effectively the same across experiments
  - i.e. reproducible
- Random effects are effects which vary across a population rather than taking a single fixed value
  - e.g. the effect of a particular site on a response variable
  - The effect of a site is not fixed; it varies across sites
  - We can model this variation using random effects
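As a sketch using `lme4` (the data frame `field_data` and columns `response`, `treatment` and `site` are hypothetical):

```r
library(lme4)
## A fixed effect for treatment, plus a random intercept for each site,
## allowing the baseline response to vary from site to site
lmer(response ~ treatment + (1 | site), data = field_data)
```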
Generalised Mixed-Effects Models
- GLMMs are also heavily used in ecology \(\implies\) require genuine understanding
  - Very heavy on linear algebra
- Ben Bolker is the undisputed GLMM heavyweight
  - `lme4` has implemented `glmer()`
- Additional GLMM fitting in the packages `glmmADMB` and `glmmTMB`
  - Useful for challenging datasets
- Doug Bates (`nlme`) was a member of R Core but has left the `R` community
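A minimal sketch of a `glmer()` call for a binomial GLMM, again with hypothetical data frame and column names:

```r
library(lme4)
## Logistic regression for presence/absence, with a random intercept per site
glmer(present ~ treatment + (1 | site), data = field_data, family = binomial)
```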