Extended Statistics In R

RAdelaide 2025

Dr Stevie Pederson

Black Ochre Data Labs
The Kids Research Institute Australia

July 10, 2025

Beyond Linear Regression

Beyond Linear Regression

  • Linear Regression relies on Normally Distributed residuals
    • We estimate fitted values, and the data are assumed to be normally distributed around these estimates
  • This relies on the mean (\(\mu\)) and standard deviation (\(\sigma\)) being independent
    • And on data being continuous
  • Some data is not normally distributed
    • Discrete data
    • Probabilities and proportions

Generalised Linear Models

  • We often use a generalisation of linear modelling for these types of data
    • Formally known as Generalised Linear Models (GLMs)
  • GLMs cannot be fit using Least Squares
    • Parameters are estimated by maximising the likelihood function
    • Iterative solutions are used until the estimates converge
  • Always have an underlying distribution used for parameter estimation
    • Poisson Distribution: Log-linear Model
    • Binomial Distribution: Logistic Regression

Logistic Regression

  • Logistic regression estimates the probability (\(\pi\)) of an event occurring
    • A binary outcome \(\implies\) Success Vs Failure
  • Can also be considered a classification problem
  • Heavily used in machine learning
  • Probabilities are bounded on [0, 1]
    • This makes fitting challenging near the boundaries
  • Probabilities are transformed using the logit transformation
    • Transforms values on [0, 1] to \([-\infty, \infty]\)

The Logit Transformation

\[ \text{logit}(\pi) = \log \frac{\pi}{1-\pi} \]

  • \(\pi = 0.5 \rightarrow \text{logit}(\pi) = \log \frac{0.5}{0.5} = 0\)
  • \(\pi = 0 \rightarrow \text{logit}(\pi) = \log \frac{0}{1} = -\infty\)
  • \(\pi = 1 \rightarrow \text{logit}(\pi) = \log \frac{1}{0} = \infty\)
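  • Base R implements this pair of functions: qlogis() is the logit and plogis() its inverse, so we can sanity-check the values above
## qlogis() performs the logit transformation
qlogis(c(0.25, 0.5, 0.75)) # Symmetric around zero
## plogis() is the inverse, returning values on [0, 1]
plogis(0)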

The Logit Transformation

  • Parameters from logistic regression estimate \(\text{logit}(\pi)\)
  • Estimates near zero \(\implies \pi = 0.5\)
  • Estimates above zero \(\implies \pi > 0.5\)
  • Estimates below zero \(\implies \pi < 0.5\)

An Example Of Logistic Regression

  • A perfect example for logistic regression \(\implies\) survivors of the Titanic!
  • Available in the package carData
    • If you don’t have this installed: install.packages(c("car", "carData"))
    • The package car is the Companion to Applied Regression (not motor vehicles)
  • Start a new R session and new R script
library(tidyverse)
library(carData)
theme_set(theme_bw())

The Titanic Survivors

  • Let’s tidy up the data first
titanic_tbl <- TitanicSurvival |>
  as_tibble(rownames = "name") |> # Tibbles are nice. Keep the passenger names
  dplyr::filter(if_all(everything(), \(x) !is.na(x))) # Discard incomplete records

The Titanic Survivors

  • And a quick preview using only the categorical variables
titanic_tbl |>
  count(survived, sex, passengerClass) |>
  pivot_wider(names_from = survived, values_from = n) |>
  mutate(
    prob = yes / (no + yes)
  )
# A tibble: 6 × 5
  sex    passengerClass    no   yes  prob
  <fct>  <fct>          <int> <int> <dbl>
1 female 1st                5   128 0.962
2 female 2nd               11    92 0.893
3 female 3rd               80    72 0.474
4 male   1st               98    53 0.351
5 male   2nd              135    23 0.146
6 male   3rd              290    59 0.169
  • It looks rather like sex and class might play a role

The Titanic Survivors

  • For logistic regression, we need to specify the underlying distribution
    \(\implies\) family = binomial
  • This tells glm we are fitting binomial probabilities
## Fit a model using sex, age and passengerClass to predict survival probability
titanic_glm <- glm(
  survived ~ sex + age + passengerClass, data = TitanicSurvival, family = binomial
)

The Titanic Survivors

summary(titanic_glm)

Call:
glm(formula = survived ~ sex + age + passengerClass, family = binomial, 
    data = TitanicSurvival)

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)        3.522074   0.326702  10.781  < 2e-16 ***
sexmale           -2.497845   0.166037 -15.044  < 2e-16 ***
age               -0.034393   0.006331  -5.433 5.56e-08 ***
passengerClass2nd -1.280570   0.225538  -5.678 1.36e-08 ***
passengerClass3rd -2.289661   0.225802 -10.140  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1414.62  on 1045  degrees of freedom
Residual deviance:  982.45  on 1041  degrees of freedom
  (263 observations deleted due to missingness)
AIC: 992.45

Number of Fisher Scoring iterations: 4
  • How would we interpret these coefficients?
  • What does the Intercept represent?

The Titanic Survivors

  • For a 50-year-old woman in 1st Class
x <- c(1, 0, 50, 0, 0) # (Intercept), sexmale, age, passengerClass2nd, passengerClass3rd
logit_pi <- sum(x * coef(titanic_glm))
logit_pi
[1] 1.802412
  • This is > 0 \(\implies\) >50% chance of survival
inv_logit <- \(x) {exp(x) / (1 + exp(x))} # Convert back to probabilities
inv_logit(logit_pi) # The actual probability of survival
[1] 0.8584423
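  • The same probability can be obtained directly from predict(), here using a hypothetical new passenger matching the example above
new_passenger <- data.frame(sex = "female", age = 50, passengerClass = "1st")
predict(titanic_glm, newdata = new_passenger, type = "response") # Should match inv_logit(logit_pi)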

Making Predictions

  • We could obtain a probability for each passenger
    • This might be considered training the model
titanic_tbl |>
  ## Setting type = "response" will transform back to [0,1]
  mutate(prob_survival = predict(titanic_glm, type = "response")) |>
  arrange(desc(prob_survival))
# A tibble: 1,046 × 6
   name                        survived sex     age passengerClass prob_survival
   <chr>                       <fct>    <fct> <dbl> <fct>                  <dbl>
 1 Allison, Miss. Helen Lorai… no       fema…     2 1st                    0.969
 2 Carter, Miss. Lucile Polk   yes      fema…    14 1st                    0.954
 3 Madill, Miss. Georgette Al… yes      fema…    15 1st                    0.953
 4 Hippach, Miss. Jean Gertru… yes      fema…    16 1st                    0.951
 5 Lines, Miss. Mary Conover   yes      fema…    16 1st                    0.951
 6 Maioni, Miss. Roberta       yes      fema…    16 1st                    0.951
 7 Dick, Mrs. Albert Adrian (… yes      fema…    17 1st                    0.950
 8 Penasco y Castellana, Mrs.… yes      fema…    17 1st                    0.950
 9 Astor, Mrs. John Jacob (Ma… yes      fema…    18 1st                    0.948
10 Marvin, Mrs. Daniel Warner… yes      fema…    18 1st                    0.948
# ℹ 1,036 more rows

Checking Predictions

  • If building a predictive model we often use an ROC curve to check accuracy
    • Order by probability then plot the True/False Positive Rates
titanic_tbl |>
  mutate(prob_survival = predict(titanic_glm, type = "response")) |>
  arrange(desc(prob_survival)) |>
  mutate(
    ## Calculate the True Positive Rate as we step through the predictions
    TPR = cumsum(survived == "yes") / sum(survived == "yes"),
    ## And the False Positive Rate as we step through in order
    FPR = cumsum(survived == "no") / sum(survived == "no")
  ) |>
  ggplot(aes(FPR, TPR)) +
  geom_line() +
  geom_abline(slope = 1)
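  • The Area Under the Curve (AUC) summarises this in a single number: 0.5 is random guessing, 1 is perfect classification. A minimal sketch using the trapezoidal rule on the same values (the name roc_df is just for illustration):
roc_df <- titanic_tbl |>
  mutate(prob_survival = predict(titanic_glm, type = "response")) |>
  arrange(desc(prob_survival)) |>
  mutate(
    TPR = cumsum(survived == "yes") / sum(survived == "yes"),
    FPR = cumsum(survived == "no") / sum(survived == "no")
  )
## Trapezoidal approximation to the area under the ROC curve
with(roc_df, sum(diff(FPR) * (head(TPR, -1) + tail(TPR, -1)) / 2))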

Checking Predictions

Testing Multiple Models

list(
  base_model = glm(
    survived ~ sex + age + passengerClass, data = titanic_tbl, family = binomial
  ),
  interactions = glm(
    survived ~ (sex + age + passengerClass)^2, data = titanic_tbl, family = binomial 
  )
) |>
  lapply(\(x) mutate(titanic_tbl, pi = predict(x, type = "response"))) |>
  bind_rows(.id = "model") |>
  arrange(model, desc(pi)) |>
  mutate(
    TPR = cumsum(survived == "yes") / sum(survived == "yes"),
    FPR = cumsum(survived == "no") / sum(survived == "no"),
    .by = model
  ) |>
  ggplot(aes(FPR, TPR, colour = model)) +
  geom_line() +
  geom_abline(slope = 1)

Testing Multiple Models

A Brief Comment

  • If using logistic regression in this context:
    • Usually break data into a training & test set
    • Assess the performance on the test set after training
  • This forms the basis of some Machine Learning techniques
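  • A minimal sketch of that workflow, assuming a simple 80/20 random split (the names train, test and accuracy are illustrative)
set.seed(101)
train <- titanic_tbl |> slice_sample(prop = 0.8)
test <- anti_join(titanic_tbl, train, by = "name")
fit <- glm(survived ~ sex + age + passengerClass, data = train, family = binomial)
test |>
  mutate(
    prob = predict(fit, newdata = test, type = "response"),
    predicted = ifelse(prob > 0.5, "yes", "no") # Classify using a 0.5 cutoff
  ) |>
  summarise(accuracy = mean(predicted == survived))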

Poisson Regression

  • Another common type of GLM is Poisson regression
  • Fitting the rate of an event occurring (\(\lambda\))
    • e.g. cars at an intersection per hour
    • Rates are always > 0 \(\implies\) estimated on the log scale
  • Poisson distributions have a variance that strictly equals the rate
  • Overdispersed distributions allow the variance to exceed the rate
    • Simply add an overdispersion term to account for this
    • Very common in bioinformatics, e.g. RNA-Seq

Poisson Regression

  • InsectSprays provides counts of insects after applying insecticides A-F
  • The units used for counting were an equal size (but are undefined)
    • We can assume fewer insects is better
InsectSprays |> 
  summarise(
    ave = mean(count), .by = spray
  )
  spray       ave
1     A 14.500000
2     B 15.333333
3     C  2.083333
4     D  4.916667
5     E  3.500000
6     F 16.666667

Poisson Regression

insect_glm <- glm(count ~ spray, data = InsectSprays, family = poisson)
summary(insect_glm)

Call:
glm(formula = count ~ spray, family = poisson, data = InsectSprays)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.67415    0.07581  35.274  < 2e-16 ***
sprayB       0.05588    0.10574   0.528    0.597    
sprayC      -1.94018    0.21389  -9.071  < 2e-16 ***
sprayD      -1.08152    0.15065  -7.179 7.03e-13 ***
sprayE      -1.42139    0.17192  -8.268  < 2e-16 ***
sprayF       0.13926    0.10367   1.343    0.179    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 409.041  on 71  degrees of freedom
Residual deviance:  98.329  on 66  degrees of freedom
AIC: 376.59

Number of Fisher Scoring iterations: 5
  • How do we interpret these coefficients?
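  • Because the model is fit on the log scale, exponentiating the coefficients returns values to the scale of the counts, recovering the group means from earlier
exp(coef(insect_glm)["(Intercept)"]) # Mean count for spray A (~14.5)
## The other coefficients are log fold-changes relative to spray A
exp(coef(insect_glm)["(Intercept)"] + coef(insect_glm)["sprayC"]) # Mean count for spray C (~2.08)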

Poisson Regression

  • Model checking for Poisson models
    • Do residuals appear consistently distributed?
plot(
  fitted(insect_glm), 
  residuals(insect_glm, type = "pearson")
)
abline(h = 0, col = "red")

Poisson Regression

  • Model checking for Poisson models
    • Is there evidence of overdispersion?
deviance(insect_glm) / df.residual(insect_glm)
[1] 1.489828
  • If this value is >>1 \(\implies\) overdispersion
  • Maybe try a quasipoisson instead of poisson?
    • family = quasipoisson
    • Will probably be more conservative
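  • A sketch of the quasi-Poisson refit: the coefficient estimates are unchanged, but the standard errors are scaled by the estimated dispersion
insect_quasi <- glm(count ~ spray, data = InsectSprays, family = quasipoisson)
summary(insect_quasi)$dispersion # Estimated from the data instead of fixed at 1
summary(insect_quasi)$coefficients # Same estimates, larger standard errors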

A Brief Comment

  • Early RNA-Seq analysis used Poisson models \(\implies\) counts per gene
  • May also be appropriate for matching sequence motifs across a set of DNA sequences
  • If the units of counting are of unequal size \(\implies\) use an offset, as sketched below
    • Counting trees in unevenly sized forests?
    • Number of crimes in cities with different-sized populations
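  • As a sketch with hypothetical data, the offset log(area) turns the counts into rates per unit area
## Hypothetical counts of trees across forests of unequal area
forests <- data.frame(
  trees = c(120, 310, 45, 80),
  area = c(10, 25, 5, 8),
  region = c("north", "north", "south", "south")
)
glm(trees ~ region + offset(log(area)), data = forests, family = poisson)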

Mixed-Effects Models

Mixed-Effects Models

  • Mixed-effects models are very common within ecological research
    • Need to understand the difference between fixed and random effects
  • Linear Regression fits fixed effects
    • Assumes the estimated effect is effectively the same across experiments
    • Reproducible
  • Random effects are the effects of variables whose levels are not fixed
    • e.g. the effect of a particular site on a response variable
    • This effect is not fixed; it varies across sites
    • We can model this variation using random effects

Mixed-Effects Models

  • How do we model random effects?
  • I learned these as nested models fitted within distinct groups defined by random variables
  • Could be the date of a delivery
    \(\implies\) can be observed and modelled but not reproduced
    • The Intercept term for each group (\(i\)) is \(\mu_i \sim \mathcal{N}(\mu, \sigma_{\mu})\)
    • Response terms follow standard regression assumptions e.g. \(e_{ij} \sim \mathcal{N}(0, \sigma)\)
  • We have two components to the variance
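  • For this random-intercept structure the components add, giving the total variance for observation \(j\) in group \(i\):

\[ \text{Var}(y_{ij}) = \sigma_{\mu}^2 + \sigma^2 \]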

Mixed-Effects Models

  • The syntax in R is super-clumsy
  • Most commonly, people use the package lme4 and the function lmer()
    • Early mixed-effects models used nlme \(\implies\) now largely superseded by lme4
library(lme4)
data(Machines, package = "nlme")
  • We are measuring productivity using machines A, B or C (fixed)
    • Nested within each of 6 factory workers
    • Each individual will have different productivity
      \(\implies\) measurable & modellable but random across the population

Mixed-Effects Models

  • To denote the nesting, or random effect, we use (1 | Worker)
    • Machine is specified as per usual for fixed effects
machines_lmer <- lmer(score ~ Machine + (1 | Worker), data = Machines)
summary(machines_lmer)
Linear mixed model fit by REML ['lmerMod']
Formula: score ~ Machine + (1 | Worker)
   Data: Machines

REML criterion at convergence: 286.9

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.7249 -0.5233  0.1328  0.6513  1.7559 

Random effects:
 Groups   Name        Variance Std.Dev.
 Worker   (Intercept) 26.487   5.147   
 Residual              9.996   3.162   
Number of obs: 54, groups:  Worker, 6

Fixed effects:
            Estimate Std. Error t value
(Intercept)   52.356      2.229  23.485
MachineB       7.967      1.054   7.559
MachineC      13.917      1.054  13.205

Correlation of Fixed Effects:
         (Intr) MachnB
MachineB -0.236       
MachineC -0.236  0.500

Mixed-Effects Models

  • This output looks very different
  • We have Random effects and Fixed Effects
  • No \(p\)-values!!!
  • The intercept gives the average score for Machine A
  • The random effects indicate how these scores vary across workers
  • We have an estimate and a standard error \(\implies t\)-statistic
    • Degrees of freedom are controversial for mixed-effects models
    • Can’t determine the \(t\)-distribution to use for \(p\)-values
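  • From the Random effects table above we can estimate the proportion of the total variance sitting between workers (the intraclass correlation); a minimal sketch using VarCorr() from lme4:
vc <- as.data.frame(VarCorr(machines_lmer))
## Worker variance / (Worker variance + Residual variance)
vc$vcov[1] / sum(vc$vcov) # ~26.5 / (26.5 + 10.0), i.e. ~0.73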

Mixed-Effects Models

  • A compromise is to load lmerTest before fitting
  • Provides \(p\)-values
  • Discussion is beyond the scope of this course (& my knowledge)
library(lmerTest)
machines_lmer <- lmer(score ~ Machine + (1 | Worker), data = Machines)
summary(machines_lmer) |> coef()
             Estimate Std. Error        df  t value     Pr(>|t|)
(Intercept) 52.355556   2.229312  5.833185 23.48507 5.328296e-07
MachineB     7.966667   1.053882 46.000000  7.55935 1.329432e-09
MachineC    13.916667   1.053882 46.000000 13.20514 2.933231e-17

Generalised Mixed-Effects Models

  • GLMMs are also heavily used in ecology \(\implies\) require genuine understanding
    • Very heavy on linear algebra
  • Ben Bolker is the undisputed GLMM heavyweight
  • lme4 has implemented glmer()
  • Additional GLMM fitting in the packages glmmADMB and glmmTMB
    • Useful for challenging datasets
  • Doug Bates (nlme) was a member of R Core but has left the R community
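  • glmer() uses the same formula syntax; the model below is the example from the lme4 documentation, fitting binomial proportions with a random intercept for each herd
data(cbpp, package = "lme4")
glmer(
  cbind(incidence, size - incidence) ~ period + (1 | herd),
  data = cbpp, family = binomial
)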