Extended Statistics In R

RAdelaide 2025

Dr Stevie Pederson

Black Ochre Data Labs
The Kids Research Institute Australia

July 10, 2025

Beyond Linear Regression

Beyond Linear Regression

  • Linear Regression relies on Normally Distributed residuals
    • We estimate fitted values, and the data are assumed to be normally distributed around these estimates
  • This relies on the mean (\(\mu\)) and standard deviation (\(\sigma\)) being independent
    • And on data being continuous
  • Some data is not normally distributed
    • Discrete data
    • Probabilities and proportions

Generalised Linear Models

  • We often use a generalisation of linear modelling for these types of data
    • Formally known as Generalised Linear Models (GLMs)
  • GLMs cannot be fit using Least Squares
    • Parameters are estimated by maximising the likelihood function
    • Iterative solutions are used until the estimates converge
  • Always have an underlying distribution used for parameter estimation
    • Poisson Distribution: Log-linear Model
    • Binomial Distribution: Logistic Regression

Logistic Regression

  • Logistic regression estimates the probability (\(\pi\)) of an event occurring
    • A binary outcome \(\implies\) Success Vs Failure
  • Can also be considered a classification problem
  • Heavily used in machine learning
  • Probabilities are bounded on [0, 1]
    • This makes fitting challenging near the boundaries
  • Probabilities are transformed using the logit transformation
    • Transforms values on [0, 1] to \([-\infty, \infty]\)

The Logit Transformation

\[ \text{logit}(\pi) = \log \frac{\pi}{1-\pi} \]

  • \(\pi = 0.5 \rightarrow \text{logit}(\pi) = \log \frac{0.5}{0.5} = 0\)
  • \(\pi = 0 \rightarrow \text{logit}(\pi) = \log \frac{0}{1} = -\infty\)
  • \(\pi = 1 \rightarrow \text{logit}(\pi) = \log \frac{1}{0} = \infty\)
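  • Base R implements this pair of functions: qlogis() is the logit and plogis() its inverse, so we can sanity-check the values above
## qlogis() performs the logit transformation
qlogis(c(0.25, 0.5, 0.75)) # Symmetric around zero
## plogis() is the inverse, returning values on [0, 1]
plogis(0)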

The Logit Transformation

  • Parameters from logistic regression estimate \(\text{logit}(\pi)\)
  • Estimates near zero \(\implies \pi = 0.5\)
  • Estimates above zero \(\implies \pi > 0.5\)
  • Estimates below zero \(\implies \pi < 0.5\)

An Example Of Logistic Regression

  • A perfect example for logistic regression \(\implies\) survivors of the Titanic!
  • Available in the package carData
    • If you don’t have this installed: install.packages(c("car", "carData"))
    • The package car is the Companion to Applied Regression (not motor vehicles)
  • Start a new R session and new R script
library(tidyverse)
library(carData)
theme_set(theme_bw())

The Titanic Survivors

  • Let’s tidy up the data first
titanic_tbl <- TitanicSurvival |>
  as_tibble(rownames = "name") |> # Tibbles are nice. Keep the passenger names
  dplyr::filter(if_all(everything(), \(x) !is.na(x))) # Discard incomplete records

The Titanic Survivors

  • And a quick preview using only the categorical variables
titanic_tbl |>
  count(survived, sex, passengerClass) |>
  pivot_wider(names_from = survived, values_from = n) |>
  mutate(
    prob = yes / (no + yes)
  )
# A tibble: 6 × 5
  sex    passengerClass    no   yes  prob
  <fct>  <fct>          <int> <int> <dbl>
1 female 1st                5   128 0.962
2 female 2nd               11    92 0.893
3 female 3rd               80    72 0.474
4 male   1st               98    53 0.351
5 male   2nd              135    23 0.146
6 male   3rd              290    59 0.169
  • It looks rather like sex and class might play a role

The Titanic Survivors

  • For logistic regression, we need to specify the underlying distribution
    \(\implies\) family = binomial
  • This tells glm we are fitting binomial probabilities
## Fit a model using sex, age and passengerClass to predict survival probability
titanic_glm <- glm(
  survived ~ sex + age + passengerClass, data = TitanicSurvival, family = binomial
)

The Titanic Survivors

summary(titanic_glm)

Call:
glm(formula = survived ~ sex + age + passengerClass, family = binomial, 
    data = TitanicSurvival)

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)        3.522074   0.326702  10.781  < 2e-16 ***
sexmale           -2.497845   0.166037 -15.044  < 2e-16 ***
age               -0.034393   0.006331  -5.433 5.56e-08 ***
passengerClass2nd -1.280570   0.225538  -5.678 1.36e-08 ***
passengerClass3rd -2.289661   0.225802 -10.140  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1414.62  on 1045  degrees of freedom
Residual deviance:  982.45  on 1041  degrees of freedom
  (263 observations deleted due to missingness)
AIC: 992.45

Number of Fisher Scoring iterations: 4
  • How would we interpret these coefficients?
  • What does the Intercept represent?

The Titanic Survivors

  • For a 50-year-old woman in 1st Class
x <- c(1, 0, 50, 0, 0) # (Intercept), sexmale, age, passengerClass2nd, passengerClass3rd
logit_pi <- sum(x * coef(titanic_glm))
logit_pi
[1] 1.802412
  • This is > 0 \(\implies\) >50% chance of survival
inv_logit <- \(x) {exp(x) / (1 + exp(x))} # Convert back to probabilities
inv_logit(logit_pi) # The actual probability of survival
[1] 0.8584423
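  • The same probability can be obtained directly from predict(), here using a hypothetical new passenger matching the example above
new_passenger <- data.frame(sex = "female", age = 50, passengerClass = "1st")
predict(titanic_glm, newdata = new_passenger, type = "response") # Should match inv_logit(logit_pi)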

Making Predictions

  • We could obtain a probability for each passenger
    • This might be considered training the model
titanic_tbl |>
  ## Setting type = "response" will transform back to [0,1]
  mutate(prob_survival = predict(titanic_glm, type = "response")) |>
  arrange(desc(prob_survival))
# A tibble: 1,046 × 6
   name                        survived sex     age passengerClass prob_survival
   <chr>                       <fct>    <fct> <dbl> <fct>                  <dbl>
 1 Allison, Miss. Helen Lorai… no       fema…     2 1st                    0.969
 2 Carter, Miss. Lucile Polk   yes      fema…    14 1st                    0.954
 3 Madill, Miss. Georgette Al… yes      fema…    15 1st                    0.953
 4 Hippach, Miss. Jean Gertru… yes      fema…    16 1st                    0.951
 5 Lines, Miss. Mary Conover   yes      fema…    16 1st                    0.951
 6 Maioni, Miss. Roberta       yes      fema…    16 1st                    0.951
 7 Dick, Mrs. Albert Adrian (… yes      fema…    17 1st                    0.950
 8 Penasco y Castellana, Mrs.… yes      fema…    17 1st                    0.950
 9 Astor, Mrs. John Jacob (Ma… yes      fema…    18 1st                    0.948
10 Marvin, Mrs. Daniel Warner… yes      fema…    18 1st                    0.948
# ℹ 1,036 more rows

Checking Predictions

  • If building a predictive model we often use an ROC curve to check accuracy
    • Order by probability then plot the True/False Positive Rates
titanic_tbl |>
  mutate(prob_survival = predict(titanic_glm, type = "response")) |>
  arrange(desc(prob_survival)) |>
  mutate(
    ## Calculate the True Positive Rate as we step through the predictions
    TPR = cumsum(survived == "yes") / sum(survived == "yes"),
    ## And the False Positive Rate as we step through in order
    FPR = cumsum(survived == "no") / sum(survived == "no")
  ) |>
  ggplot(aes(FPR, TPR)) +
  geom_line() +
  geom_abline(slope = 1)
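  • The Area Under the Curve (AUC) summarises this in a single number: 0.5 is random guessing, 1 is perfect classification. A minimal sketch using the trapezoidal rule on the same values (the name roc_df is just for illustration):
roc_df <- titanic_tbl |>
  mutate(prob_survival = predict(titanic_glm, type = "response")) |>
  arrange(desc(prob_survival)) |>
  mutate(
    TPR = cumsum(survived == "yes") / sum(survived == "yes"),
    FPR = cumsum(survived == "no") / sum(survived == "no")
  )
## Trapezoidal approximation to the area under the ROC curve
with(roc_df, sum(diff(FPR) * (head(TPR, -1) + tail(TPR, -1)) / 2))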

Checking Predictions

Testing Multiple Models

list(
  base_model = glm(
    survived ~ sex + age + passengerClass, data = titanic_tbl, family = binomial
  ),
  interactions = glm(
    survived ~ (sex + age + passengerClass)^2, data = titanic_tbl, family = binomial 
  )
) |>
  lapply(\(x) mutate(titanic_tbl, pi = predict(x, type = "response"))) |>
  bind_rows(.id = "model") |>
  arrange(model, desc(pi)) |>
  mutate(
    TPR = cumsum(survived == "yes") / sum(survived == "yes"),
    FPR = cumsum(survived == "no") / sum(survived == "no"),
    .by = model
  ) |>
  ggplot(aes(FPR, TPR, colour = model)) +
  geom_line() +
  geom_abline(slope = 1)

Testing Multiple Models

A Brief Comment

  • If using logistic regression in this context:
    • Usually break data into a training & test set
    • Assess the performance on the test set after training
  • This forms the basis of some Machine Learning techniques
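  • A minimal sketch of that workflow, assuming a simple 80/20 random split (the names train, test and accuracy are illustrative)
set.seed(101)
train <- titanic_tbl |> slice_sample(prop = 0.8)
test <- anti_join(titanic_tbl, train, by = "name")
fit <- glm(survived ~ sex + age + passengerClass, data = train, family = binomial)
test |>
  mutate(
    prob = predict(fit, newdata = test, type = "response"),
    predicted = ifelse(prob > 0.5, "yes", "no") # Classify using a 0.5 cutoff
  ) |>
  summarise(accuracy = mean(predicted == survived))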

Poisson Regression

  • Another common type of GLM is Poisson regression
  • Fitting the rate of an event occurring (\(\lambda\))
    • e.g. cars at an intersection per hour
    • Rates are always > 0 \(\implies\) estimated on the log scale
  • Poisson distributions have a variance that strictly equals the rate
  • Overdispersed distributions allow the variance to exceed the rate
    • Simply add an overdispersion term to account for this
    • Very common in bioinformatics, e.g. RNA-Seq

Poisson Regression

  • InsectSprays provides counts of insects after applying insecticides A-F
  • The units used for counting were an equal size (but are undefined)
    • We can assume fewer insects is better
InsectSprays |> 
  summarise(
    ave = mean(count), .by = spray
  )
  spray       ave
1     A 14.500000
2     B 15.333333
3     C  2.083333
4     D  4.916667
5     E  3.500000
6     F 16.666667

Poisson Regression

insect_glm <- glm(count ~ spray, data = InsectSprays, family = poisson)
summary(insect_glm)

Call:
glm(formula = count ~ spray, family = poisson, data = InsectSprays)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.67415    0.07581  35.274  < 2e-16 ***
sprayB       0.05588    0.10574   0.528    0.597    
sprayC      -1.94018    0.21389  -9.071  < 2e-16 ***
sprayD      -1.08152    0.15065  -7.179 7.03e-13 ***
sprayE      -1.42139    0.17192  -8.268  < 2e-16 ***
sprayF       0.13926    0.10367   1.343    0.179    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 409.041  on 71  degrees of freedom
Residual deviance:  98.329  on 66  degrees of freedom
AIC: 376.59

Number of Fisher Scoring iterations: 5
  • How do we interpret these coefficients?
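  • Because the model is fit on the log scale, exponentiating the coefficients returns values to the scale of the counts, recovering the group means from earlier
exp(coef(insect_glm)["(Intercept)"]) # Mean count for spray A (~14.5)
## The other coefficients are log fold-changes relative to spray A
exp(coef(insect_glm)["(Intercept)"] + coef(insect_glm)["sprayC"]) # Mean count for spray C (~2.08)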

Poisson Regression

  • Model checking for Poisson models
    • Do residuals appear consistently distributed?
plot(
  fitted(insect_glm), 
  residuals(insect_glm, type = "pearson")
)
abline(h = 0, col = "red")

Poisson Regression

  • Model checking for Poisson models
    • Is there evidence of overdispersion?
deviance(insect_glm) / df.residual(insect_glm)
[1] 1.489828
  • If this value is >>1 \(\implies\) overdispersion
  • Maybe try a quasipoisson instead of poisson?
    • family = quasipoisson
    • Will probably be more conservative
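  • A sketch of the quasi-Poisson refit: the coefficient estimates are unchanged, but the standard errors are scaled by the estimated dispersion
insect_quasi <- glm(count ~ spray, data = InsectSprays, family = quasipoisson)
summary(insect_quasi)$dispersion # Estimated from the data instead of fixed at 1
summary(insect_quasi)$coefficients # Same estimates, larger standard errors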

A Brief Comment

  • Early RNA-Seq analysis used Poisson models \(\implies\) counts per gene
  • May also be appropriate for matching sequence motifs across a set of DNA sequences
  • If the units of counting are of unequal size \(\implies\) use an offset, as sketched below
    • Counting trees in unevenly sized forests?
    • Number of crimes in cities with different-sized populations
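  • As a sketch with hypothetical data, the offset log(area) turns the counts into rates per unit area
## Hypothetical counts of trees across forests of unequal area
forests <- data.frame(
  trees = c(120, 310, 45, 80),
  area = c(10, 25, 5, 8),
  region = c("north", "north", "south", "south")
)
glm(trees ~ region + offset(log(area)), data = forests, family = poisson)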

Mixed-Effects Models

Mixed-Effects Models

  • Mixed-effects models are very common within ecological research
    • Need to understand the difference between fixed and random effects
  • Linear Regression fits fixed effects
    • Assumes the estimated effect is effectively the same across experiments
    • Reproducible
  • Random effects are the effects of variables whose levels are not fixed
    • e.g. the effect of a particular site on a response variable
    • This effect is not fixed; it varies across sites
    • We can model this variation using random effects

Mixed-Effects Models

  • How do we model random effects?
  • I learned these as nested models fitted within distinct groups defined by random variables
  • Could be the date of a delivery
    \(\implies\) can be observed and modelled but not reproduced
    • The Intercept term for each group (\(i\)) is \(\mu_i \sim \mathcal{N}(\mu, \sigma_{\mu})\)
    • Response terms follow standard regression assumptions e.g. \(e_{ij} \sim \mathcal{N}(0, \sigma)\)
  • We have two components to the variance
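  • For this random-intercept structure the components add, giving the total variance for observation \(j\) in group \(i\):

\[ \text{Var}(y_{ij}) = \sigma_{\mu}^2 + \sigma^2 \]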

Mixed-Effects Models

  • The syntax in R is super-clumsy
  • Most commonly, people use the package lme4 and the function lmer()
    • Early mixed-effects models used nlme \(\implies\) now largely superseded by lme4
library(lme4)
data(Machines, package = "nlme")
  • We are measuring productivity using machines A, B or C (fixed)
    • Nested within each of 6 factory workers
    • Each individual will have different productivity
      \(\implies\) measurable & modellable but random across the population

Mixed-Effects Models

  • To denote the nesting, or random effect, we use (1 | Worker)
    • Machine is specified as per usual for fixed effects
machines_lmer <- lmer(score ~ Machine + (1 | Worker), data = Machines)
summary(machines_lmer)
Linear mixed model fit by REML ['lmerMod']
Formula: score ~ Machine + (1 | Worker)
   Data: Machines

REML criterion at convergence: 286.9

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.7249 -0.5233  0.1328  0.6513  1.7559 

Random effects:
 Groups   Name        Variance Std.Dev.
 Worker   (Intercept) 26.487   5.147   
 Residual              9.996   3.162   
Number of obs: 54, groups:  Worker, 6

Fixed effects:
            Estimate Std. Error t value
(Intercept)   52.356      2.229  23.485
MachineB       7.967      1.054   7.559
MachineC      13.917      1.054  13.205

Correlation of Fixed Effects:
         (Intr) MachnB
MachineB -0.236       
MachineC -0.236  0.500

Mixed-Effects Models

  • This output looks very different
  • We have Random effects and Fixed Effects
  • No \(p\)-values!!!
  • The intercept gives the average score for Machine A
  • The random effects indicate how these scores vary across workers
  • We have an estimate and a standard error \(\implies t\)-statistic
    • Degrees of freedom are controversial for mixed-effects models
    • Can’t determine the \(t\)-distribution to use for \(p\)-values
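  • From the Random effects table above we can estimate the proportion of the total variance sitting between workers (the intraclass correlation); a minimal sketch using VarCorr() from lme4:
vc <- as.data.frame(VarCorr(machines_lmer))
## Worker variance / (Worker variance + Residual variance)
vc$vcov[1] / sum(vc$vcov) # ~26.5 / (26.5 + 10.0), i.e. ~0.73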

Mixed-Effects Models

  • A compromise is to load lmerTest before fitting
  • Provides \(p\)-values
  • Discussion is beyond the scope of this course (& my knowledge)
library(lmerTest)
machines_lmer <- lmer(score ~ Machine + (1 | Worker), data = Machines)
summary(machines_lmer) |> coef()
             Estimate Std. Error        df  t value     Pr(>|t|)
(Intercept) 52.355556   2.229312  5.833185 23.48507 5.328296e-07
MachineB     7.966667   1.053882 46.000000  7.55935 1.329432e-09
MachineC    13.916667   1.053882 46.000000 13.20514 2.933231e-17

Generalised Mixed-Effects Models

  • GLMMs are also heavily used in ecology \(\implies\) require genuine understanding
    • Very heavy on linear algebra
  • Ben Bolker is the undisputed GLMM heavyweight
  • lme4 has implemented glmer()
  • Additional GLMM fitting in the packages glmmADMB and glmmTMB
    • Useful for challenging datasets
  • Doug Bates (nlme) was a member of R Core but has left the R community
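  • glmer() uses the same formula syntax; the model below is the example from the lme4 documentation, fitting binomial proportions with a random intercept for each herd
data(cbpp, package = "lme4")
glmer(
  cbind(incidence, size - incidence) ~ period + (1 | herd),
  data = cbpp, family = binomial
)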