Linear Regression in R

RAdelaide 2025

Dr Stevie Pederson

Black Ochre Data Labs
The Kids Research Institute Australia

July 9, 2025

Linear Regression

Setup

  • Start clear R session
  • Create a blank R script: LinearRegression.R
  • Load the key packages
library(tidyverse)
library(palmerpenguins)
theme_set(
  theme_bw() + theme(plot.title = element_text(hjust = 0.5))
)

Linear Regression

We are trying to estimate a line with slope & intercept

\[ y = ax + b \]

Or

\[ y = \beta_0 + \beta_1 x \]

  • \(y\) is the response variable
  • \(x\) is the predictor variable
  • Makes the most intuitive sense when both \(x\) & \(y\) are continuous

Linear Regression

Can extend to multi-dimensional predictors

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \]

Or even

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \ldots + \beta_n x_n \]

Linear Regression

Linear Regression always uses the R formula syntax

  • y ~ x: y depends on x
  • We use the function lm()
  • Once we have our model, we can predict \(y\) based on \(x\) values
  • We’ll use the penguins dataset for some exploration

Linear Regression

## Does bill length depend on body mass?
## Plot the predictor (body mass) on `x`
## The response (bill_length) goes on `y`
penguins |>
  ggplot(
    aes(body_mass_g, bill_length_mm)
  ) +
  geom_point() +
  geom_smooth(method = "lm")

Linear Regression

  • Bill length is the response variable (\(y\))
  • Body Mass is the predictor variable (\(x\))
## Fit a linear regression model and save the output as an R object
bill_length_lm <- lm(bill_length_mm ~ body_mass_g, data = penguins)
bill_length_lm

Call:
lm(formula = bill_length_mm ~ body_mass_g, data = penguins)

Coefficients:
(Intercept)  body_mass_g  
  26.898872     0.004051  

Linear Regression

## Use `summary` to view and interpret the fitted model
summary(bill_length_lm)

Call:
lm(formula = bill_length_mm ~ body_mass_g, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.1251  -3.0434  -0.8089   2.0711  16.1109 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.690e+01  1.269e+00   21.19   <2e-16 ***
body_mass_g 4.051e-03  2.967e-04   13.65   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.394 on 340 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.3542,    Adjusted R-squared:  0.3523 
F-statistic: 186.4 on 1 and 340 DF,  p-value: < 2.2e-16

Linear Regression

  1. The code used is returned as Call:
  2. A brief summary of the residuals are returned
  3. The fitted values are returned in the Coefficients element
    • We have the estimate of the intercept & slope
    • The standard error of the estimates
    • \(t\)-statistics testing \(H_0: \beta_i = 0\) along with \(p\)-values
  4. Additional model summary information
    • \(R^2\) is the proportion of variance explained by the model

Coefficients

How do we interpret the Intercept?

This is what the bill length would be if a penguin weighed exactly 0

  • We’d probably expect this to be zero but Intercepts almost never are
  • What does this really tell us about the relationship between bill length and weight?
  • Generally focussed on the relationship within the range of observed predictors
    • No guaranteed linear relationship outside of this range

Coefficients

How do we interpret the body_mass_g term?

This is how much we would expect the bill length to change for every one unit increase in the predictor
i.e. for every 1\(g\) increase in weight, the bill length would be expected to increase by about 0.0041mm

  • The \(t\)-test here is highly relevant
    • \(H_0\colon \beta_1 = 0~\) Vs \(~H_A\colon \beta_1 \neq 0\)
  • Reject \(H_0 \implies\) there is an association between predictor & response

Residuals

  • Points never lie exactly on the regression line \(\implies\) Why?
  • We’re actually fitting the model

\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]

  • \(\beta_0 + \beta_1 x_i\) is the exact line co-ordinate (Intercept + slope*predictor)
  • \(\epsilon_i\) is the the vertical difference between the observed value and the fitted value
    • Known as a residual \(\implies\) defined as \(\epsilon_i \sim \mathcal{N}(0, \sigma)\)

Assumptions Of Linear Regression

Linear Regression formally has 4 key assumptions

  1. Linear relationship between predictor and response
    • Mean of residuals is zero across the entire range
  2. Constant variance across the range of data (homoscedasticity)
  3. Residuals are normally distributed
  4. Independence of errors
  • Three of these are represented in the definition \(\epsilon_i \sim \mathcal{N}(0, \sigma)\)

Model Checking

  • To check our fitted model, we should check the residuals to ensure \(\epsilon_i \sim \mathcal{N}(0, \sigma)\)
## Check the residuals. It will ask you the following
## Hit <Return> to see next plot: 
## So follow the instructions and check each plot before proceeding
## There are 4 plots by default
plot(bill_length_lm)

Model Checking


  • Check the zero mean of \(\mathcal{N}(0, \sigma)\)
  • Is this assumption satisfied across the range of the data?

Model Checking


  • Check the normality of \(\mathcal{N}(0, \sigma)\)
  • The dashed line is the expected line from a normal distribution
  • Is this assumption satisfied across the range of the data?

Model Checking


  • Check the constant variance of \(\mathcal{N}(0, \sigma)\)
  • Is this assumption satisfied across the range of the data?

Model Checking


  • Checks if any points are exerting excessive ‘leverage’ on the model
  • Beyond scope of today

Model Checking

  • All of these figures used base plotting functions
  • Stepping through can be frustrating
    • Especially when running an automated script
  • Can plot all 4 at once using a simple trick
## Define a 2x2 grid, plotting figures in a row-wise manner
par(mfrow = c(2, 2))
## Plot the four regression diagnostic plots
plot(bill_length_lm)
## Reset the grid layout to be the default 1x1
par(mfrow = c(1, 1,))

Objects Of Class lm

  • The linear model we fitted produced an object of class lm
class(bill_length_lm)
[1] "lm"
  • This is also a list
## Inspect the actual object
glimpse(bill_length_lm)

Objects Of Class lm

  • We can use the list structure to inspect the residuals manually
## Plot a histogram of the residuals
hist(bill_length_lm$residuals)

Objects Of Class lm

  • We could even use the Shapiro-Wilk test for normality
shapiro.test(bill_length_lm$residuals)

    Shapiro-Wilk normality test

data:  bill_length_lm$residuals
W = 0.95439, p-value = 8.217e-09
  • How can we interpret all of this?
  • Maybe there’s a better model

Adding Terms

## Does bill length depend on body mass?
## Plot the predictor (body mass) on `x`
## The response (bill_length) goes on `y`
## Does each species need it's own model
penguins |>
  ggplot(
    aes(
      body_mass_g, bill_length_mm, 
      colour = species
    )
  ) +
  geom_point() +
  geom_smooth(method = "lm")

Adding Terms

## Include the species in the model
## NB: This will fit a separate intercept for each species
bill_length_sp_lm <- lm(bill_length_mm ~ species + body_mass_g, data = penguins) 
summary(bill_length_sp_lm)

Call:
lm(formula = bill_length_mm ~ species + body_mass_g, data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.8129 -1.6718  0.1336  1.4720  9.2902 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      2.492e+01  1.063e+00  23.443  < 2e-16 ***
speciesChinstrap 9.921e+00  3.511e-01  28.258  < 2e-16 ***
speciesGentoo    3.558e+00  4.858e-01   7.324 1.78e-12 ***
body_mass_g      3.749e-03  2.823e-04  13.276  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.403 on 338 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.808, Adjusted R-squared:  0.8063 
F-statistic:   474 on 3 and 338 DF,  p-value: < 2.2e-16

Adding Terms

## Check the residuals after 
## including a separate intercept
## for species
par(mfrow = c(2, 2))
plot(bill_length_sp_lm)
par(mfrow = c(1, 1))

Model Diagnostics

## Check the histogram of residuals
hist(bill_length_sp_lm$residuals, breaks = 20)
## Perform the Shapiro-Wilk test for Normality
shapiro.test(bill_length_sp_lm$residuals)

    Shapiro-Wilk normality test

data:  bill_length_sp_lm$residuals
W = 0.99317, p-value = 0.123

Interpreting the Coefficients

  • Now we’re happier with the model \(\rightarrow\) what do the coefficients mean?
                     Estimate   Std. Error   t value     Pr(>|t|)
(Intercept)      24.919470977 1.0630034684 23.442511 7.632983e-73
speciesChinstrap  9.920884113 0.3510790185 28.258265 5.093822e-91
speciesGentoo     3.557977539 0.4857896978  7.324111 1.776921e-12
body_mass_g       0.003748497 0.0002823439 13.276352 1.158990e-32
  • By default, the primary Intercept is the first factor level in species
levels(penguins$species)
[1] "Adelie"    "Chinstrap" "Gentoo"   

Interpreting the Coefficients

  • Now we’re happier with the model \(\rightarrow\) what do the coefficients mean?
                     Estimate   Std. Error   t value     Pr(>|t|)
(Intercept)      24.919470977 1.0630034684 23.442511 7.632983e-73
speciesChinstrap  9.920884113 0.3510790185 28.258265 5.093822e-91
speciesGentoo     3.557977539 0.4857896978  7.324111 1.776921e-12
body_mass_g       0.003748497 0.0002823439 13.276352 1.158990e-32
  • The baseline intercept is for Adelie penguins
  • Additional intercept terms are the differences between the baseline and each species
    • Does this check-out in our initial plot of the data?
  • They all appear significant when checking each \(H_0\)

Adding Terms

  • Do we think each species may have a different relationship between mass and bill length?
    • Do we need to fit a separate slope for each species?
  • This is done in R using an “interaction term” separating terms by :
    • species:body_mass_g
bill_length_int_lm <- lm(
  bill_length_mm ~ species + body_mass_g + species:body_mass_g, 
  data = penguins
) 
summary(bill_length_int_lm)

Adding Terms


Call:
lm(formula = bill_length_mm ~ species + body_mass_g + species:body_mass_g, 
    data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4208 -1.6461  0.0919  1.4718  9.3138 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  26.9941391  1.5926015  16.950  < 2e-16 ***
speciesChinstrap              5.1800537  3.2746719   1.582    0.115    
speciesGentoo                -0.2545907  2.7138655  -0.094    0.925    
body_mass_g                   0.0031879  0.0004271   7.464 7.27e-13 ***
speciesChinstrap:body_mass_g  0.0012748  0.0008740   1.459    0.146    
speciesGentoo:body_mass_g     0.0009030  0.0006066   1.489    0.138    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.399 on 336 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.8098,    Adjusted R-squared:  0.807 
F-statistic: 286.1 on 5 and 336 DF,  p-value: < 2.2e-16

Interpreting the Coefficients

  • The baseline terms for the Intercept and body_mass_g are now both for Adelie
  • Differences in slope for each species are provided as speciesChinstrap:body_mass_g and speciesGentoo:body_mass_g
  • The regression line for Adelie is y = 26.99 + 0.0032*body_mass_g

  • For Chinstrap: y = (26.99+5.18) + (0.0032+0.0013)*body_mass_g

Interpreting the Coefficients

  • The model matrix \(X\) is a key part of understanding statistics
  • The model coefficients (\(\hat{\beta}\)) are actually estimated by performing linear algebra on this
  • The least squares solution is \(\hat{\beta} = (X^TX)^{-1}X^Ty\)
y <- penguins$bill_length_mm[!is.na(penguins$body_mass_g)]
X <- model.matrix(~ species + body_mass_g + species:body_mass_g, data = penguins)

Interpreting the Coefficients

head(X)
  (Intercept) speciesChinstrap speciesGentoo body_mass_g
1           1                0             0        3750
2           1                0             0        3800
3           1                0             0        3250
5           1                0             0        3450
6           1                0             0        3650
7           1                0             0        3625
  speciesChinstrap:body_mass_g speciesGentoo:body_mass_g
1                            0                         0
2                            0                         0
3                            0                         0
5                            0                         0
6                            0                         0
7                            0                         0

Interpreting the Coefficients

## No need to save this, I'm just demonstrating a point
## The inverse of a matrix is found using `solve()`
## Matrix multiplication is performed using %*%
## The transpose of a matrix is found using `t()`
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
cbind(beta_hat, coef(bill_length_int_lm))
                                      [,1]          [,2]
(Intercept)                  26.9941391366 26.9941391367
speciesChinstrap              5.1800537287  5.1800537287
speciesGentoo                -0.2545906615 -0.2545906615
body_mass_g                   0.0031878758  0.0031878758
speciesChinstrap:body_mass_g  0.0012748183  0.0012748183
speciesGentoo:body_mass_g     0.0009029956  0.0009029956

Interpreting the Coefficients

  • An excellent primer on model matrices is here1
  • Conventional statistics always sets a baseline level for the intercept
    • Performs a \(t\)-test automatically for differences
    • Really quite convenient if less obvious
  • We could remove the common intercept using ~ 0 + species + body_mass_g.
    • Would return a separate intercept by removing the common (baseline) level
    • No real advantage except non-statisticians like it
  • No statistical evidence provided for differences in intercepts
    \(\implies\) have to test ourselves 😢

Model Diagnostics

## Check the residuals after 
## including a separate intercept
## and slope for species
par(mfrow = c(2, 2))
plot(bill_length_int_lm)
par(mfrow = c(1, 1))

Model Selection

  • How do we decide on the best model?
  • A common technique is Analysis of Variance (ANOVA)
  • Classic ANOVA checks importance of each term within a model
## Run a separate ANOVA on each model
anova(bill_length_lm)
anova(bill_length_sp_lm)
anova(bill_length_int_lm)
  • Does this give any clue as to the best model?

Model Selection

  • We have progressively added terms to each model
  • Can use ANOVA to compare suitability of each model
## Compare the models using ANOVA
## If a model is a significant improvement, the p-value from the 
## F test will be clearly significant
anova(
  bill_length_lm,
  bill_length_sp_lm,
  bill_length_int_lm
)
  • It looks like the separate slopes are not an improvement
  • The separate intercepts are an improvement

Model Selection

Analysis of Variance Table

Model 1: bill_length_mm ~ body_mass_g
Model 2: bill_length_mm ~ species + body_mass_g
Model 3: bill_length_mm ~ species + body_mass_g + species:body_mass_g
  Res.Df    RSS Df Sum of Sq        F Pr(>F)    
1    340 6564.5                                 
2    338 1952.0  2    4612.5 400.8045 <2e-16 ***
3    336 1933.4  2      18.6   1.6159 0.2003    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Speeding The Process Up

  • This was a careful breakdown of finding the best model
  • We can partially automate this and use some shortcuts
  • The shorthand for an interaction term with separate intercepts is *
## Refit the model with an interaction term using the shorthand code
bill_length_int_lm <- lm(
  bill_length_mm ~ species * body_mass_g, data = penguins
)
summary(bill_length_int_lm)
  • Alternatively, all terms can be placed inside brackets & raised to a power
    • (species + body_mass_g)^2 would give two-way interactions

Speeding The Process Up

  • After specifying a ‘full’ model \(\implies\) use step() to remove redundant terms
    • Removes terms in a step-wise manner
    • Uses Akaike’s Information Criterion (AIC) to determine optimal model
    • Finds model with lowest AIC
    • AIC penalises the number of terms in a model
      \(\implies\) encourages simpler models
step(bill_length_int_lm)

Speeding The Process Up

Start:  AIC=604.42
bill_length_mm ~ species * body_mass_g

                      Df Sum of Sq    RSS    AIC
- species:body_mass_g  2    18.596 1952.0 603.69
<none>                             1933.4 604.42

Step:  AIC=603.69
bill_length_mm ~ species + body_mass_g

              Df Sum of Sq    RSS     AIC
<none>                     1952.0  603.69
- body_mass_g  1    1017.9 2969.9  745.22
- species      2    4612.5 6564.5 1014.48

Call:
lm(formula = bill_length_mm ~ species + body_mass_g, data = penguins)

Coefficients:
     (Intercept)  speciesChinstrap     speciesGentoo       body_mass_g  
       24.919471          9.920884          3.557978          0.003748  

Objects of Class summary.lm

  • The coefficients element of the basic lm object only had the fitted values
    • Not the std errors, t-tests or p-values
## Extract the coefficients directly from the linear model
bill_length_sp_lm$coefficients
     (Intercept) speciesChinstrap    speciesGentoo      body_mass_g 
    24.919470977      9.920884113      3.557977539      0.003748497 
  • These values are produced by the function summary()
## Extract the coefficients from the summary object
summary(bill_length_sp_lm)$coefficients
                     Estimate   Std. Error   t value     Pr(>|t|)
(Intercept)      24.919470977 1.0630034684 23.442511 7.632983e-73
speciesChinstrap  9.920884113 0.3510790185 28.258265 5.093822e-91
speciesGentoo     3.557977539 0.4857896978  7.324111 1.776921e-12
body_mass_g       0.003748497 0.0002823439 13.276352 1.158990e-32

Objects of Class summary.lm

## Check the class of the output from summary
summary(bill_length_sp_lm) |> class()
[1] "summary.lm"
## Prove conclusively that it is really a list
## with a class attribute stuck on it
summary(bill_length_sp_lm) |> is.list()
[1] TRUE
## See what attributes it has
summary(bill_length_sp_lm) |> attributes()
$names
 [1] "call"          "terms"         "residuals"     "coefficients" 
 [5] "aliased"       "sigma"         "df"            "r.squared"    
 [9] "adj.r.squared" "fstatistic"    "cov.unscaled"  "na.action"    

$class
[1] "summary.lm"

Objects of Class summary.lm

## Check the full details
summary(bill_length_sp_lm) |> glimpse()
List of 12
 $ call         : language lm(formula = bill_length_mm ~ species + body_mass_g, data = penguins)
 $ terms        :Classes 'terms', 'formula'  language bill_length_mm ~ species + body_mass_g
  .. ..- attr(*, "variables")= language list(bill_length_mm, species, body_mass_g)
  .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. ..- attr(*, "term.labels")= chr [1:2] "species" "body_mass_g"
  .. ..- attr(*, "order")= int [1:2] 1 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(bill_length_mm, species, body_mass_g)
  .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "factor" "numeric"
  .. .. ..- attr(*, "names")= chr [1:3] "bill_length_mm" "species" "body_mass_g"
 $ residuals    : Named num [1:342] 0.124 0.336 3.198 -1.152 0.699 ...
  ..- attr(*, "names")= chr [1:342] "1" "2" "3" "5" ...
 $ coefficients : num [1:4, 1:4] 24.91947 9.92088 3.55798 0.00375 1.063 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:4] "(Intercept)" "speciesChinstrap" "speciesGentoo" "body_mass_g"
  .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
 $ aliased      : Named logi [1:4] FALSE FALSE FALSE FALSE
  ..- attr(*, "names")= chr [1:4] "(Intercept)" "speciesChinstrap" "speciesGentoo" "body_mass_g"
 $ sigma        : num 2.4
 $ df           : int [1:3] 4 338 4
 $ r.squared    : num 0.808
 $ adj.r.squared: num 0.806
 $ fstatistic   : Named num [1:3] 474 3 338
  ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
 $ cov.unscaled : num [1:4, 1:4] 1.96e-01 -4.97e-03 6.36e-02 -5.11e-05 -4.97e-03 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:4] "(Intercept)" "speciesChinstrap" "speciesGentoo" "body_mass_g"
  .. ..$ : chr [1:4] "(Intercept)" "speciesChinstrap" "speciesGentoo" "body_mass_g"
 $ na.action    : 'omit' Named int [1:2] 4 272
  ..- attr(*, "names")= chr [1:2] "4" "272"
 - attr(*, "class")= chr "summary.lm"
  • The full complexity of the object is mostly irrelevant

Using A More Tidyverse Friendly Approach

  • The function tidy() from the package broom is a catch-all function
    • Will return a tibble
    • Returns the same from lm and summary.lm objects
bill_length_sp_lm |> broom::tidy()
# A tibble: 4 × 5
  term             estimate std.error statistic  p.value
  <chr>               <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)      24.9      1.06         23.4  7.63e-73
2 speciesChinstrap  9.92     0.351        28.3  5.09e-91
3 speciesGentoo     3.56     0.486         7.32 1.78e-12
4 body_mass_g       0.00375  0.000282     13.3  1.16e-32

Adding Significance Stars

  • The easiest way for me as a case_when()
bill_length_sp_lm |> 
  broom::tidy() |>
  mutate(
    stars = case_when(
      p.value < 0.001 ~ "***",
      p.value < 0.01 ~ "**",
      p.value < 0.05 ~ "*",
      p.value < 0.1 ~ ".",
      TRUE ~ ""
    )
  )
# A tibble: 4 × 6
  term             estimate std.error statistic  p.value stars
  <chr>               <dbl>     <dbl>     <dbl>    <dbl> <chr>
1 (Intercept)      24.9      1.06         23.4  7.63e-73 ***  
2 speciesChinstrap  9.92     0.351        28.3  5.09e-91 ***  
3 speciesGentoo     3.56     0.486         7.32 1.78e-12 ***  
4 body_mass_g       0.00375  0.000282     13.3  1.16e-32 ***  

Or Fitting a Model On The Fly

  • Obviously no room for checking model diagnostics
penguins |>
  ## Piped data can be recalled using `_`
  lm( bill_length_mm ~ species * body_mass_g, data = _) |> 
  step() |> # Fit the best model using the AIC
  broom::tidy() |> # Turn the output into a tibble & add stars
  mutate(
    stars = case_when(
      p.value < 0.001 ~ "***",
      p.value < 0.01 ~ "**",
      p.value < 0.05 ~ "*",
      p.value < 0.1 ~ ".",
      TRUE ~ ""
    )
  )

Or Fitting a Model On The Fly

Start:  AIC=604.42
bill_length_mm ~ species * body_mass_g

                      Df Sum of Sq    RSS    AIC
- species:body_mass_g  2    18.596 1952.0 603.69
<none>                             1933.4 604.42

Step:  AIC=603.69
bill_length_mm ~ species + body_mass_g

              Df Sum of Sq    RSS     AIC
<none>                     1952.0  603.69
- body_mass_g  1    1017.9 2969.9  745.22
- species      2    4612.5 6564.5 1014.48
# A tibble: 4 × 6
  term             estimate std.error statistic  p.value stars
  <chr>               <dbl>     <dbl>     <dbl>    <dbl> <chr>
1 (Intercept)      24.9      1.06         23.4  7.63e-73 ***  
2 speciesChinstrap  9.92     0.351        28.3  5.09e-91 ***  
3 speciesGentoo     3.56     0.486         7.32 1.78e-12 ***  
4 body_mass_g       0.00375  0.000282     13.3  1.16e-32 ***  

S3 Method Dispatch

S3 Method Dispatch

  • Objects with class lm and summary.lm are S3 objects
  • Very informal class structure in R
  • Easy to work with \(\implies\) easy to break
  • When we printed these objects \(\implies\) print.lm() or print.summary.lm()
  • Likewise when we plotted the lm object \(\implies\) plot.lm()
?plot.lm

S3 Method Dispatch

  • If we only want a subset of figures
par(mfrow = c(1, 2))
plot(
  bill_length_sp_lm, which = c(1, 3),
  caption = list(
    "Separate Intercepts: Residuals vs Fitted", "",
    "Separate Intercepts: Scale-Location"
  )
)
par(mfrow = c(1, 1))

S3 Method Dispatch

  • We can also check what we pass to broom::tidy()
?broom::tidy.lm
  • broom::tidy() has multiple versions for objects of different classes
    • e.g. broom::tidy.prcomp() for PCA results

S3 Method Dispatch

  • S3 objects can be easily broken
broken_lm <- bill_length_sp_lm
class(broken_lm) <- "htest"
broken_lm



data:  
  • R looked for the print.htest method
  • The structure didn’t match what was expected \(\implies\) nonsense output

Confidence Intervals

  • For confidence or prediction intervals, we can use predict.lm()

What’s the difference between a confidence and prediction interval

  • 95% Confidence Intervals: The mean will be in the given interval 95% of the time
  • 95% Prediction Intervals: An observation will be in the given interval 95% of the time
  • We will need a new data frame to make the predictions about

Confidence Intervals

## Create some new penguins
new_penguins <- tibble(
  species = c("Adelie", "Gentoo"), 
  body_mass_g = 4500
)
## Predict their mean bill length
predict(bill_length_sp_lm, newdata = new_penguins, interval = "confidence")
       fit      lwr      upr
1 41.78771 41.20030 42.37512
2 45.34568 44.81277 45.87860
## Predict the range a new observation may lie in
predict(bill_length_sp_lm, newdata = new_penguins, interval = "prediction")
       fit      lwr      upr
1 41.78771 37.02436 46.55105
2 45.34568 40.58876 50.10261

Confidence Intervals

  • We may wish to include our new penguins in these results
## Predict their mean bill length, but include the original data
new_penguins |>
  bind_cols(
    predict(bill_length_sp_lm, newdata = new_penguins, interval = "confidence")
  )
# A tibble: 2 × 5
  species body_mass_g   fit   lwr   upr
  <chr>         <dbl> <dbl> <dbl> <dbl>
1 Adelie         4500  41.8  41.2  42.4
2 Gentoo         4500  45.3  44.8  45.9

Confidence Intervals

  • Confidence Intervals for model terms can also be found using broom::tidy()
## Use broom:tidy to find confidence intervals for coefficients
bill_length_sp_lm |>
  broom::tidy(conf.int = TRUE) 
# A tibble: 4 × 7
  term             estimate std.error statistic  p.value conf.low conf.high
  <chr>               <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)      24.9      1.06         23.4  7.63e-73 22.8      27.0    
2 speciesChinstrap  9.92     0.351        28.3  5.09e-91  9.23     10.6    
3 speciesGentoo     3.56     0.486         7.32 1.78e-12  2.60      4.51   
4 body_mass_g       0.00375  0.000282     13.3  1.16e-32  0.00319   0.00430

Closing Comments

Additional Plotting Comments

  • I rarely show diagnostic plots:
    • For me when fitting data & performing analysis
    • No need for ggplot versions

tibble(
  fitted= fitted(bill_length_sp_lm), 
  residuals = resid(bill_length_sp_lm)
) |> 
  ggplot(aes(fitted, residuals)) + 
  geom_point() + 
  geom_smooth(
    se = FALSE, colour = 'red', 
    linewidth = 0.5
    ) +
  ggtitle("Residuals vs Fitted") +
  labs(
    x = "Fitted Values", y = "Residuals"
  ) 

tibble(
  residuals = rstandard(bill_length_sp_lm)
) |> 
  ggplot(aes(sample = residuals)) + 
  geom_qq() +
  geom_qq_line() +
  ggtitle("Q-Q Residuals") +
  labs(
    x = "Theoretical Quantiles", 
    y = "Standardized Residuals"
  ) 

Additional Plotting Packages

  • Multiple options for great looking plots
    • ggpmisc for adding correlations, regression equations etc
    • ggstats for multiple fresh perspectives on coefficients
    • ggstatsplot also a wide range of plotting capabilities

Challenges

  1. How could we account for sex in the existing penguins model?
  2. Load the pigs dataset
  3. Make a boxplot:
    • len will be the predictor (\(y\))
    • Place dose on the \(x\)-axis using supp to fill the boxes
  4. Find the best model for len, using supp and dose as the predictors
  5. Check the residuals and diagnostic plots
  6. Make sure you can interpret the model coefficients
  7. What is the 99% confidence interval for supp = "VC" & dose = "High"
    • What does this mean?