Linear Regression in R

RAdelaide 2025

Author

Affiliation

Dr Stevie Pederson

Black Ochre Data Labs
The Kids Research Institute Australia

Published

July 9, 2025

Linear Regression

Setup

Start clear R session
Create a blank R script: LinearRegression.R
Load the key packages

library(tidyverse)
library(palmerpenguins)
theme_set(
  theme_bw() + theme(plot.title = element_text(hjust = 0.5))
)

Linear Regression

We are trying to estimate a line with slope & intercept

\[ y = ax + b \]

\[ y = \beta_0 + \beta_1 x \]

\(y\) is the response variable
\(x\) is the predictor variable
Makes the most intuitive sense when both \(x\) & \(y\) are continuous

Linear Regression

If we imagine the predictors in two dimensions and the response as the 3rd \(\implies\) still a line…

Can extend to multi-dimensional predictors

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \]

Or even

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \ldots + \beta_n x_n \]

Linear Regression

Linear Regression always uses the R formula syntax

y ~ x: y depends on x
We use the function lm()
Once we have our model, we can predict \(y\) based on \(x\) values

We’ll use the penguins dataset for some exploration

Linear Regression

## Does bill length depend on body mass?
## Plot the predictor (body mass) on `x`
## The response (bill_length) goes on `y`
penguins |>
  ggplot(
    aes(body_mass_g, bill_length_mm)
  ) +
  geom_point() +
  geom_smooth(method = "lm")

There looks like a fairly clear linear relationship

Linear Regression

Bill length is the response variable (\(y\))
Body Mass is the predictor variable (\(x\))

## Fit a linear regression model and save the output as an R object
bill_length_lm <- lm(bill_length_mm ~ body_mass_g, data = penguins)
bill_length_lm


Call:
lm(formula = bill_length_mm ~ body_mass_g, data = penguins)

Coefficients:
(Intercept)  body_mass_g  
  26.898872     0.004051

Linear Regression

The basic output is only minimally informative
Often pass to the function summary() to present and interpret

## Use `summary` to view and interpret the fitted model
summary(bill_length_lm)


Call:
lm(formula = bill_length_mm ~ body_mass_g, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.1251  -3.0434  -0.8089   2.0711  16.1109 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.690e+01  1.269e+00   21.19   <2e-16 ***
body_mass_g 4.051e-03  2.967e-04   13.65   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.394 on 340 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.3542,    Adjusted R-squared:  0.3523 
F-statistic: 186.4 on 1 and 340 DF,  p-value: < 2.2e-16

Linear Regression

The code used is returned as Call:
A brief summary of the residuals are returned
The fitted values are returned in the Coefficients element
- We have the estimate of the intercept & slope
- The standard error of the estimates
- \(t\)-statistics testing \(H_0: \beta_i = 0\) along with \(p\)-values
Additional model summary information
- \(R^2\) is the proportion of variance explained by the model

Adjusted R-squared is the R-squared adjusted for the number of predictors
Mostly relevant when comparing models where predictors vary

Coefficients

How do we interpret the Intercept?

This is what the bill length would be if a penguin weighed exactly 0

We’d probably expect this to be zero but Intercepts almost never are
What does this really tell us about the relationship between bill length and weight?

Generally focussed on the relationship within the range of observed predictors
- No guaranteed linear relationship outside of this range

Coefficients

How do we interpret the body_mass_g term?

This is how much we would expect the bill length to change for every one unit increase in the predictor
i.e. for every 1\(g\) increase in weight, the bill length would be expected to increase by about 0.0041mm

The \(t\)-test here is highly relevant
- \(H_0\colon \beta_1 = 0~\) Vs \(~H_A\colon \beta_1 \neq 0\)
Reject \(H_0 \implies\) there is an association between predictor & response

Residuals

Points never lie exactly on the regression line \(\implies\) Why?

We’re actually fitting the model

\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]

\(\beta_0 + \beta_1 x_i\) is the exact line co-ordinate (Intercept + slope*predictor)
\(\epsilon_i\) is the the vertical difference between the observed value and the fitted value
- Known as a residual \(\implies\) defined as \(\epsilon_i \sim \mathcal{N}(0, \sigma)\)

Residual means the bit left over when we subtract the predicted value from the observed

Assumptions Of Linear Regression

Linear Regression formally has 4 key assumptions

Linear relationship between predictor and response
- Mean of residuals is zero across the entire range
Constant variance across the range of data (homoscedasticity)
Residuals are normally distributed
Independence of errors

Three of these are represented in the definition \(\epsilon_i \sim \mathcal{N}(0, \sigma)\)

Model Checking

To check our fitted model, we should check the residuals to ensure \(\epsilon_i \sim \mathcal{N}(0, \sigma)\)

## Check the residuals. It will ask you the following
## Hit <Return> to see next plot: 
## So follow the instructions and check each plot before proceeding
## There are 4 plots by default
plot(bill_length_lm)

Model Checking

Check the zero mean of \(\mathcal{N}(0, \sigma)\)
Is this assumption satisfied across the range of the data?

Model Checking

Check the normality of \(\mathcal{N}(0, \sigma)\)
The dashed line is the expected line from a normal distribution
Is this assumption satisfied across the range of the data?

Model Checking

Check the constant variance of \(\mathcal{N}(0, \sigma)\)
Is this assumption satisfied across the range of the data?

Model Checking

Checks if any points are exerting excessive ‘leverage’ on the model
Beyond scope of today

Model Checking

All of these figures used base plotting functions
Stepping through can be frustrating
- Especially when running an automated script

Can plot all 4 at once using a simple trick

## Define a 2x2 grid, plotting figures in a row-wise manner
par(mfrow = c(2, 2))
## Plot the four regression diagnostic plots
plot(bill_length_lm)
## Reset the grid layout to be the default 1x1
par(mfrow = c(1, 1,))

par() is a function to set graphical parameters
mfrow means ‘Multi-Figures By Row’
c(2, 2,) means plot in a 2x2 grid
I rarely show these so no need for ggplot versions

Objects Of Class `lm`

The linear model we fitted produced an object of class lm

class(bill_length_lm)

[1] "lm"

This is also a list

## Inspect the actual object
glimpse(bill_length_lm)

Objects Of Class `lm`

We can use the list structure to inspect the residuals manually

## Plot a histogram of the residuals
hist(bill_length_lm$residuals)

Objects Of Class `lm`

We could even use the Shapiro-Wilk test for normality

shapiro.test(bill_length_lm$residuals)


    Shapiro-Wilk normality test

data:  bill_length_lm$residuals
W = 0.95439, p-value = 8.217e-09

How can we interpret all of this?
Maybe there’s a better model

Adding Terms

## Does bill length depend on body mass?
## Plot the predictor (body mass) on `x`
## The response (bill_length) goes on `y`
## Does each species need it's own model
penguins |>
  ggplot(
    aes(
      body_mass_g, bill_length_mm, 
      colour = species
    )
  ) +
  geom_point() +
  geom_smooth(method = "lm")

Do we think these slopes are the same?
Do we think the intercepts are the same?

Adding Terms

## Include the species in the model
## NB: This will fit a separate intercept for each species
bill_length_sp_lm <- lm(bill_length_mm ~ species + body_mass_g, data = penguins) 
summary(bill_length_sp_lm)


Call:
lm(formula = bill_length_mm ~ species + body_mass_g, data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.8129 -1.6718  0.1336  1.4720  9.2902 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      2.492e+01  1.063e+00  23.443  < 2e-16 ***
speciesChinstrap 9.921e+00  3.511e-01  28.258  < 2e-16 ***
speciesGentoo    3.558e+00  4.858e-01   7.324 1.78e-12 ***
body_mass_g      3.749e-03  2.823e-04  13.276  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.403 on 338 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.808, Adjusted R-squared:  0.8063 
F-statistic:   474 on 3 and 338 DF,  p-value: < 2.2e-16

There is already a large increase in R-squared so this model may fit better

Adding Terms

## Check the residuals after 
## including a separate intercept
## for species
par(mfrow = c(2, 2))
plot(bill_length_sp_lm)
par(mfrow = c(1, 1))

Model Diagnostics

## Check the histogram of residuals
hist(bill_length_sp_lm$residuals, breaks = 20)

## Perform the Shapiro-Wilk test for Normality
shapiro.test(bill_length_sp_lm$residuals)


    Shapiro-Wilk normality test

data:  bill_length_sp_lm$residuals
W = 0.99317, p-value = 0.123

Interpreting the Coefficients

Now we’re happier with the model \(\rightarrow\) what do the coefficients mean?

                     Estimate   Std. Error   t value     Pr(>|t|)
(Intercept)      24.919470977 1.0630034684 23.442511 7.632983e-73
speciesChinstrap  9.920884113 0.3510790185 28.258265 5.093822e-91
speciesGentoo     3.557977539 0.4857896978  7.324111 1.776921e-12
body_mass_g       0.003748497 0.0002823439 13.276352 1.158990e-32

By default, the primary Intercept is the first factor level in species

levels(penguins$species)

[1] "Adelie"    "Chinstrap" "Gentoo"

Interpreting the Coefficients

Now we’re happier with the model \(\rightarrow\) what do the coefficients mean?

                     Estimate   Std. Error   t value     Pr(>|t|)
(Intercept)      24.919470977 1.0630034684 23.442511 7.632983e-73
speciesChinstrap  9.920884113 0.3510790185 28.258265 5.093822e-91
speciesGentoo     3.557977539 0.4857896978  7.324111 1.776921e-12
body_mass_g       0.003748497 0.0002823439 13.276352 1.158990e-32

The baseline intercept is for Adelie penguins
Additional intercept terms are the differences between the baseline and each species
- Does this check-out in our initial plot of the data?
They all appear significant when checking each \(H_0\)

Adding Terms

Do we think each species may have a different relationship between mass and bill length?
- Do we need to fit a separate slope for each species?
This is done in R using an “interaction term” separating terms by :
- species:body_mass_g

bill_length_int_lm <- lm(
  bill_length_mm ~ species + body_mass_g + species:body_mass_g, 
  data = penguins
) 
summary(bill_length_int_lm)


Call:
lm(formula = bill_length_mm ~ species + body_mass_g + species:body_mass_g, 
    data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4208 -1.6461  0.0919  1.4718  9.3138 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  26.9941391  1.5926015  16.950  < 2e-16 ***
speciesChinstrap              5.1800537  3.2746719   1.582    0.115    
speciesGentoo                -0.2545907  2.7138655  -0.094    0.925    
body_mass_g                   0.0031879  0.0004271   7.464 7.27e-13 ***
speciesChinstrap:body_mass_g  0.0012748  0.0008740   1.459    0.146    
speciesGentoo:body_mass_g     0.0009030  0.0006066   1.489    0.138    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.399 on 336 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.8098,    Adjusted R-squared:  0.807 
F-statistic: 286.1 on 5 and 336 DF,  p-value: < 2.2e-16

Now what do we think?

Interpreting the Coefficients

The baseline terms for the Intercept and body_mass_g are now both for Adelie
Differences in slope for each species are provided as speciesChinstrap:body_mass_g and speciesGentoo:body_mass_g

The regression line for Adelie is y = 26.99 + 0.0032*body_mass_g
For Chinstrap: y = (26.99+5.18) + (0.0032+0.0013)*body_mass_g

Interpreting the Coefficients

The model matrix \(X\) is a key part of understanding statistics
The model coefficients (\(\hat{\beta}\)) are actually estimated by performing linear algebra on this
The least squares solution is \(\hat{\beta} = (X^TX)^{-1}X^Ty\)

y <- penguins$bill_length_mm[!is.na(penguins$body_mass_g)]
X <- model.matrix(~ species + body_mass_g + species:body_mass_g, data = penguins)

Interpreting the Coefficients

head(X)

  (Intercept) speciesChinstrap speciesGentoo body_mass_g
1           1                0             0        3750
2           1                0             0        3800
3           1                0             0        3250
5           1                0             0        3450
6           1                0             0        3650
7           1                0             0        3625
  speciesChinstrap:body_mass_g speciesGentoo:body_mass_g
1                            0                         0
2                            0                         0
3                            0                         0
5                            0                         0
6                            0                         0
7                            0                         0

Interpreting the Coefficients

## No need to save this, I'm just demonstrating a point
## The inverse of a matrix is found using `solve()`
## Matrix multiplication is performed using %*%
## The transpose of a matrix is found using `t()`
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
cbind(beta_hat, coef(bill_length_int_lm))

                                      [,1]          [,2]
(Intercept)                  26.9941391366 26.9941391367
speciesChinstrap              5.1800537287  5.1800537287
speciesGentoo                -0.2545906615 -0.2545906615
body_mass_g                   0.0031878758  0.0031878758
speciesChinstrap:body_mass_g  0.0012748183  0.0012748183
speciesGentoo:body_mass_g     0.0009029956  0.0009029956

Interpreting the Coefficients

An excellent primer on model matrices is here ¹
Conventional statistics always sets a baseline level for the intercept
- Performs a \(t\)-test automatically for differences
- Really quite convenient if less obvious

We could remove the common intercept using ~ 0 + species + body_mass_g.
- Would return a separate intercept by removing the common (baseline) level
- No real advantage except non-statisticians like it
No statistical evidence provided for differences in intercepts
\(\implies\) have to test ourselves 😢

Model Diagnostics

## Check the residuals after 
## including a separate intercept
## and slope for species
par(mfrow = c(2, 2))
plot(bill_length_int_lm)
par(mfrow = c(1, 1))

Model Selection

How do we decide on the best model?
A common technique is Analysis of Variance (ANOVA)
Classic ANOVA checks importance of each term within a model

## Run a separate ANOVA on each model
anova(bill_length_lm)
anova(bill_length_sp_lm)
anova(bill_length_int_lm)

Does this give any clue as to the best model?

Model Selection

We have progressively added terms to each model
Can use ANOVA to compare suitability of each model

## Compare the models using ANOVA
## If a model is a significant improvement, the p-value from the 
## F test will be clearly significant
anova(
  bill_length_lm,
  bill_length_sp_lm,
  bill_length_int_lm
)

Analysis of Variance Table

Model 1: bill_length_mm ~ body_mass_g
Model 2: bill_length_mm ~ species + body_mass_g
Model 3: bill_length_mm ~ species + body_mass_g + species:body_mass_g
  Res.Df    RSS Df Sum of Sq        F Pr(>F)    
1    340 6564.5                                 
2    338 1952.0  2    4612.5 400.8045 <2e-16 ***
3    336 1933.4  2      18.6   1.6159 0.2003    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It looks like the separate slopes are not an improvement
The separate intercepts are an improvement

Speeding The Process Up

This was a careful breakdown of finding the best model
We can partially automate this and use some shortcuts

The shorthand for an interaction term with separate intercepts is *

## Refit the model with an interaction term using the shorthand code
bill_length_int_lm <- lm(
  bill_length_mm ~ species * body_mass_g, data = penguins
)
summary(bill_length_int_lm)

Alternatively, all terms can be placed inside brackets & raised to a power
- (species + body_mass_g)^2 would give two-way interactions

Speeding The Process Up

The AIC is a function of the number of model terms and the log likelihood function
Beyond the scope of today
Doesn’t perform well with highly correlated predictors

After specifying a ‘full’ model \(\implies\) use step() to remove redundant terms
- Removes terms in a step-wise manner
- Uses Akaike’s Information Criterion (AIC) to determine optimal model
- Finds model with lowest AIC
- AIC penalises the number of terms in a model
  \(\implies\) encourages simpler models

step(bill_length_int_lm)

Start:  AIC=604.42
bill_length_mm ~ species * body_mass_g

                      Df Sum of Sq    RSS    AIC
- species:body_mass_g  2    18.596 1952.0 603.69
<none>                             1933.4 604.42

Step:  AIC=603.69
bill_length_mm ~ species + body_mass_g

              Df Sum of Sq    RSS     AIC
<none>                     1952.0  603.69
- body_mass_g  1    1017.9 2969.9  745.22
- species      2    4612.5 6564.5 1014.48


Call:
lm(formula = bill_length_mm ~ species + body_mass_g, data = penguins)

Coefficients:
     (Intercept)  speciesChinstrap     speciesGentoo       body_mass_g  
       24.919471          9.920884          3.557978          0.003748

Objects of Class `summary.lm`

The coefficients element of the basic lm object only had the fitted values
- Not the std errors, t-tests or p-values

## Extract the coefficients directly from the linear model
bill_length_sp_lm$coefficients

     (Intercept) speciesChinstrap    speciesGentoo      body_mass_g 
    24.919470977      9.920884113      3.557977539      0.003748497

These values are produced by the function summary()

## Extract the coefficients from the summary object
summary(bill_length_sp_lm)$coefficients

                     Estimate   Std. Error   t value     Pr(>|t|)
(Intercept)      24.919470977 1.0630034684 23.442511 7.632983e-73
speciesChinstrap  9.920884113 0.3510790185 28.258265 5.093822e-91
speciesGentoo     3.557977539 0.4857896978  7.324111 1.776921e-12
body_mass_g       0.003748497 0.0002823439 13.276352 1.158990e-32

Objects of Class `summary.lm`

## Check the class of the output from summary
summary(bill_length_sp_lm) |> class()

[1] "summary.lm"

## Prove conclusively that it is really a list
## with a class attribute stuck on it
summary(bill_length_sp_lm) |> is.list()

[1] TRUE

## See what attributes it has
summary(bill_length_sp_lm) |> attributes()

$names
 [1] "call"          "terms"         "residuals"     "coefficients" 
 [5] "aliased"       "sigma"         "df"            "r.squared"    
 [9] "adj.r.squared" "fstatistic"    "cov.unscaled"  "na.action"    

$class
[1] "summary.lm"

Objects of Class `summary.lm`

## Check the full details
summary(bill_length_sp_lm) |> glimpse()

List of 12
 $ call         : language lm(formula = bill_length_mm ~ species + body_mass_g, data = penguins)
 $ terms        :Classes 'terms', 'formula'  language bill_length_mm ~ species + body_mass_g
  .. ..- attr(*, "variables")= language list(bill_length_mm, species, body_mass_g)
  .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. ..- attr(*, "term.labels")= chr [1:2] "species" "body_mass_g"
  .. ..- attr(*, "order")= int [1:2] 1 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(bill_length_mm, species, body_mass_g)
  .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "factor" "numeric"
  .. .. ..- attr(*, "names")= chr [1:3] "bill_length_mm" "species" "body_mass_g"
 $ residuals    : Named num [1:342] 0.124 0.336 3.198 -1.152 0.699 ...
  ..- attr(*, "names")= chr [1:342] "1" "2" "3" "5" ...
 $ coefficients : num [1:4, 1:4] 24.91947 9.92088 3.55798 0.00375 1.063 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:4] "(Intercept)" "speciesChinstrap" "speciesGentoo" "body_mass_g"
  .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
 $ aliased      : Named logi [1:4] FALSE FALSE FALSE FALSE
  ..- attr(*, "names")= chr [1:4] "(Intercept)" "speciesChinstrap" "speciesGentoo" "body_mass_g"
 $ sigma        : num 2.4
 $ df           : int [1:3] 4 338 4
 $ r.squared    : num 0.808
 $ adj.r.squared: num 0.806
 $ fstatistic   : Named num [1:3] 474 3 338
  ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
 $ cov.unscaled : num [1:4, 1:4] 1.96e-01 -4.97e-03 6.36e-02 -5.11e-05 -4.97e-03 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:4] "(Intercept)" "speciesChinstrap" "speciesGentoo" "body_mass_g"
  .. ..$ : chr [1:4] "(Intercept)" "speciesChinstrap" "speciesGentoo" "body_mass_g"
 $ na.action    : 'omit' Named int [1:2] 4 272
  ..- attr(*, "names")= chr [1:2] "4" "272"
 - attr(*, "class")= chr "summary.lm"

The full complexity of the object is mostly irrelevant

Using A More Tidyverse Friendly Approach

The function tidy() from the package broom is a catch-all function
- Will return a tibble
- Returns the same from lm and summary.lm objects

bill_length_sp_lm |> broom::tidy()

# A tibble: 4 × 5
  term             estimate std.error statistic  p.value
  <chr>               <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)      24.9      1.06         23.4  7.63e-73
2 speciesChinstrap  9.92     0.351        28.3  5.09e-91
3 speciesGentoo     3.56     0.486         7.32 1.78e-12
4 body_mass_g       0.00375  0.000282     13.3  1.16e-32

Adding Significance Stars

The easiest way for me as a case_when()

bill_length_sp_lm |> 
  broom::tidy() |>
  mutate(
    stars = case_when(
      p.value < 0.001 ~ "***",
      p.value < 0.01 ~ "**",
      p.value < 0.05 ~ "*",
      p.value < 0.1 ~ ".",
      TRUE ~ ""
    )
  )

# A tibble: 4 × 6
  term             estimate std.error statistic  p.value stars
  <chr>               <dbl>     <dbl>     <dbl>    <dbl> <chr>
1 (Intercept)      24.9      1.06         23.4  7.63e-73 ***  
2 speciesChinstrap  9.92     0.351        28.3  5.09e-91 ***  
3 speciesGentoo     3.56     0.486         7.32 1.78e-12 ***  
4 body_mass_g       0.00375  0.000282     13.3  1.16e-32 ***

Or Fitting a Model On The Fly

Obviously no room for checking model diagnostics

penguins |>
  ## Piped data can be recalled using `_`
  lm( bill_length_mm ~ species * body_mass_g, data = _) |> 
  step() |> # Fit the best model using the AIC
  broom::tidy() |> # Turn the output into a tibble & add stars
  mutate(
    stars = case_when(
      p.value < 0.001 ~ "***",
      p.value < 0.01 ~ "**",
      p.value < 0.05 ~ "*",
      p.value < 0.1 ~ ".",
      TRUE ~ ""
    )
  )

Start:  AIC=604.42
bill_length_mm ~ species * body_mass_g

                      Df Sum of Sq    RSS    AIC
- species:body_mass_g  2    18.596 1952.0 603.69
<none>                             1933.4 604.42

Step:  AIC=603.69
bill_length_mm ~ species + body_mass_g

              Df Sum of Sq    RSS     AIC
<none>                     1952.0  603.69
- body_mass_g  1    1017.9 2969.9  745.22
- species      2    4612.5 6564.5 1014.48

# A tibble: 4 × 6
  term             estimate std.error statistic  p.value stars
  <chr>               <dbl>     <dbl>     <dbl>    <dbl> <chr>
1 (Intercept)      24.9      1.06         23.4  7.63e-73 ***  
2 speciesChinstrap  9.92     0.351        28.3  5.09e-91 ***  
3 speciesGentoo     3.56     0.486         7.32 1.78e-12 ***  
4 body_mass_g       0.00375  0.000282     13.3  1.16e-32 ***

S3 Method Dispatch

Objects with class lm and summary.lm are S3 objects
Very informal class structure in R
Easy to work with \(\implies\) easy to break

When we printed these objects \(\implies\) print.lm() or print.summary.lm()
Likewise when we plotted the lm object \(\implies\) plot.lm()

?plot.lm

S3 Method Dispatch

If we only want a subset of figures

par(mfrow = c(1, 2))
plot(
  bill_length_sp_lm, which = c(1, 3),
  caption = list(
    "Separate Intercepts: Residuals vs Fitted", "",
    "Separate Intercepts: Scale-Location"
  )
)

par(mfrow = c(1, 1))

S3 Method Dispatch

We can also check what we pass to broom::tidy()

?broom::tidy.lm

broom::tidy() has multiple versions for objects of different classes
- e.g. broom::tidy.prcomp() for PCA results

S3 Method Dispatch

S3 objects can be easily broken

broken_lm <- bill_length_sp_lm
class(broken_lm) <- "htest"
broken_lm




data:

R looked for the print.htest method
The structure didn’t match what was expected \(\implies\) nonsense output

Confidence Intervals

For confidence or prediction intervals, we can use predict.lm()

What’s the difference between a confidence and prediction interval

95% Confidence Intervals: The mean will be in the given interval 95% of the time
95% Prediction Intervals: An observation will be in the given interval 95% of the time

We will need a new data frame to make the predictions about

Confidence Intervals

## Create some new penguins
new_penguins <- tibble(
  species = c("Adelie", "Gentoo"), 
  body_mass_g = 4500
)
## Predict their mean bill length
predict(bill_length_sp_lm, newdata = new_penguins, interval = "confidence")

       fit      lwr      upr
1 41.78771 41.20030 42.37512
2 45.34568 44.81277 45.87860

## Predict the range a new observation may lie in
predict(bill_length_sp_lm, newdata = new_penguins, interval = "prediction")

       fit      lwr      upr
1 41.78771 37.02436 46.55105
2 45.34568 40.58876 50.10261

Confidence Intervals

We may wish to include our new penguins in these results

## Predict their mean bill length, but include the original data
new_penguins |>
  bind_cols(
    predict(bill_length_sp_lm, newdata = new_penguins, interval = "confidence")
  )

# A tibble: 2 × 5
  species body_mass_g   fit   lwr   upr
  <chr>         <dbl> <dbl> <dbl> <dbl>
1 Adelie         4500  41.8  41.2  42.4
2 Gentoo         4500  45.3  44.8  45.9

Confidence Intervals

Confidence Intervals for model terms can also be found using broom::tidy()

## Use broom:tidy to find confidence intervals for coefficients
bill_length_sp_lm |>
  broom::tidy(conf.int = TRUE)

# A tibble: 4 × 7
  term             estimate std.error statistic  p.value conf.low conf.high
  <chr>               <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)      24.9      1.06         23.4  7.63e-73 22.8      27.0    
2 speciesChinstrap  9.92     0.351        28.3  5.09e-91  9.23     10.6    
3 speciesGentoo     3.56     0.486         7.32 1.78e-12  2.60      4.51   
4 body_mass_g       0.00375  0.000282     13.3  1.16e-32  0.00319   0.00430

Closing Comments

Additional Plotting Comments

I rarely show diagnostic plots:
- For me when fitting data & performing analysis
- No need for ggplot versions

tibble(
  fitted= fitted(bill_length_sp_lm), 
  residuals = resid(bill_length_sp_lm)
) |> 
  ggplot(aes(fitted, residuals)) + 
  geom_point() + 
  geom_smooth(
    se = FALSE, colour = 'red', 
    linewidth = 0.5
    ) +
  ggtitle("Residuals vs Fitted") +
  labs(
    x = "Fitted Values", y = "Residuals"
  )

tibble(
  residuals = rstandard(bill_length_sp_lm)
) |> 
  ggplot(aes(sample = residuals)) + 
  geom_qq() +
  geom_qq_line() +
  ggtitle("Q-Q Residuals") +
  labs(
    x = "Theoretical Quantiles", 
    y = "Standardized Residuals"
  )

Additional Plotting Packages

Multiple options for great looking plots
- ggpmisc for adding correlations, regression equations etc
- ggstats for multiple fresh perspectives on coefficients
- ggstatsplot also a wide range of plotting capabilities

Challenges

How could we account for sex in the existing penguins model?
Load the pigs dataset
Make a boxplot:
- len will be the predictor (\(y\))
- Place dose on the \(x\)-axis using supp to fill the boxes
Find the best model for len, using supp and dose as the predictors
Check the residuals and diagnostic plots
Make sure you can interpret the model coefficients
What is the 99% confidence interval for supp = "VC" & dose = "High"
- What does this mean?

Footnotes

https://bioconductor.org/packages/release/workflows/vignettes/RNAseq123/inst/doc/designmatrices.html↩︎

Linear Regression

Setup

Linear Regression

Linear Regression

Linear Regression

Linear Regression

Linear Regression

Linear Regression

Linear Regression

Coefficients

Coefficients

Residuals

Assumptions Of Linear Regression

Model Checking

Model Checking

Model Checking

Model Checking

Model Checking

Model Checking

Objects Of Class lm

Objects Of Class lm

Objects Of Class lm

Adding Terms

Adding Terms

Adding Terms

Model Diagnostics

Interpreting the Coefficients

Interpreting the Coefficients

Adding Terms

Interpreting the Coefficients

Interpreting the Coefficients

Interpreting the Coefficients

Interpreting the Coefficients

Interpreting the Coefficients

Model Diagnostics

Model Selection

Model Selection

Speeding The Process Up

Speeding The Process Up

Objects of Class summary.lm

Objects of Class summary.lm

Objects of Class summary.lm

Using A More Tidyverse Friendly Approach

Adding Significance Stars

Or Fitting a Model On The Fly

S3 Method Dispatch

S3 Method Dispatch

S3 Method Dispatch

S3 Method Dispatch

S3 Method Dispatch

Confidence Intervals

Confidence Intervals

Confidence Intervals

Confidence Intervals

Closing Comments

Additional Plotting Comments

Additional Plotting Packages

Challenges

Footnotes

Objects Of Class `lm`

Objects Of Class `lm`

Objects Of Class `lm`

Objects of Class `summary.lm`

Objects of Class `summary.lm`

Objects of Class `summary.lm`