library(tidyverse)
library(palmerpenguins)
theme_set(
theme_bw() + theme(plot.title = element_text(hjust = 0.5))
)
RAdelaide 2025
July 9, 2025
LinearRegression.R
We are trying to estimate a line with slope & intercept
\[ y = ax + b \]
Or
\[ y = \beta_0 + \beta_1 x \]
Can extend to multi-dimensional predictors
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \]
Or even
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \ldots + \beta_n x_n \]
Linear Regression always uses the R
formula syntax
y ~ x
: y
depends on x
lm()
Call:
lm(formula = bill_length_mm ~ body_mass_g, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-10.1251 -3.0434 -0.8089 2.0711 16.1109
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.690e+01 1.269e+00 21.19 <2e-16 ***
body_mass_g 4.051e-03 2.967e-04 13.65 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.394 on 340 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.3542, Adjusted R-squared: 0.3523
F-statistic: 186.4 on 1 and 340 DF, p-value: < 2.2e-16
Call:
Coefficients
element
How do we interpret the Intercept?
This is what the bill length would be if a penguin weighed exactly 0
How do we interpret the body_mass_g
term?
This is how much we would expect the bill length to change for every one unit increase in the predictor
i.e. for every 1\(g\) increase in weight, the bill length would be expected to increase by about 0.0041mm
\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]
Linear Regression formally has 4 key assumptions
lm
lm
lm
lm
Shapiro-Wilk normality test
data: bill_length_lm$residuals
W = 0.95439, p-value = 8.217e-09
## Include the species in the model
## NB: This will fit a separate intercept for each species
bill_length_sp_lm <- lm(bill_length_mm ~ species + body_mass_g, data = penguins)
summary(bill_length_sp_lm)
Call:
lm(formula = bill_length_mm ~ species + body_mass_g, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-6.8129 -1.6718 0.1336 1.4720 9.2902
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.492e+01 1.063e+00 23.443 < 2e-16 ***
speciesChinstrap 9.921e+00 3.511e-01 28.258 < 2e-16 ***
speciesGentoo 3.558e+00 4.858e-01 7.324 1.78e-12 ***
body_mass_g 3.749e-03 2.823e-04 13.276 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.403 on 338 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.808, Adjusted R-squared: 0.8063
F-statistic: 474 on 3 and 338 DF, p-value: < 2.2e-16
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.919470977 1.0630034684 23.442511 7.632983e-73
speciesChinstrap 9.920884113 0.3510790185 28.258265 5.093822e-91
speciesGentoo 3.557977539 0.4857896978 7.324111 1.776921e-12
body_mass_g 0.003748497 0.0002823439 13.276352 1.158990e-32
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.919470977 1.0630034684 23.442511 7.632983e-73
speciesChinstrap 9.920884113 0.3510790185 28.258265 5.093822e-91
speciesGentoo 3.557977539 0.4857896978 7.324111 1.776921e-12
body_mass_g 0.003748497 0.0002823439 13.276352 1.158990e-32
R
using an “interaction term” separating terms by :
species:body_mass_g
Call:
lm(formula = bill_length_mm ~ species + body_mass_g + species:body_mass_g,
data = penguins)
Residuals:
Min 1Q Median 3Q Max
-6.4208 -1.6461 0.0919 1.4718 9.3138
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.9941391 1.5926015 16.950 < 2e-16 ***
speciesChinstrap 5.1800537 3.2746719 1.582 0.115
speciesGentoo -0.2545907 2.7138655 -0.094 0.925
body_mass_g 0.0031879 0.0004271 7.464 7.27e-13 ***
speciesChinstrap:body_mass_g 0.0012748 0.0008740 1.459 0.146
speciesGentoo:body_mass_g 0.0009030 0.0006066 1.489 0.138
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.399 on 336 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.8098, Adjusted R-squared: 0.807
F-statistic: 286.1 on 5 and 336 DF, p-value: < 2.2e-16
body_mass_g
are now both for AdeliespeciesChinstrap:body_mass_g
and speciesGentoo:body_mass_g
The regression line for Adelie is y = 26.99 + 0.0032*body_mass_g
For Chinstrap: y = (26.99+5.18) + (0.0032+0.0013)*body_mass_g
## No need to save this, I'm just demonstrating a point
## The inverse of a matrix is found using `solve()`
## Matrix multiplication is performed using %*%
## The transpose of a matrix is found using `t()`
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
cbind(beta_hat, coef(bill_length_int_lm))
[,1] [,2]
(Intercept) 26.9941391366 26.9941391367
speciesChinstrap 5.1800537287 5.1800537287
speciesGentoo -0.2545906615 -0.2545906615
body_mass_g 0.0031878758 0.0031878758
speciesChinstrap:body_mass_g 0.0012748183 0.0012748183
speciesGentoo:body_mass_g 0.0009029956 0.0009029956
~ 0 + species + body_mass_g.
Analysis of Variance Table
Model 1: bill_length_mm ~ body_mass_g
Model 2: bill_length_mm ~ species + body_mass_g
Model 3: bill_length_mm ~ species + body_mass_g + species:body_mass_g
Res.Df RSS Df Sum of Sq F Pr(>F)
1 340 6564.5
2 338 1952.0 2 4612.5 400.8045 <2e-16 ***
3 336 1933.4 2 18.6 1.6159 0.2003
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
*
(species + body_mass_g)^2
would give two-way interactionsstep()
to remove redundant terms
Start: AIC=604.42
bill_length_mm ~ species * body_mass_g
Df Sum of Sq RSS AIC
- species:body_mass_g 2 18.596 1952.0 603.69
<none> 1933.4 604.42
Step: AIC=603.69
bill_length_mm ~ species + body_mass_g
Df Sum of Sq RSS AIC
<none> 1952.0 603.69
- body_mass_g 1 1017.9 2969.9 745.22
- species 2 4612.5 6564.5 1014.48
Call:
lm(formula = bill_length_mm ~ species + body_mass_g, data = penguins)
Coefficients:
(Intercept) speciesChinstrap speciesGentoo body_mass_g
24.919471 9.920884 3.557978 0.003748
summary.lm
lm
object only had the fitted values
(Intercept) speciesChinstrap speciesGentoo body_mass_g
24.919470977 9.920884113 3.557977539 0.003748497
summary()
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.919470977 1.0630034684 23.442511 7.632983e-73
speciesChinstrap 9.920884113 0.3510790185 28.258265 5.093822e-91
speciesGentoo 3.557977539 0.4857896978 7.324111 1.776921e-12
body_mass_g 0.003748497 0.0002823439 13.276352 1.158990e-32
summary.lm
[1] "summary.lm"
## Prove conclusively that it is really a list
## with a class attribute stuck on it
summary(bill_length_sp_lm) |> is.list()
[1] TRUE
$names
[1] "call" "terms" "residuals" "coefficients"
[5] "aliased" "sigma" "df" "r.squared"
[9] "adj.r.squared" "fstatistic" "cov.unscaled" "na.action"
$class
[1] "summary.lm"
summary.lm
List of 12
$ call : language lm(formula = bill_length_mm ~ species + body_mass_g, data = penguins)
$ terms :Classes 'terms', 'formula' language bill_length_mm ~ species + body_mass_g
.. ..- attr(*, "variables")= language list(bill_length_mm, species, body_mass_g)
.. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. ..- attr(*, "term.labels")= chr [1:2] "species" "body_mass_g"
.. ..- attr(*, "order")= int [1:2] 1 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(bill_length_mm, species, body_mass_g)
.. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "factor" "numeric"
.. .. ..- attr(*, "names")= chr [1:3] "bill_length_mm" "species" "body_mass_g"
$ residuals : Named num [1:342] 0.124 0.336 3.198 -1.152 0.699 ...
..- attr(*, "names")= chr [1:342] "1" "2" "3" "5" ...
$ coefficients : num [1:4, 1:4] 24.91947 9.92088 3.55798 0.00375 1.063 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:4] "(Intercept)" "speciesChinstrap" "speciesGentoo" "body_mass_g"
.. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
$ aliased : Named logi [1:4] FALSE FALSE FALSE FALSE
..- attr(*, "names")= chr [1:4] "(Intercept)" "speciesChinstrap" "speciesGentoo" "body_mass_g"
$ sigma : num 2.4
$ df : int [1:3] 4 338 4
$ r.squared : num 0.808
$ adj.r.squared: num 0.806
$ fstatistic : Named num [1:3] 474 3 338
..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
$ cov.unscaled : num [1:4, 1:4] 1.96e-01 -4.97e-03 6.36e-02 -5.11e-05 -4.97e-03 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:4] "(Intercept)" "speciesChinstrap" "speciesGentoo" "body_mass_g"
.. ..$ : chr [1:4] "(Intercept)" "speciesChinstrap" "speciesGentoo" "body_mass_g"
$ na.action : 'omit' Named int [1:2] 4 272
..- attr(*, "names")= chr [1:2] "4" "272"
- attr(*, "class")= chr "summary.lm"
tidy()
from the package broom
is a catch-all function
lm
and summary.lm
objectscase_when()
bill_length_sp_lm |>
broom::tidy() |>
mutate(
stars = case_when(
p.value < 0.001 ~ "***",
p.value < 0.01 ~ "**",
p.value < 0.05 ~ "*",
p.value < 0.1 ~ ".",
TRUE ~ ""
)
)
# A tibble: 4 × 6
term estimate std.error statistic p.value stars
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 (Intercept) 24.9 1.06 23.4 7.63e-73 ***
2 speciesChinstrap 9.92 0.351 28.3 5.09e-91 ***
3 speciesGentoo 3.56 0.486 7.32 1.78e-12 ***
4 body_mass_g 0.00375 0.000282 13.3 1.16e-32 ***
penguins |>
## Piped data can be recalled using `_`
lm( bill_length_mm ~ species * body_mass_g, data = _) |>
step() |> # Fit the best model using the AIC
broom::tidy() |> # Turn the output into a tibble & add stars
mutate(
stars = case_when(
p.value < 0.001 ~ "***",
p.value < 0.01 ~ "**",
p.value < 0.05 ~ "*",
p.value < 0.1 ~ ".",
TRUE ~ ""
)
)
Start: AIC=604.42
bill_length_mm ~ species * body_mass_g
Df Sum of Sq RSS AIC
- species:body_mass_g 2 18.596 1952.0 603.69
<none> 1933.4 604.42
Step: AIC=603.69
bill_length_mm ~ species + body_mass_g
Df Sum of Sq RSS AIC
<none> 1952.0 603.69
- body_mass_g 1 1017.9 2969.9 745.22
- species 2 4612.5 6564.5 1014.48
# A tibble: 4 × 6
term estimate std.error statistic p.value stars
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 (Intercept) 24.9 1.06 23.4 7.63e-73 ***
2 speciesChinstrap 9.92 0.351 28.3 5.09e-91 ***
3 speciesGentoo 3.56 0.486 7.32 1.78e-12 ***
4 body_mass_g 0.00375 0.000282 13.3 1.16e-32 ***
lm
and summary.lm
are S3
objectsR
print.lm()
or print.summary.lm()
lm
object \(\implies\) plot.lm()
broom::tidy()
broom::tidy()
has multiple versions for objects of different classes
broom::tidy.prcomp()
for PCA resultsS3
objects can be easily brokenR
looked for the print.htest
methodpredict.lm()
What’s the difference between a confidence and prediction interval
## Create some new penguins
new_penguins <- tibble(
species = c("Adelie", "Gentoo"),
body_mass_g = 4500
)
## Predict their mean bill length
predict(bill_length_sp_lm, newdata = new_penguins, interval = "confidence")
fit lwr upr
1 41.78771 41.20030 42.37512
2 45.34568 44.81277 45.87860
## Predict the range a new observation may lie in
predict(bill_length_sp_lm, newdata = new_penguins, interval = "prediction")
fit lwr upr
1 41.78771 37.02436 46.55105
2 45.34568 40.58876 50.10261
## Predict their mean bill length, but include the original data
new_penguins |>
bind_cols(
predict(bill_length_sp_lm, newdata = new_penguins, interval = "confidence")
)
# A tibble: 2 × 5
species body_mass_g fit lwr upr
<chr> <dbl> <dbl> <dbl> <dbl>
1 Adelie 4500 41.8 41.2 42.4
2 Gentoo 4500 45.3 44.8 45.9
broom::tidy()
## Use broom:tidy to find confidence intervals for coefficients
bill_length_sp_lm |>
broom::tidy(conf.int = TRUE)
# A tibble: 4 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 24.9 1.06 23.4 7.63e-73 22.8 27.0
2 speciesChinstrap 9.92 0.351 28.3 5.09e-91 9.23 10.6
3 speciesGentoo 3.56 0.486 7.32 1.78e-12 2.60 4.51
4 body_mass_g 0.00375 0.000282 13.3 1.16e-32 0.00319 0.00430
ggpmisc
for adding correlations, regression equations etcggstats
for multiple fresh perspectives on coefficientsggstatsplot
also a wide range of plotting capabilitiessex
in the existing penguins
model?pigs
datasetlen
will be the predictor (\(y\))dose
on the \(x\)-axis using supp
to fill the boxeslen
, using supp
and dose
as the predictorssupp = "VC"
& dose = "High"