library(tidyverse)
library(palmerpenguins)
theme_set(
theme_bw() + theme(plot.title = element_text(hjust = 0.5))
)
Linear Regression in R
RAdelaide 2025
Linear Regression
Setup
- Start clear R session
- Create a blank R script:
LinearRegression.R
- Load the key packages
Linear Regression
We are trying to estimate a line with slope & intercept
\[ y = ax + b \]
Or
\[ y = \beta_0 + \beta_1 x \]
- \(y\) is the response variable
- \(x\) is the predictor variable
- Makes the most intuitive sense when both \(x\) & \(y\) are continuous
Coefficients
How do we interpret the Intercept?
This is what the bill length would be if a penguin weighed exactly 0
- We’d probably expect this to be zero but Intercepts almost never are
- What does this really tell us about the relationship between bill length and weight?
- Generally focussed on the relationship within the range of observed predictors
- No guaranteed linear relationship outside of this range
Residuals
- Points never lie exactly on the regression line \(\implies\) Why?
- We’re actually fitting the model
\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]
- \(\beta_0 + \beta_1 x_i\) is the exact line co-ordinate (Intercept + slope*predictor)
- \(\epsilon_i\) is the the vertical difference between the observed value and the fitted value
- Known as a residual \(\implies\) defined as \(\epsilon_i \sim \mathcal{N}(0, \sigma)\)
- Residual means the bit left over when we subtract the predicted value from the observed
Assumptions Of Linear Regression
Linear Regression formally has 4 key assumptions
- Linear relationship between predictor and response
- Mean of residuals is zero across the entire range
- Constant variance across the range of data (homoscedasticity)
- Residuals are normally distributed
- Independence of errors
- Three of these are represented in the definition \(\epsilon_i \sim \mathcal{N}(0, \sigma)\)
Model Checking
- To check our fitted model, we should check the residuals to ensure \(\epsilon_i \sim \mathcal{N}(0, \sigma)\)
## Check the residuals. It will ask you the following
## Hit <Return> to see next plot:
## So follow the instructions and check each plot before proceeding
## There are 4 plots by default
plot(bill_length_lm)
Objects Of Class lm
- The linear model we fitted produced an object of class
lm
class(bill_length_lm)
[1] "lm"
- This is also a list
## Inspect the actual object
glimpse(bill_length_lm)
Adding Terms
## Does bill length depend on body mass?
## Plot the predictor (body mass) on `x`
## The response (bill_length) goes on `y`
## Does each species need it's own model
|>
penguins ggplot(
aes(
body_mass_g, bill_length_mm, colour = species
)+
) geom_point() +
geom_smooth(method = "lm")
- Do we think these slopes are the same?
- Do we think the intercepts are the same?
Model Diagnostics
## Check the histogram of residuals
hist(bill_length_sp_lm$residuals, breaks = 20)
## Perform the Shapiro-Wilk test for Normality
shapiro.test(bill_length_sp_lm$residuals)
Shapiro-Wilk normality test
data: bill_length_sp_lm$residuals
W = 0.99317, p-value = 0.123
Interpreting the Coefficients
- Now we’re happier with the model \(\rightarrow\) what do the coefficients mean?
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.919470977 1.0630034684 23.442511 7.632983e-73
speciesChinstrap 9.920884113 0.3510790185 28.258265 5.093822e-91
speciesGentoo 3.557977539 0.4857896978 7.324111 1.776921e-12
body_mass_g 0.003748497 0.0002823439 13.276352 1.158990e-32
- By default, the primary Intercept is the first factor level in
species
levels(penguins$species)
[1] "Adelie" "Chinstrap" "Gentoo"
Model Diagnostics
## Check the residuals after
## including a separate intercept
## and slope for species
par(mfrow = c(2, 2))
plot(bill_length_int_lm)
par(mfrow = c(1, 1))
Model Selection
- How do we decide on the best model?
- A common technique is Analysis of Variance (ANOVA)
- Classic ANOVA checks importance of each term within a model
## Run a separate ANOVA on each model
anova(bill_length_lm)
anova(bill_length_sp_lm)
anova(bill_length_int_lm)
- Does this give any clue as to the best model?
Speeding The Process Up
- This was a careful breakdown of finding the best model
- We can partially automate this and use some shortcuts
- The shorthand for an interaction term with separate intercepts is
*
## Refit the model with an interaction term using the shorthand code
<- lm(
bill_length_int_lm ~ species * body_mass_g, data = penguins
bill_length_mm
)summary(bill_length_int_lm)
- Alternatively, all terms can be placed inside brackets & raised to a power
(species + body_mass_g)^2
would give two-way interactions
Objects of Class summary.lm
- The coefficients element of the basic
lm
object only had the fitted values- Not the std errors, t-tests or p-values
## Extract the coefficients directly from the linear model
$coefficients bill_length_sp_lm
(Intercept) speciesChinstrap speciesGentoo body_mass_g
24.919470977 9.920884113 3.557977539 0.003748497
- These values are produced by the function
summary()
## Extract the coefficients from the summary object
summary(bill_length_sp_lm)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.919470977 1.0630034684 23.442511 7.632983e-73
speciesChinstrap 9.920884113 0.3510790185 28.258265 5.093822e-91
speciesGentoo 3.557977539 0.4857896978 7.324111 1.776921e-12
body_mass_g 0.003748497 0.0002823439 13.276352 1.158990e-32
Using A More Tidyverse Friendly Approach
- The function
tidy()
from the packagebroom
is a catch-all function- Will return a tibble
- Returns the same from
lm
andsummary.lm
objects
|> broom::tidy() bill_length_sp_lm
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 24.9 1.06 23.4 7.63e-73
2 speciesChinstrap 9.92 0.351 28.3 5.09e-91
3 speciesGentoo 3.56 0.486 7.32 1.78e-12
4 body_mass_g 0.00375 0.000282 13.3 1.16e-32
Adding Significance Stars
- The easiest way for me as a
case_when()
|>
bill_length_sp_lm ::tidy() |>
broommutate(
stars = case_when(
< 0.001 ~ "***",
p.value < 0.01 ~ "**",
p.value < 0.05 ~ "*",
p.value < 0.1 ~ ".",
p.value TRUE ~ ""
) )
# A tibble: 4 × 6
term estimate std.error statistic p.value stars
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 (Intercept) 24.9 1.06 23.4 7.63e-73 ***
2 speciesChinstrap 9.92 0.351 28.3 5.09e-91 ***
3 speciesGentoo 3.56 0.486 7.32 1.78e-12 ***
4 body_mass_g 0.00375 0.000282 13.3 1.16e-32 ***
Or Fitting a Model On The Fly
- Obviously no room for checking model diagnostics
|>
penguins ## Piped data can be recalled using `_`
lm( bill_length_mm ~ species * body_mass_g, data = _) |>
step() |> # Fit the best model using the AIC
::tidy() |> # Turn the output into a tibble & add stars
broommutate(
stars = case_when(
< 0.001 ~ "***",
p.value < 0.01 ~ "**",
p.value < 0.05 ~ "*",
p.value < 0.1 ~ ".",
p.value TRUE ~ ""
) )
Start: AIC=604.42
bill_length_mm ~ species * body_mass_g
Df Sum of Sq RSS AIC
- species:body_mass_g 2 18.596 1952.0 603.69
<none> 1933.4 604.42
Step: AIC=603.69
bill_length_mm ~ species + body_mass_g
Df Sum of Sq RSS AIC
<none> 1952.0 603.69
- body_mass_g 1 1017.9 2969.9 745.22
- species 2 4612.5 6564.5 1014.48
# A tibble: 4 × 6
term estimate std.error statistic p.value stars
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 (Intercept) 24.9 1.06 23.4 7.63e-73 ***
2 speciesChinstrap 9.92 0.351 28.3 5.09e-91 ***
3 speciesGentoo 3.56 0.486 7.32 1.78e-12 ***
4 body_mass_g 0.00375 0.000282 13.3 1.16e-32 ***
S3 Method Dispatch
S3 Method Dispatch
- Objects with class
lm
andsummary.lm
areS3
objects - Very informal class structure in
R
- Easy to work with \(\implies\) easy to break
- When we printed these objects \(\implies\)
print.lm()
orprint.summary.lm()
- Likewise when we plotted the
lm
object \(\implies\)plot.lm()
?plot.lm
Confidence Intervals
- For confidence or prediction intervals, we can use
predict.lm()
What’s the difference between a confidence and prediction interval
- 95% Confidence Intervals: The mean will be in the given interval 95% of the time
- 95% Prediction Intervals: An observation will be in the given interval 95% of the time
- We will need a new data frame to make the predictions about
Closing Comments
Additional Plotting Comments
- I rarely show diagnostic plots:
- For me when fitting data & performing analysis
- No need for ggplot versions
tibble(
fitted= fitted(bill_length_sp_lm),
residuals = resid(bill_length_sp_lm)
|>
) ggplot(aes(fitted, residuals)) +
geom_point() +
geom_smooth(
se = FALSE, colour = 'red',
linewidth = 0.5
+
) ggtitle("Residuals vs Fitted") +
labs(
x = "Fitted Values", y = "Residuals"
)
tibble(
residuals = rstandard(bill_length_sp_lm)
|>
) ggplot(aes(sample = residuals)) +
geom_qq() +
geom_qq_line() +
ggtitle("Q-Q Residuals") +
labs(
x = "Theoretical Quantiles",
y = "Standardized Residuals"
)
Additional Plotting Packages
- Multiple options for great looking plots
ggpmisc
for adding correlations, regression equations etcggstats
for multiple fresh perspectives on coefficientsggstatsplot
also a wide range of plotting capabilities
Challenges
- How could we account for
sex
in the existingpenguins
model? - Load the
pigs
dataset - Make a boxplot:
len
will be the predictor (\(y\))- Place
dose
on the \(x\)-axis usingsupp
to fill the boxes
- Find the best model for
len
, usingsupp
anddose
as the predictors - Check the residuals and diagnostic plots
- Make sure you can interpret the model coefficients
- What is the 99% confidence interval for
supp = "VC"
&dose = "High"
- What does this mean?
Footnotes
https://bioconductor.org/packages/release/workflows/vignettes/RNAseq123/inst/doc/designmatrices.html↩︎