?Distributions
Basic Statistics in R
RAdelaide 2024
Statistics in R
Introduction
R
has it’s origins as a statistical analysis language (i.e.S
)- Purpose of this session is NOT to teach statistical theory
- I am a bioinformatician NOT a statistician
- Perform simple analyses in R
- Up to you to know what you’re doing
- Or talk to your usual statisticians & collaborators
Distributions
R
comes with nearly every distribution- Standard syntax for accessing each
Distributions
Distribution | Density | Area Under Curve | Quantile | Random |
---|---|---|---|---|
Normal | dnorm() |
pnorm() |
qnorm() |
rnorm() |
T | dt() |
pt() |
qt() |
rt() |
Uniform | dunif() |
punif() |
qunif() |
runif() |
Exponential | dexp() |
pexp() |
qexp() |
rexp() |
\(\chi^2\) | dchisq() |
pchisq() |
qchisq() |
rchisq() |
Binomial | dbinom() |
pbinom() |
qbinom() |
rbinom() |
Poisson | dpois() |
ppois() |
qpois() |
rpois() |
Distributions
- Also Beta, \(\Gamma\), Log-Normal, F, Geometric, Cauchy, Hypergeometric etc…
Distributions
## dnorm gives the classic bell-curve
tibble(
x = seq(-4, 4, length.out = 1e3)
%>%
) ggplot(aes(x, y = dnorm(x))) +
geom_line()
## pnorm gives the area under the
## bell-curve (which sums to 1)
tibble(
x = seq(-4, 4, length.out = 1e3)
%>%
) ggplot(aes(x, y = pnorm(x))) +
geom_line()
Basic Tests
Data For This Session
We’ll use the pigs
dataset from earlier
library(tidyverse)
<- file.path("data", "pigs.csv") %>%
pigs %>%
read_csv mutate(dose = factor(dose, levels = c("Low", "Med", "High")))
Data For This Session
theme_set(theme_bw())
%>%
pigs ggplot(
aes(x = dose, y = len, fill = supp)
+
) geom_boxplot()
t-tests
- Assumes normally distributed data
- \(t\)-tests always test \(H_0\) Vs \(H_A\)
. . .
- The simplest test is on a simple vector
- This is not particularly meaningful for our data
?t.testt.test(pigs$len)
What is \(H_0\) in the above test?
t-tests
When comparing the means of two vectors
\[ H_0: \mu_{1} = \mu_{2} \\ H_A: \mu_{1} \neq \mu_{2} \]
We could use two vectors (i.e. x
& y
)
<- dplyr::filter(pigs, supp == "VC")$len
vc <- dplyr::filter(pigs, supp == "OJ")$len
oj t.test(x = vc, y = oj)
. . .
Is this a paired test?
t-tests
- An alternative is the
R
formula method:len~supp
- Length is a response variable
- Supplement is the predictor
- Can only use one predictor for a T-test
- Otherwise it’s linear regression
t.test(len~supp, data = pigs)
Did this give the same results?
Wilcoxon Tests
- We assumed the above dataset was normally distributed:
What if it’s not?
. . .
- Non-parametric equivalent is the Wilcoxon Rank-Sum (aka Mann-Whitney)
. . .
- This assigns ranks to each value based on their value
- Tied values can be problematic
- Values not used in calculation of the test statistic
wilcox.test(len~supp, data = pigs)
\(\chi^2\) Test
- Here we need counts
- Commonly used in Observed Vs Expected
\[ H_0: \text{No association between groups and outcome}\\ H_A: \text{Association between groups and outcome} \]
\(\chi^2\) Test
<- matrix(c(25, 8, 6, 15), nrow = 2)
pass colnames(pass) <- c("Pass", "Fail")
rownames(pass) <- c("Attended", "Skipped")
passchisq.test(pass)
. . .
Can anyone remember when we shouldn’t use a \(\chi^2\) test?
Fisher’s Exact Test
- \(\chi^2\) tests became popular in the days of the printed tables
- We now have computers
- Fisher’s Exact Test is preferable in the cases of low cell counts
- (Or any other time…)
- Same \(H_0\) as the \(\chi^2\) test
- Uses the hypergeometric distribution
fisher.test(pass)
Summary of Tests
t.test()
,wilcox.test()
chisq.test()
,fisher.test()
. . .
shapiro.test()
,bartlett.test()
- Tests for normality or homogeneity of variance
. . .
binomial.test()
,poisson.test()
kruskal.test()
,ks.test()
htest
Objects
- All produce objects of class
htest
- Use
print.htest()
to display results - Is really a list
- Use
names()
to see what other values are returned
- Use
. . .
- Can usually extract p-values using
test$p.value
fisher.test(pass)$p.value
Regression
Linear Regression
We are trying to estimate a line with slope & intercept
\[ y = ax + b \]
. . .
Or
\[ y = \beta_0 + \beta_1 x \]
Linear Regression
Linear Regression always uses the R
formula syntax
y ~ x
:y
is a function ofx
- We use the function
lm()
<- lm(len ~ supp , data = pigs)
lm_pigs summary(lm_pigs)
. . .
- Intercept is assumed unless explicitly removed (
~ 0 + ...
)
Linear Regression
- It looks like
supp == VC
reduces the length of the teeth - In reality we’d like to see if dose has an effect as well
<- lm(len ~ supp + dose, data = pigs)
lm_pigs_dose summary(lm_pigs_dose)
. . .
- Which values are associated with the intercept & slope?
. . .
- It looks like an increasing dose-level increases length
Interaction Terms
- We have given each group a separate intercept
- The same slope
- Requires an interaction term for different slopes
<- lm(len ~ supp + dose + supp:dose, data = pigs)
lm_pigs_full summary(lm_pigs_full)
- How do we interpret this?
Interaction Terms
An alternative way to write the above in R
is:
<- lm(len ~ (supp + dose)^2, data = pigs)
lm_pigs_full summary(lm_pigs_full)
Model Selection
Which model should we choose?
anova(lm_pigs, lm_pigs_dose, lm_pigs_full)
Model Selection
Are we happy with our model assumptions?
- Normally distributed
- Constant Variance
- Linear relationship
plot(lm_pigs_full)
Model Selection
- This creates plots using base graphics
- To show them all on the same panel
par(mfrow = c(2, 2))
plot(lm_pigs_full)
Visualising Residuals
mfrow()
stands for multi-frame row- Needs to be reset to a single frame
par(mfrow = c(1, 1))
Logistic Regression
- Logistic Regression models probabilities (e.g. \(H_0: \pi = 0\))
- We can specify two columns to the model
- One would represent total successes, the other failures
- This is
binomial
data, \(\pi\) is the probability of success
. . .
- Alternatively the response might be a vector of
TRUE/FALSE
or0/1
Logistic Regression
- The probability of admission to a PhD1
- Graduate Record Exam scores
- Grade Point Average
- Prestige of admitting institution (1 is most prestigous)
<- read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
admissions glimpse(admissions)
Rows: 400
Columns: 4
$ admit <dbl> 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1…
$ gre <dbl> 380, 660, 800, 640, 520, 760, 560, 400, 540, 700, 800, 440, 760,…
$ gpa <dbl> 3.61, 3.67, 4.00, 3.19, 2.93, 3.00, 2.98, 3.08, 3.39, 3.92, 4.00…
$ rank <dbl> 3, 3, 1, 4, 4, 2, 1, 2, 3, 2, 4, 1, 1, 2, 1, 3, 4, 3, 2, 1, 3, 2…
Logistic Regression
glm(admit ~ gre + gpa + rank, data = admissions, family = "binomial") %>%
summary()
Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial",
data = admissions)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.449548 1.132846 -3.045 0.00233 **
gre 0.002294 0.001092 2.101 0.03564 *
gpa 0.777014 0.327484 2.373 0.01766 *
rank -0.560031 0.127137 -4.405 1.06e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 499.98 on 399 degrees of freedom
Residual deviance: 459.44 on 396 degrees of freedom
AIC: 467.44
Number of Fisher Scoring iterations: 4
Logistic Regression
- Probabilities are fit on the logit scale
- Transforms 0 < \(\pi\) < 1 to \(-\infty\) < logit(\(\pi\)) < \(\infty\)
\[ \text{logit}(\pi) = \log(\frac{\pi}{1-\pi}) \]
Automated Model Selection
- One strategy during model fitting is to fit a heavily parameterised model
R
can remove terms as required using Akaike’s Information Criterion (AIC)- The function is
step()
- The function is
. . .
glm(admit ~ (gre + gpa + rank)^2, data = admissions, family = "binomial") %>%
step() %>%
summary()
. . .
- Here we do end up with the same model
Mixed Effects Models
Mixed effects models include:
- Fixed effects & 2) Random effects
May need to nest results within a biological sample, include day effects etc.
<- MASS::Rabbit
Rabbit head(Rabbit)
Mixed Effects Models
Here we have the change in Blood pressure within the same 5 rabbits
- 6 dose levels of control + 6 dose levels of
MDL
- Just looking within one rabbit
filter(Rabbit, Animal == "R1")
Mixed Effects Models
If fitting within one rabbit \(\implies\) use lm()
<- lm(
lm_r1 ~(Treatment + Dose)^2, data = Rabbit, subset = Animal == "R1"
BPchange
)summary(lm_r1)
Mixed Effects Models
To nest within each rabbit we:
- Use
lmer()
fromlme4
- Introduce a random effect
(1|Animal)
- Captures variance between rabbits
library(lme4)
<- lmer(BPchange~Treatment + Dose + (1|Animal), data = Rabbit)
lme_rabbit summary(lme_rabbit)
coef(summary(lme_rabbit))
- Mixed-effects models were originally fitted using
nlme
- No longer being developed
Mixed Effects Models
This gives \(t\)-statistics, but no \(p\)-value
Why?
. . .
library(lmerTest)
<- lmer(BPchange~Treatment + Dose + (1|Animal), data = Rabbit)
lme_rabbit summary(lme_rabbit)
Mixed Effects Models
- Doug Bates & Ben Bolker are key R experts in this field
- Doug has left the R community
- Lots of discussion for issues estimating DF with random effects
- Ben Bolker also maintains
glmmTMB
andglmmADMB
- Generalised Mixed-effects Models
- https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html
Two linear algebra equations in & I was lost
Modelling Summary
lm()
for standard regression modelsglm()
for generalised linear modelslmer()
for linear mixed-effects models
. . .
- Robust models are implemented in
MASS
Other Statistical Tools
Mutiple Testing in R
- The function
p.adjust()
takes the argumentmethod = ...
p.adjust.methods
[1] "holm" "hochberg" "hommel" "bonferroni" "BH"
[6] "BY" "fdr" "none"
. . .
%>%
lm_pigs_full ::tidy() %>%
broommutate(
adjP = p.adjust(p.value, "bonf"),
sig = case_when(
< 0.001 ~ "***",
adjP < 0.01 ~ "**",
adjP < 0.05 ~ "*",
adjP < 0.1 ~ ".",
adjP TRUE ~ ""
) )
. . .
Also the package multcomp
is excellent but challenging
PCA
- Here we have 50 genes, from two T cell types: Both Stimulated & Resting 2
<- read_csv("data/geneExpression.csv") genes
. . .
- PCA requires a numeric matrix
<- genes %>%
gene_mat as.data.frame() %>%
column_to_rownames("ID") %>%
as.matrix()
PCA
- Our variable of interest here is the cell-types
- We need to set that as the row variable:
- Transpose the data using
t()
- Default settings need tweaking (
S
reverse compatability)
- Transpose the data using
<- gene_mat %>%
pca t() %>%
prcomp(center = TRUE, scale. = TRUE)
summary(pca)
biplot(pca)
screeplot(pca)
. . .
- These plots aren’t great…
PCA
- The output of
prcomp()
is a list with classprcomp
- Components are in
pca$x
- Gene loadings are in
pca$rotation
- Contributions of each gene to each component
. . .
%>% broom::tidy() pca
PCA
- My go-to visualisation trick is
%>%
pca ::tidy() %>%
broompivot_wider(names_from = "PC", values_from = "value", names_prefix = "PC") %>%
ggplot(aes(PC1, PC2)) +
geom_point()
PCA Challenge
In this dataset:
- Treg & Th cells
- Resting encoded with
-
- Stimulated encoded with
+
- Final number is the donor
. . .
- Colour the points by cell type and set the shape by treatment
- Add labels to each point
(Hint: Useggrepel::geom_label_repel()
)