Basic Statistics in R
RAdelaide 2025
Statistics in R
Introduction
- R has its origins as a statistical analysis language (i.e. S)
- Purpose of this session is NOT to teach statistical theory
- I am more of a bioinformatician than statistician
- I did tutor stats for 3 years
- Perform simple analyses in R
- Up to you to know what you’re doing
- Or talk to your usual statisticians & collaborators
Distributions
- R comes with nearly every distribution
- Standard syntax for accessing each
?Distributions
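As a minimal sketch of that standard syntax: each distribution has a root name (e.g. `norm`, `t`, `binom`) combined with a one-letter prefix, `d` for the density, `p` for the cumulative probability, `q` for quantiles and `r` for random draws. The Normal distribution is used here purely as an illustration, and the variable names are hypothetical.

```r
# Density of N(0, 1) at zero
dens <- dnorm(0)
# Cumulative probability P(X <= 1.96)
prob <- pnorm(1.96)
# Quantile: the value with 97.5% of the distribution below it
quant <- qnorm(0.975)
# Five random draws (the seed only makes them reproducible)
set.seed(2025)
draws <- rnorm(5)
c(density = dens, probability = prob, quantile = quant)
```

The same four prefixes work for the other roots, e.g. `dt()`, `pt()`, `qt()` and `rt()` for the T distribution.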
The T Distribution
- A T distribution looks very much like a Standard normal N(0, 1) but has heavier tails
- This allows for greater uncertainty in the tails
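The heavier tails can be checked numerically. This is only a sketch: the degrees of freedom (5 and 30) and the cut-off of 2 are arbitrary choices for illustration.

```r
# Probability of a value beyond +/- 2 under each distribution
tail_norm <- 2 * pnorm(-2)
tail_t5 <- 2 * pt(-2, df = 5)
tail_t30 <- 2 * pt(-2, df = 30)
c(normal = tail_norm, t_df5 = tail_t5, t_df30 = tail_t30)

# Critical values for a two-sided 95% interval
crit <- c(
  normal = qnorm(0.975),
  t_df5 = qt(0.975, df = 5),
  t_df30 = qt(0.975, df = 30)
)
crit
```

The tail probability and the critical value are both largest for the smallest degrees of freedom, and approach the Normal values as the degrees of freedom increase.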
Tests For Continuous Data
Data For This Session
- We’ll use the pigs dataset from earlier
- Start a new session with a new script: BasicStatistics.R
library(tidyverse)
library(scales)
library(car)
# Use a black & white theme with centred titles for all plots
theme_set(
  theme_bw() + theme(plot.title = element_text(hjust = 0.5))
)
# Load the data, setting both categorical columns as factors
pigs <- file.path("data", "pigs.csv") |>
  read_csv() |>
  mutate(
    dose = fct(dose, levels = c("Low", "Med", "High")),
    supp = fct(supp)
  )
Pop Quiz
Can anyone define a p-value?
- A p-value is the probability of observing a test statistic at least as extreme as the one observed, assuming the null hypothesis is true.
- In plain English: If there’s nothing really going on, how likely are we to observe our result, or one even more extreme?
- A p-value of 0.05 \(\implies\) if \(H_0\) is true, about 1 in 20 random samples will give a result at least this extreme
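The "1 in 20" idea can be illustrated with a small simulation. This is a sketch under stated assumptions: the sample size (20), the number of repeats (2000) and the use of a one-sample t-test are arbitrary choices, and the data are simulated so that \(H_0\) is true by construction.

```r
set.seed(2025)
# 2000 samples of size 20, all drawn from N(0, 1) so that H0 (mu = 0) is true
p_values <- replicate(2000, t.test(rnorm(20))$p.value)
# Proportion of samples with p < 0.05; this should be close to 0.05
prop_sig <- mean(p_values < 0.05)
prop_sig
```

Even though nothing is going on in any of these samples, roughly 5% of them still give p < 0.05.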
t-tests
- Assumes normally distributed data
- \(t\)-tests always test \(H_0\) Vs \(H_A\)
- For data with exactly two groups
- The simplest test is on a simple vector
- Not particularly meaningful for our data
?t.test
t.test(some_vector)
What is \(H_0\) in the above test?
The true mean of the underlying distribution from which the vector is sampled is zero, i.e. \(\mu = 0\)
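A minimal sketch of both forms of the test follows. Two assumptions are made here: `some_vector` is simulated purely for illustration, and the built-in `ToothGrowth` data stands in for `pigs`, since it has columns named `len` and `supp` (whether the two datasets are identical is not confirmed here).

```r
set.seed(2025)
# Hypothetical vector, simulated only for illustration
some_vector <- rnorm(25, mean = 0.5, sd = 1)
one_sample <- t.test(some_vector)  # H0: mu = 0
one_sample$null.value
one_sample$conf.int

# Two groups: H0 is that the two group means are equal
# (R uses Welch's t-test by default, i.e. var.equal = FALSE)
two_sample <- t.test(len ~ supp, data = ToothGrowth)
two_sample
```

Setting `mu` in the one-sample call changes the value tested under \(H_0\).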
Wilcoxon Tests
- We assumed the above dataset was normally distributed: what if it’s not?
- Non-parametric equivalent is the Wilcoxon Rank-Sum Test (aka Mann-Whitney)
- This assigns a rank to each observation based on its value
- The test is then performed on ranks NOT the values
- Tied values can be problematic
- Tests that the centre of each underlying distribution is the same
wilcox.test(len ~ supp, data = pigs)
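To make the ranking step concrete, the sketch below ranks the values by hand and compares the result with the statistic reported by `wilcox.test()`. The built-in `ToothGrowth` data is used as a stand-in for `pigs` (an assumption), and the names `r`, `is_oj` and `w_manual` are hypothetical.

```r
# Tied values share the average of the ranks they would occupy
tied_ranks <- rank(c(10, 20, 20, 30))
tied_ranks

# Rank all values together, ignoring the groups
r <- rank(ToothGrowth$len)
is_oj <- ToothGrowth$supp == "OJ"
n_oj <- sum(is_oj)
# W is the rank sum of the first group, minus its minimum possible value
w_manual <- sum(r[is_oj]) - n_oj * (n_oj + 1) / 2

# exact = FALSE requests the normal approximation, as ties rule out an exact p-value
wt <- wilcox.test(len ~ supp, data = ToothGrowth, exact = FALSE)
c(manual = w_manual, wilcox.test = unname(wt$statistic))
```

Without `exact = FALSE`, R falls back to the same approximation when ties are present but issues a warning.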
A Brief Comment
- Both of these are suitable for comparing two groups
- t-tests assume normally distributed data underlie the random sample
- Are robust to some deviation from normality
- Data can sometimes be transformed (e.g. sqrt(), log() etc)
- The Wilcoxon Rank Sum Test assumes nothing about the underlying distribution
- Much less powerful with small sample sizes
- Highly comparable at n \(\geq\) 30
- The package coin implements a range of non-parametric tests
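As a rough sketch of the transformation idea, the example below simulates right-skewed data and compares a t-test on the log scale with a Wilcoxon test on the raw values. Everything here is hypothetical: the data frame `skewed`, the group labels and the log-normal parameters are invented for illustration only.

```r
set.seed(2025)
# Hypothetical right-skewed measurements (log-normal), simulated for illustration
skewed <- data.frame(
  group = rep(c("A", "B"), each = 30),
  value = c(rlnorm(30, meanlog = 1), rlnorm(30, meanlog = 1.5))
)
group_a <- skewed$value[skewed$group == "A"]
# Shapiro-Wilk p-values within one group, before and after a log transformation
c(raw = shapiro.test(group_a)$p.value, logged = shapiro.test(log(group_a))$p.value)

# t-test on the log scale, alongside the rank-based alternative on the raw values
t_log <- t.test(log(value) ~ group, data = skewed)
w_raw <- wilcox.test(value ~ group, data = skewed)
c(t_test_log = t_log$p.value, wilcoxon = w_raw$p.value)
```

Note that a t-test on log-transformed values compares means on the log scale, so the interpretation of the result changes with the transformation.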
Tests For Categorical Data
\(\chi^2\) Test
- Here we need counts and categories
- Commonly used in Observed Vs Expected
\[H_0: \text{No association between groups and outcome}\] \[H_A: \text{Association between groups and outcome}\]
When shouldn’t we use a \(\chi^2\) test?
When expected cell values are < 5 (Cochran 1954)
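Expected cell values can be inspected directly from the test result. The sketch below uses a table built from the built-in `mtcars` data purely as an illustration; it is not the table used in this session, and the names `counts` and `manual` are hypothetical.

```r
# Cross-tabulate transmission type (am) against engine shape (vs)
counts <- table(am = mtcars$am, vs = mtcars$vs)
res <- chisq.test(counts)
# Expected counts under H0, as returned by the test
res$expected
# The same values by hand: (row total * column total) / grand total
manual <- outer(rowSums(counts), colSums(counts)) / sum(counts)
# TRUE would flag cells where the chi-squared approximation may be unreliable
low_expected <- any(res$expected < 5)
low_expected
```

R also issues a warning from `chisq.test()` when the approximation may be incorrect.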
Fisher’s Exact Test
- \(\chi^2\) tests became popular in the days of printed tables
- We now have computers
- Fisher’s Exact Test is preferable in cases of low cell counts
- (Or any other time you feel like it…)
- Same \(H_0\) as the \(\chi^2\) test
- Uses the hypergeometric distribution
fisher.test(pass)
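The link to the hypergeometric distribution can be sketched using the tea-tasting table from the examples in `?fisher.test`, which has low cell counts. The one-sided alternative is chosen here only because it maps onto a single call to `phyper()`; the names `tea`, `ft` and `p_hyper` are hypothetical.

```r
# Fisher's tea-tasting data, as used in the examples of ?fisher.test
tea <- matrix(
  c(3, 1, 1, 3), nrow = 2,
  dimnames = list(Guess = c("Milk", "Tea"), Truth = c("Milk", "Tea"))
)
ft <- fisher.test(tea, alternative = "greater")
# The same one-sided p-value taken directly from the hypergeometric distribution:
# P(X >= 3) when 4 cups are chosen from 4 "Milk" and 4 "Tea"
p_hyper <- phyper(
  tea[1, 1] - 1, m = sum(tea[, 1]), n = sum(tea[, 2]),
  k = sum(tea[1, ]), lower.tail = FALSE
)
c(fisher.test = ft$p.value, phyper = p_hyper)
```

Both values equal 17/70, which is the number of ways of guessing at least 3 of the 4 cups correctly divided by the 70 possible ways of choosing 4 cups from 8.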
Summary of Tests
- t.test(), wilcox.test()
- chisq.test(), fisher.test()
- shapiro.test(), bartlett.test(), car::leveneTest()
  - Tests for normality or homogeneity of variance
- binom.test(), poisson.test()
- kruskal.test(), ks.test()
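A few of these can be sketched in base R. The built-in `ToothGrowth` data is used as a stand-in for `pigs` (an assumption), with `dose` converted to a factor for the Kruskal-Wallis test.

```r
# Shapiro-Wilk within each group: H0 is that the values are normally distributed
sw <- tapply(
  ToothGrowth$len, ToothGrowth$supp,
  function(x) shapiro.test(x)$p.value
)
sw
# Bartlett: H0 is that both groups have the same variance
bt <- bartlett.test(len ~ supp, data = ToothGrowth)
bt$p.value
# Kruskal-Wallis: rank-based comparison across more than two groups
kw <- kruskal.test(len ~ factor(dose), data = ToothGrowth)
kw$parameter
```

With the car package loaded, `leveneTest()` accepts the same formula and data arguments as `bartlett.test()`, provided the grouping variable is a factor.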
htest Objects
- All produce objects of class htest
- Is really a list
- Use names() to see what other values are returned
names(pass_chisq)
[1] "statistic" "parameter" "p.value" "method" "data.name" "observed"
[7] "expected" "residuals" "stdres"
- Will vary slightly between tests
- Can usually extract p-values using test$p.value
pass_chisq$p.value
[1] 0.001711398
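Because every test returns the same kind of list, results from different tests can be handled in the same way. The sketch below uses the built-in `ToothGrowth` data as a stand-in (an assumption), and the names `res`, `tests` and `p_vals` are hypothetical.

```r
# Any htest object will do; a t-test is used here
res <- t.test(len ~ supp, data = ToothGrowth)
class(res)    # "htest"
is.list(res)  # TRUE
names(res)

# Collect p-values from several tests into one named vector
tests <- list(
  t_test = res,
  wilcoxon = wilcox.test(len ~ supp, data = ToothGrowth, exact = FALSE)
)
p_vals <- vapply(tests, function(x) x$p.value, numeric(1))
p_vals
```

Using `vapply()` with `numeric(1)` gives an error if any element fails to return a single number, which is a useful check when looping over many tests.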
References
Cochran, William G. 1954. “Some Methods for Strengthening the Common χ² Tests.” Biometrics 10 (4): 417.