Basic Statistics in R

RAdelaide 2025

Dr Stevie Pederson

Black Ochre Data Labs
The Kids Research Institute Australia

July 9, 2025

Statistics in R

Introduction

  • R has its origins in a statistical analysis language (i.e. S)
  • Purpose of this session is NOT to teach statistical theory
    • I am more of a bioinformatician than statistician
    • I did tutor stats for 3 years
  • Perform simple analyses in R
  • Up to you to know what you’re doing
    • Or talk to your usual statisticians & collaborators

Tests For Continuous Data

Data For This Session

  • We’ll use the pigs dataset from earlier
  • Start a new session with a new script: BasicStatistics.R
library(tidyverse)
library(scales)
library(car)
theme_set(
  theme_bw() + theme(plot.title = element_text(hjust = 0.5))
)
pigs <- file.path("data", "pigs.csv") |>
    read_csv() |>
    mutate(
      dose = fct(dose, levels = c("Low", "Med", "High")),
      supp = fct(supp)
    )
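
A quick sanity check that the columns imported as intended (a minimal sketch, assuming the pigs.csv from the earlier session):

## Check the column types: len should be numeric; dose & supp factors
glimpse(pigs)
levels(pigs$dose)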

Data For This Session

pigs |> 
  ggplot(
    aes(x = dose, y = len, fill = supp)
  ) +
    geom_boxplot()

Pop Quiz

Can anyone define a p-value?

  • A p-value is the probability of observing a test statistic at least as extreme as the one observed, assuming the null hypothesis is true.
  • In plain English: If there’s nothing really going on, how likely are we to observe our result, or one even more extreme?
  • A p-value of 0.05 \(\implies\) if \(H_0\) is true, about 1 in 20 random samples will show a result at least this extreme
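
We can sketch this interpretation with a small simulation (the object names here are illustrative, not part of the session data):

## Run 1000 one-sample t-tests where H0 is actually true (true mean = 0)
set.seed(101)
null_p <- replicate(1000, t.test(rnorm(20))$p.value)
## The proportion with p < 0.05 should be close to 0.05, i.e. ~1 in 20
mean(null_p < 0.05)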

t-tests

  • Assumes normally distributed data
  • \(t\)-tests always test \(H_0\) vs \(H_A\)
    • For data with exactly two groups
  • The simplest test is on a single vector
    • Not particularly meaningful for our data
?t.test
t.test(some_vector)

What is \(H_0\) in the above test?

The true mean of the underlying distribution from which the vector is sampled, is zero: i.e. \(\mu = 0\)
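
For a concrete, runnable version (some_vector above is just a placeholder; here we simulate one):

## Simulate a vector where the true mean is 1, then test H0: mu = 0
set.seed(42)
some_vector <- rnorm(20, mean = 1)
t.test(some_vector)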

t-tests

When comparing the means of two vectors

\[ H_0: \mu_{1} = \mu_{2} \\ H_A: \mu_{1} \neq \mu_{2} \]

We could use two vectors (i.e. x & y)

vc <- dplyr::filter(pigs, supp == "VC")$len
oj <- dplyr::filter(pigs, supp == "OJ")$len
t.test(x = vc, y = oj)

Is This a Paired Test?

No. Each animal received only one supplement, so the VC and OJ values are not matched pairs.
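
For reference, a paired test is requested with paired = TRUE; this runs, but is not appropriate here given the observations are unmatched (illustration only):

## NOT valid for these data: shown only to demonstrate the argument
t.test(x = vc, y = oj, paired = TRUE)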

t-tests

  • An alternative is the R formula method: len~supp
    • Length is a response variable
    • Supplement is the predictor
  • Can only use one predictor for a \(t\)-test \(\implies\) with more predictors it becomes linear regression
t.test(len~supp, data = pigs)

Did this give the same results?

t-tests

  • Do we think the variance is equal between the two groups?
pigs |> summarise(sd = sd(len), .by = supp)
# A tibble: 2 × 2
  supp     sd
  <fct> <dbl>
1 VC     8.27
2 OJ     6.61
  • We can use Levene’s Test to formalise this
    • From the package car
    • Bartlett’s test is very similar (bartlett.test())
leveneTest(len~supp, data = pigs)

t-tests

  • Now we can assume equal variances
    • By default, t.test() assumes unequal variances (i.e. var.equal = FALSE)
t.test(len~supp, data = pigs, var.equal = TRUE)
  • If relevant, the confidence interval can also be adjusted
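
For example, a 99% confidence interval can be requested via conf.level:

## Widen the confidence interval from the default 95%
t.test(len~supp, data = pigs, var.equal = TRUE, conf.level = 0.99)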

Wilcoxon Tests

  • We assumed the above dataset was normally distributed:
    What if it’s not?
  • Non-parametric equivalent is the Wilcoxon Rank-Sum Test (aka Mann-Whitney)
  • Each value is converted to a rank
    • The test is then performed on the ranks NOT the values
    • Tied values can be problematic
  • Tests whether the centres of the underlying distributions are the same
wilcox.test(len~supp, data = pigs)
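
If ties are present, an exact p-value cannot be computed and R warns before falling back to a normal approximation; setting exact = FALSE requests the approximation directly:

## Use the normal approximation explicitly (avoids the ties warning)
wilcox.test(len~supp, data = pigs, exact = FALSE)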

A Brief Comment

  • Both of these are suitable for comparing two groups
  • \(t\)-tests assume the random sample is drawn from normally distributed data
    • They are robust to some deviation from normality
    • Data can sometimes be transformed (e.g. sqrt(), log() etc.), as sketched after this list
  • The Wilcoxon Rank Sum Test assumes nothing about the underlying distribution
    • Much less powerful with small sample sizes
    • Highly comparable at n \(\geq\) 30
  • The package coin implements a range of non-parametric tests
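
As a sketch of checking normality before and after transformation (shapiro.test() also appears in the summary later):

## Shapiro-Wilk test for normality, on raw then log-transformed values
shapiro.test(pigs$len)
shapiro.test(log(pigs$len))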

Tests For Categorical Data

\(\chi^2\) Test

  • Here we need counts and categories
  • Commonly used in Observed Vs Expected

\[H_0: \text{No association between groups and outcome}\] \[H_A: \text{Association between groups and outcome}\]

When shouldn’t we use a \(\chi^2\) test?

When any expected cell values are < 5 (Cochran 1954)

\(\chi^2\) Test

pass <- matrix(
  c(25, 8, 6, 15), nrow = 2, 
  dimnames = list(
    c("Attended", "Skipped"), 
    c("Pass", "Fail"))
)
pass
         Pass Fail
Attended   25    6
Skipped     8   15


pass_chisq <- chisq.test(pass)
pass_chisq

    Pearson's Chi-squared test with Yates' continuity correction

data:  pass
X-squared = 9.8359, df = 1, p-value = 0.001711
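
To check Cochran’s rule for this table, the expected counts are stored on the returned object:

## Expected counts under H0; the smallest is ~8.9, so all exceed 5
pass_chisq$expected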

Fisher’s Exact Test

  • \(\chi^2\) tests became popular in the days of printed tables
    • We now have computers
  • Fisher’s Exact Test is preferable in the cases of low cell counts
    • (Or any other time you feel like it…)
  • Same \(H_0\) as the \(\chi^2\) test
  • Uses the hypergeometric distribution
fisher.test(pass)

Summary of Tests

  • t.test(), wilcox.test()
  • chisq.test(), fisher.test()
  • shapiro.test(), bartlett.test(), car::leveneTest()
    • Tests for normality or homogeneity of variance
  • binom.test(), poisson.test()
  • kruskal.test(), ks.test()

htest Objects

  • All of the above tests produce objects of class htest
  • An htest object is really a list
    • Use names() to see what other values are returned
names(pass_chisq)
[1] "statistic" "parameter" "p.value"   "method"    "data.name" "observed" 
[7] "expected"  "residuals" "stdres"   
  • Will vary slightly between tests
  • Can usually extract p-values using test$p.value
pass_chisq$p.value
[1] 0.001711398

htest Objects

## Have a look at the list elements produced by fisher.test
fisher.test(pass) |> names()
[1] "p.value"     "conf.int"    "estimate"    "null.value"  "alternative"
[6] "method"      "data.name"  


## Are these similar to those produced by t.test?
t.test(len~supp, data = pigs) |> names()
 [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
 [6] "null.value"  "stderr"      "alternative" "method"      "data.name"  
  • There is a function print.htest() which organises the printout for us

References

Cochran, William G. 1954. “Some Methods for Strengthening the Common χ² Tests.” Biometrics 10 (4): 417–451.