Basic Statistics in R

RAdelaide 2025

Dr Stevie Pederson

Black Ochre Data Labs
The Kids Research Institute Australia

July 9, 2025

Statistics in R

Introduction

  • R has its origins in a statistical analysis language (i.e. S)
  • Purpose of this session is NOT to teach statistical theory
    • I am more of a bioinformatician than statistician
    • I did tutor stats for 3 years
  • Perform simple analyses in R
  • Up to you to know what you’re doing
    • Or talk to your usual statisticians & collaborators

Tests For Continuous Data

Data For This Session

  • We’ll use the pigs dataset from earlier
  • Start a new session with a new script: BasicStatistics.R
library(tidyverse)
library(scales)
library(car)
theme_set(
  theme_bw() + theme(plot.title = element_text(hjust = 0.5))
)
pigs <- file.path("data", "pigs.csv") |>
    read_csv() |>
    mutate(
      dose = fct(dose, levels = c("Low", "Med", "High")),
      supp = fct(supp)
    )
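
A quick sanity check that the columns imported as intended (a minimal sketch, assuming the pigs.csv from the earlier session):

## Check the column types: len should be numeric; dose & supp factors
glimpse(pigs)
levels(pigs$dose)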

Data For This Session

pigs |> 
  ggplot(
    aes(x = dose, y = len, fill = supp)
  ) +
    geom_boxplot()

Pop Quiz

Can anyone define a p-value?

  • A p-value is the probability of observing a test statistic at least as extreme as the one observed, assuming the null hypothesis is true.
  • In plain English: If there’s nothing really going on, how likely are we to observe our result, or one even more extreme?
  • A p-value of 0.05 \(\implies\) if \(H_0\) is true, about 1 in 20 random samples will show a result at least this extreme
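
We can sketch this interpretation with a small simulation (the object names here are illustrative, not part of the session data):

## Run 1000 one-sample t-tests where H0 is actually true (true mean = 0)
set.seed(101)
null_p <- replicate(1000, t.test(rnorm(20))$p.value)
## The proportion with p < 0.05 should be close to 0.05, i.e. ~1 in 20
mean(null_p < 0.05)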

t-tests

  • Assumes normally distributed data
  • \(t\)-tests always test \(H_0\) vs \(H_A\)
    • For data with exactly two groups
  • The simplest test is on a single vector
    • Not particularly meaningful for our data
?t.test
t.test(some_vector)

What is \(H_0\) in the above test?

The true mean of the underlying distribution from which the vector is sampled, is zero: i.e. \(\mu = 0\)
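
For a concrete, runnable version (some_vector above is just a placeholder; here we simulate one):

## Simulate a vector where the true mean is 1, then test H0: mu = 0
set.seed(42)
some_vector <- rnorm(20, mean = 1)
t.test(some_vector)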

t-tests

When comparing the means of two vectors

\[ H_0: \mu_{1} = \mu_{2} \\ H_A: \mu_{1} \neq \mu_{2} \]

We could use two vectors (i.e. x & y)

vc <- dplyr::filter(pigs, supp == "VC")$len
oj <- dplyr::filter(pigs, supp == "OJ")$len
t.test(x = vc, y = oj)

Is This a Paired Test?

No. Each animal received only one supplement, so the VC and OJ values are not matched pairs.
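
For reference, a paired test is requested with paired = TRUE; this runs, but is not appropriate here given the observations are unmatched (illustration only):

## NOT valid for these data: shown only to demonstrate the argument
t.test(x = vc, y = oj, paired = TRUE)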

t-tests

  • An alternative is the R formula method: len~supp
    • Length is a response variable
    • Supplement is the predictor
  • Can only use one predictor for a \(t\)-test \(\implies\) with more predictors it becomes linear regression
t.test(len~supp, data = pigs)

Did this give the same results?

t-tests

  • Do we think the variance is equal between the two groups?
pigs |> summarise(sd = sd(len), .by = supp)
# A tibble: 2 × 2
  supp     sd
  <fct> <dbl>
1 VC     8.27
2 OJ     6.61
  • We can use Levene’s Test to formalise this
    • From the package car
    • Bartlett’s test is very similar (bartlett.test())
leveneTest(len~supp, data = pigs)

t-tests

  • Now we can assume equal variances
    • By default, t.test() assumes unequal variances (i.e. var.equal = FALSE)
t.test(len~supp, data = pigs, var.equal = TRUE)
  • If relevant, the confidence interval can also be adjusted
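
For example, a 99% confidence interval can be requested via conf.level:

## Widen the confidence interval from the default 95%
t.test(len~supp, data = pigs, var.equal = TRUE, conf.level = 0.99)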

Wilcoxon Tests

  • We assumed the above dataset was normally distributed:
    What if it’s not?
  • Non-parametric equivalent is the Wilcoxon Rank-Sum Test (aka Mann-Whitney)
  • Each value is converted to a rank
    • The test is then performed on the ranks NOT the values
    • Tied values can be problematic
  • Tests whether the centres of the underlying distributions are the same
wilcox.test(len~supp, data = pigs)
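
If ties are present, an exact p-value cannot be computed and R warns before falling back to a normal approximation; setting exact = FALSE requests the approximation directly:

## Use the normal approximation explicitly (avoids the ties warning)
wilcox.test(len~supp, data = pigs, exact = FALSE)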

A Brief Comment

  • Both of these are suitable for comparing two groups
  • \(t\)-tests assume the random sample is drawn from normally distributed data
    • They are robust to some deviation from normality
    • Data can sometimes be transformed (e.g. sqrt(), log() etc.), as sketched after this list
  • The Wilcoxon Rank Sum Test assumes nothing about the underlying distribution
    • Much less powerful with small sample sizes
    • Highly comparable at n \(\geq\) 30
  • The package coin implements a range of non-parametric tests
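
As a sketch of checking normality before and after transformation (shapiro.test() also appears in the summary later):

## Shapiro-Wilk test for normality, on raw then log-transformed values
shapiro.test(pigs$len)
shapiro.test(log(pigs$len))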

Tests For Categorical Data

\(\chi^2\) Test

  • Here we need counts and categories
  • Commonly used in Observed Vs Expected

\[H_0: \text{No association between groups and outcome}\] \[H_A: \text{Association between groups and outcome}\]

When shouldn’t we use a \(\chi^2\) test?

When any expected cell values are < 5 (Cochran 1954)

\(\chi^2\) Test

pass <- matrix(
  c(25, 8, 6, 15), nrow = 2, 
  dimnames = list(
    c("Attended", "Skipped"), 
    c("Pass", "Fail"))
)
pass
         Pass Fail
Attended   25    6
Skipped     8   15


pass_chisq <- chisq.test(pass)
pass_chisq

    Pearson's Chi-squared test with Yates' continuity correction

data:  pass
X-squared = 9.8359, df = 1, p-value = 0.001711
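
To check Cochran’s rule for this table, the expected counts are stored on the returned object:

## Expected counts under H0; the smallest is ~8.9, so all exceed 5
pass_chisq$expected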

Fisher’s Exact Test

  • \(\chi^2\) tests became popular in the days of printed tables
    • We now have computers
  • Fisher’s Exact Test is preferable in the cases of low cell counts
    • (Or any other time you feel like it…)
  • Same \(H_0\) as the \(\chi^2\) test
  • Uses the hypergeometric distribution
fisher.test(pass)

Summary of Tests

  • t.test(), wilcox.test()
  • chisq.test(), fisher.test()
  • shapiro.test(), bartlett.test(), car::leveneTest()
    • Tests for normality or homogeneity of variance
  • binom.test(), poisson.test()
  • kruskal.test(), ks.test()

htest Objects

  • All of the above tests produce objects of class htest
  • An htest object is really a list
    • Use names() to see what other values are returned
names(pass_chisq)
[1] "statistic" "parameter" "p.value"   "method"    "data.name" "observed" 
[7] "expected"  "residuals" "stdres"   
  • Will vary slightly between tests
  • Can usually extract p-values using test$p.value
pass_chisq$p.value
[1] 0.001711398

htest Objects

## Have a look at the list elements produced by fisher.test
fisher.test(pass) |> names()
[1] "p.value"     "conf.int"    "estimate"    "null.value"  "alternative"
[6] "method"      "data.name"  


## Are these similar to those produced by t.test?
t.test(len~supp, data = pigs) |> names()
 [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
 [6] "null.value"  "stderr"      "alternative" "method"      "data.name"  
  • There is a function print.htest() which organises the printout for us

References

Cochran, William G. 1954. “Some Methods for Strengthening the Common χ² Tests.” Biometrics 10 (4): 417–451.