Basic Statistics in R

RAdelaide 2025

Author

Affiliation

Dr Stevie Pederson

Black Ochre Data Labs
The Kids Research Institute Australia

Published

July 9, 2025

Statistics in R

Introduction

R has its origins as a statistical analysis language (i.e. S)
Purpose of this session is NOT to teach statistical theory
- I am more of a bioinformatician than statistician
- I did tutor stats for 3 years
Perform simple analyses in R
Up to you to know what you’re doing
- Or talk to your usual statisticians & collaborators

Distributions

R comes with nearly every distribution
Standard syntax for accessing each

Distributions

| Distribution | Density | Area Under Curve | Quantile | Random |
|---|---|---|---|---|
| Normal | `dnorm()` | `pnorm()` | `qnorm()` | `rnorm()` |
| T | `dt()` | `pt()` | `qt()` | `rt()` |
| Uniform | `dunif()` | `punif()` | `qunif()` | `runif()` |
| Exponential | `dexp()` | `pexp()` | `qexp()` | `rexp()` |
| $\chi^2$ | `dchisq()` | `pchisq()` | `qchisq()` | `rchisq()` |
| Binomial | `dbinom()` | `pbinom()` | `qbinom()` | `rbinom()` |
| Poisson | `dpois()` | `ppois()` | `qpois()` | `rpois()` |

Distributions

Also Beta, $\Gamma$, Log-Normal, F, Geometric, Cauchy, Hypergeometric etc…

?Distributions
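The four prefixes behave the same way for every distribution listed. A minimal sketch using the standard normal; the particular values passed in (0, 1.96, 0.975, the seed) are arbitrary choices for illustration:

```r
## d: density (height of the curve) at a point
height <- dnorm(0)
## p: area under the curve to the left of a value
area <- pnorm(1.96)
## q: the inverse of p, i.e. the value with a given area to its left
z <- qnorm(0.975)
## r: random draws; set.seed() makes them reproducible
set.seed(2025)
draws <- rnorm(3, mean = 0, sd = 1)
## p and q undo each other
all.equal(pnorm(z), 0.975)
```

Swapping `norm` for `t`, `unif`, `binom` etc. gives the same four operations, with the extra arguments each distribution needs (e.g. `df` for `qt()`).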

Distributions

## dnorm gives the classic bell-curve
tibble(
  x = seq(-4, 4, length.out = 1e3)
) |> 
  ggplot(aes(x, y = dnorm(x))) + 
  geom_line(colour = "red")

## pnorm gives the area under the 
## bell-curve to the left of x (total area is 1)
tibble(
  x = seq(-4, 4, length.out = 1e3)
) |> 
  ggplot(aes(x, y = pnorm(x))) + 
  geom_line()

The T Distribution

A T distribution looks very much like a Standard normal N(0, 1) but has heavier tails
This allows for greater uncertainty in the tails
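The heavier tails can be seen by comparing critical values and tail probabilities. A small sketch; the degrees of freedom chosen here are arbitrary illustrative values:

```r
## Critical values for a two-sided 95% interval
df <- c(3, 10, 30, 100)
t_crit <- qt(0.975, df = df)
z_crit <- qnorm(0.975)
## The t critical values are larger than the normal one,
## and shrink towards it as the degrees of freedom increase
rbind(df = df, t_crit = round(t_crit, 3), z_crit = round(z_crit, 3))
## Probability of a value below -3 is larger under t than under N(0, 1)
tail_t <- pt(-3, df = 5)
tail_z <- pnorm(-3)
c(t = tail_t, normal = tail_z)
```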

Tests For Continuous Data

Data For This Session

We’ll use the pigs dataset from earlier
Start a new session with new script: BasicStatistics.R

library(tidyverse)
library(scales)
library(car)
theme_set(
  theme_bw() + theme(plot.title = element_text(hjust = 0.5))
)
pigs <- file.path("data", "pigs.csv") |>
    read_csv() |>
    mutate(
      dose = fct(dose, levels = c("Low", "Med", "High")),
      supp = fct(supp)
    )

Data For This Session

pigs |> 
  ggplot(
    aes(x = dose, y = len, fill = supp)
  ) +
    geom_boxplot()

Pop Quiz

Can anyone define a p-value?

A p-value is the probability of observing a test statistic at least as extreme as the one observed, assuming the null hypothesis is true.
In plain English: If there’s nothing really going on, how likely are we to observe our result, or one even more extreme?
A p-value of 0.05 $\implies$ if $H_0$ is true, we’d see a result at least this extreme in about 1 in 20 random samples
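To connect the definition to the distribution functions, a two-sided p-value can be computed by hand as the tail area beyond the observed test statistic. This is a sketch on simulated data (not the pigs data); the seed, sample size and mean are arbitrary assumptions:

```r
## Simulated data purely for illustration
set.seed(42)
x <- rnorm(20, mean = 0.5, sd = 1)
## One-sample t statistic for H0: mu = 0
t_stat <- mean(x) / (sd(x) / sqrt(length(x)))
## Two-sided p-value: area in both tails beyond |t_stat|
p_manual <- 2 * pt(-abs(t_stat), df = length(x) - 1)
## This should agree with the value reported by t.test()
p_ttest <- t.test(x)$p.value
all.equal(p_manual, p_ttest)
```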

t-tests

Assumes normally distributed data
$t$-tests always test $H_0$ Vs $H_A$
- When comparing groups, the data must have exactly two groups

The simplest test is on a simple vector
- Not particularly meaningful for our data

?t.test
t.test(some_vector)

What is $H_0$ in the above test?

The true mean of the underlying distribution from which the vector is sampled, is zero: i.e. $\mu = 0$
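The hypothesised mean does not have to be zero: `t.test()` takes a `mu` argument. A brief sketch, where `some_vector` is hypothetical simulated data and the value 10 is an arbitrary choice:

```r
## Hypothetical data for illustration
set.seed(1)
some_vector <- rnorm(15, mean = 10, sd = 2)
## Default: H0 is mu = 0
default_test <- t.test(some_vector)
default_test$null.value
## Testing against a different hypothesised mean
shifted_test <- t.test(some_vector, mu = 10)
shifted_test$null.value
## The p-value depends on which H0 is being tested
c(default = default_test$p.value, shifted = shifted_test$p.value)
```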

t-tests

When comparing the means of two vectors

\[ H_0: \mu_{1} = \mu_{2} \\ H_A: \mu_{1} \neq \mu_{2} \]

We could use two vectors (i.e. x & y)

vc <- dplyr::filter(pigs, supp == "VC")$len
oj <- dplyr::filter(pigs, supp == "OJ")$len
t.test(x = vc, y = oj)

Is This a Paired Test?

t-tests

An alternative is the R formula method: len~supp
- Length is a response variable
- Supplement is the predictor
Can only use one predictor for a T-test $\implies$ otherwise it’s linear regression

t.test(len~supp, data = pigs)

Did this give the same results?
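One way to check is to compare the returned p-values directly. The sketch below uses the built-in `ToothGrowth` data purely as a stand-in, since it also has `len` and `supp` columns; treating it as interchangeable with `pigs.csv` is an assumption:

```r
## ToothGrowth ships with R and is used here only as a stand-in for pigs
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]
oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
by_vector <- t.test(x = vc, y = oj)
by_formula <- t.test(len ~ supp, data = ToothGrowth)
## The p-values agree
all.equal(by_vector$p.value, by_formula$p.value)
## The sign of the t statistic depends on which group comes first:
## x then y for vectors, factor level order for the formula
c(
  vector = unname(by_vector$statistic),
  formula = unname(by_formula$statistic)
)
```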

t-tests

Do we think the variance is equal between the two groups?

pigs |> summarise(sd = sd(len), .by = supp)

# A tibble: 2 × 2
  supp     sd
  <fct> <dbl>
1 VC     8.27
2 OJ     6.61

We can use Levene’s Test to formalise this
- From the package car
- Bartlett’s test is very similar (bartlett.test())

leveneTest(len~supp, data = pigs)
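Bartlett's test needs no extra package and uses the same formula syntax. A minimal sketch, again using the built-in `ToothGrowth` data as an assumed stand-in for `pigs`:

```r
## H0: the variance of len is the same in every supp group
var_test <- bartlett.test(len ~ supp, data = ToothGrowth)
var_test
## A large p-value means no evidence against equal variances
var_test$p.value
```

Bartlett's test is generally considered more sensitive to non-normal data than Levene's test, which is one reason to prefer `leveneTest()` when normality is in doubt.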

t-tests

Now we can assume equal variances
- By default, variances are assumed to be unequal (Welch’s t-test)

t.test(len~supp, data = pigs, var.equal = TRUE)

If relevant, the confidence interval can also be adjusted
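The width of the interval is controlled by the `conf.level` argument of `t.test()`. A short sketch, using the built-in `ToothGrowth` data as an assumed stand-in for `pigs`; the 99% level is an arbitrary example:

```r
## Default 95% confidence interval for the difference in means
ci_95 <- t.test(len ~ supp, data = ToothGrowth, var.equal = TRUE)$conf.int
## A 99% interval instead
ci_99 <- t.test(
  len ~ supp, data = ToothGrowth, var.equal = TRUE, conf.level = 0.99
)$conf.int
ci_95
ci_99
## Higher confidence gives a wider interval
diff(ci_99) > diff(ci_95)
```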

Wilcoxon Tests

We assumed the above dataset was normally distributed:
What if it’s not?

Non-parametric equivalent is the Wilcoxon Rank-Sum Test (aka Mann-Whitney)

This assigns a rank to each value based on its position in the sorted data
- The test is then performed on ranks NOT the values
- Tied values can be problematic
Test that the centre of each underlying distribution is the same

wilcox.test(len~supp, data = pigs)
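The ranking step, and why ties matter, can be illustrated with `rank()`. All values below are made up for illustration:

```r
## Hypothetical values containing a tie
values <- c(10, 20, 20, 30)
## Tied values share the average of the ranks they would occupy
ranks <- rank(values)
ranks
## With ties an exact p-value cannot be computed; exact = FALSE
## requests the normal approximation explicitly
g1 <- c(1.1, 2.3, 2.3, 4.0, 5.2)
g2 <- c(2.3, 6.1, 7.4, 8.0, 9.9)
tied_test <- wilcox.test(g1, g2, exact = FALSE)
tied_test$p.value
```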

A Brief Comment

Both of these are suitable for comparing two groups
T-tests assume Normally Distributed Data underlies the random sample
- Are robust to some deviation from normality
- Data can sometimes be transformed (e.g. sqrt(), log() etc)

The Wilcoxon Rank Sum Test does not assume the underlying distribution is Normal
- Much less powerful with small sample sizes
- Highly comparable at n $\geq$ 30
The package coin implements a range of non-parametric tests

Tests For Categorical Data

$\chi^2$ Test

Here we need counts and categories
Commonly used in Observed Vs Expected

\[H_0: \text{No association between groups and outcome}\] \[H_A: \text{Association between groups and outcome}\]

When shouldn’t we use a $\chi^2$ test?

When expected cell values are < 5 (Cochran 1954)

$\chi^2$ Test

pass <- matrix(
  c(25, 8, 6, 15), nrow = 2, 
  dimnames = list(
    c("Attended", "Skipped"), 
    c("Pass", "Fail"))
)
pass

         Pass Fail
Attended   25    6
Skipped     8   15

pass_chisq <- chisq.test(pass)
pass_chisq


    Pearson's Chi-squared test with Yates' continuity correction

data:  pass
X-squared = 9.8359, df = 1, p-value = 0.001711
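The expected counts under $H_0$ are worth inspecting before trusting the result. A sketch that computes them by hand and compares with what `chisq.test()` stores; the matrix is redefined so the block runs on its own:

```r
pass <- matrix(
  c(25, 8, 6, 15), nrow = 2,
  dimnames = list(c("Attended", "Skipped"), c("Pass", "Fail"))
)
## Expected count = row total * column total / grand total
expected_manual <- outer(rowSums(pass), colSums(pass)) / sum(pass)
expected_manual
## chisq.test() returns the same values in the expected element
expected_r <- chisq.test(pass)$expected
all.equal(expected_manual, expected_r)
## The smallest expected count is the one to check
min(expected_r)
```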

Fisher’s Exact Test

$\chi^2$ tests became popular in the days of the printed tables
- We now have computers
Fisher’s Exact Test is preferable in the cases of low cell counts
- (Or any other time you feel like it…)
Same $H_0$ as the $\chi^2$ test
Uses the hypergeometric distribution

fisher.test(pass)
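The link to the hypergeometric distribution can be made concrete by rebuilding the two-sided p-value with `dhyper()`. This is a sketch of the idea for a 2x2 table, not a description of the internals of `fisher.test()`; the variable names are illustrative:

```r
pass <- matrix(
  c(25, 8, 6, 15), nrow = 2,
  dimnames = list(c("Attended", "Skipped"), c("Pass", "Fail"))
)
m <- sum(pass[, 1])     # total who passed
n <- sum(pass[, 2])     # total who failed
k <- sum(pass[1, ])     # total who attended
observed <- pass[1, 1]  # attended and passed
## All possible values of the top-left cell given the fixed margins
support <- max(0, k - n):min(k, m)
probs <- dhyper(support, m, n, k)
## Two-sided p-value: sum the probabilities of all tables that are
## no more likely than the observed one (small tolerance for rounding)
p_obs <- dhyper(observed, m, n, k)
p_manual <- sum(probs[probs <= p_obs * (1 + 1e-7)])
all.equal(p_manual, fisher.test(pass)$p.value)
```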

Summary of Tests

t.test(), wilcox.test()
chisq.test(), fisher.test()

shapiro.test(), bartlett.test()
car::leveneTest()
- Tests for normality or homogeneity of variance

binom.test(), poisson.test()
kruskal.test(), ks.test()

`htest` Objects

All produce objects of class htest
Is really a list
- Use names() to see what other values are returned

names(pass_chisq)

[1] "statistic" "parameter" "p.value"   "method"    "data.name" "observed" 
[7] "expected"  "residuals" "stdres"

Will vary slightly between tests
Can usually extract p-values using test$p.value

pass_chisq$p.value

[1] 0.001711398
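Because every `htest` object is a list, results from several tests can be collected programmatically. A small sketch; the `tests` list and its names are illustrative, and the matrix is redefined so the block runs on its own:

```r
pass <- matrix(
  c(25, 8, 6, 15), nrow = 2,
  dimnames = list(c("Attended", "Skipped"), c("Pass", "Fail"))
)
tests <- list(chisq = chisq.test(pass), fisher = fisher.test(pass))
## Each htest object has a p.value element
p_values <- vapply(tests, function(x) x$p.value, numeric(1))
p_values
## Other elements exist only for some tests, e.g. the test statistic
tests$chisq$statistic
is.null(tests$fisher$statistic)
```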

`htest` Objects

## Have a look at the list elements produced by fisher.test
fisher.test(pass) |> names()

[1] "p.value"     "conf.int"    "estimate"    "null.value"  "alternative"
[6] "method"      "data.name"

## Are these similar to those produced by t.test?
t.test(len~supp, data = pigs) |> names()

 [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
 [6] "null.value"  "stderr"      "alternative" "method"      "data.name"

There is a function print.htest() which organises the printout for us

We’ll come back to methods later, but this is a common way for output to be produced
There will be a print function for objects of each class, i.e. print.class_of_object
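The naming pattern can be demonstrated with a toy class. Everything here (`toy_result`, its fields, the message format) is hypothetical and exists only to show how the dispatch works:

```r
## A toy S3 object: a list with a class attribute
result <- structure(
  list(estimate = 3.2, label = "toy"),
  class = "toy_result"
)
## Defining print.<class> controls how objects of that class are shown
print.toy_result <- function(x, ...) {
  cat("Toy result:", x$label, "with estimate", x$estimate, "\n")
  invisible(x)
}
## Typing the object name calls print(), which finds print.toy_result()
result
## The underlying list is still there
unclass(result)
```

The same mechanism is what turns the list returned by `t.test()` into the formatted printout seen earlier.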

References

Cochran, William G. 1954. “Some Methods for Strengthening the Common χ² Tests.” Biometrics 10 (4): 417.
