Spreadsheet-Like Data

ASI: Introduction to R

Dr Stevie Pederson

Black Ochre Data Labs
The Kids Research Institute Australia

September 3, 2025

Working With
Spreadsheet-Like Data

Section Setup

  • Clear your R Environment
  • Create a new script called ‘SpreadsheetLikeData.R’
  • Load the tidyverse
library(tidyverse)


We’ll cover:

  • SQL- and Excel-like functions in dplyr
  • Pivot Tables and Manipulation using tidyr
  • Chaining function together using the pipe operator

The Pipe Operator

  • In real-world analyses we often perform multiple steps
  • The classic R approach was to create a new object at each step
    • Leads to congested environments with multiple similar objects
    • e.g. data_raw, data_sorted, data_subset
  • In R v4.1.0 the pipe operator was introduced: |>
  • Takes the output of one function and connects it to the input of another
    • An assembly line \(\implies\) like taking the output of one process and sending straight to the input of another process

The Pipe Operator

  • If R ran an assembly line making toy cars
hot_plastic <- heat(plastic)
step1 <- mould(hot_plastic)
step2 <- paint(step1, "red")
step3 <- attach_wheels(step2)
final_car <- package(step3)
  • Using the |> operator this might become
final_car <- plastic |>
  heat() |>
  mould() |>
  paint("red") |>
  attach_wheels() |>
  package()

The Pipe Operator

head(cars)
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
cars |> head()
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
  • We’ll use this approach for the remainder of the session

Returning To The Penguins

  • We’ll use my_penguins for this session
my_penguins <- read_csv("data/my_penguins.csv")
  • This is a tibble() (i.e. a conveniently wrapped data.frame)
  • We can easily get key attributes about these objects
dim(my_penguins)
nrow(my_penguins)
ncol(my_penguins)
colnames(my_penguins)

Using the Pipe Operator

  • We could also perform these operations the pipe (|>)
my_penguins |> colnames()
[1] "species"           "island"            "bill_length_mm"   
[4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
[7] "sex"               "year"             
  • This approach combines nicely with stringr
my_penguins |>
  colnames() |>
  str_subset(pattern = "bill")
[1] "bill_length_mm" "bill_depth_mm" 

Subsetting a Data Frame

  • We can subset a tibble using [,] just like a data.frame
my_penguins[1:3, c("species", "island", "body_mass_g")]
# A tibble: 3 × 3
  species island    body_mass_g
  <chr>   <chr>           <dbl>
1 Adelie  Torgersen        3750
2 Adelie  Torgersen        3800
3 Adelie  Torgersen        3250

Subsetting a Data Frame

  • dplyr provides two separate functions for this
  • Designed for integration with the pipe
  • select() allows for selecting columns by position or name
  • slice() allows for cutting out rows by position only
    • filter() (coming soon) allows for returning rows by a key value
## Grab the first 3 rows from the species, island and body_mass_g columns
my_penguins |>
  select(species, island, body_mass_g) |>
  slice(1:3)
# A tibble: 3 × 3
  species island    body_mass_g
  <chr>   <chr>           <dbl>
1 Adelie  Torgersen        3750
2 Adelie  Torgersen        3800
3 Adelie  Torgersen        3250

Subsetting a Data Frame

  • An advantage of this approach is the use of helpers
    • starts_with(), ends_with()
    • contains(), matches()
    • all_of(), any_of()
    • everything()
    • where()
## Just return the columns species and island, along with thos measured in mm
my_penguins |>
  select(species, island, ends_with("mm"))

Subsetting a Data Frame

## Just return the columns species & island along with bill measurements
my_penguins |>
  select(species, island, contains("bill"))


## Return the columns which are character vectors then place all other columns 
## after them. `where()` applies the logical test `is.character()` to each column
my_penguins |>
  select(where(is.character), everything())

Subsetting a Data Frame

  • Column names above haven’t been quoted
  • any_of() and all_of() take character vectors
## `all_of` requires all columns to exist
my_penguins |>
  select(all_of(c("species", "island", "sex")))


## `any_of` will ignore any columns that don't exist
my_penguins |>
  select(any_of(c("species", "island", "sex", "nonsense")))

Excel-Like Functions in dplyr

  • Two common tasks in Excel are 1) sorting a column and 2) applying a filter
  • In dplyr the relevant functons are arrange() and filter()
## Show the penguins in order of weight. This will be ascending by default
my_penguins |>
  arrange(body_mass_g)
# A tibble: 333 × 8
   species   island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <chr>     <chr>             <dbl>         <dbl>             <dbl>       <dbl>
 1 Chinstrap Dream              46.9          16.6               192        2700
 2 Adelie    Biscoe             36.5          16.6               181        2850
 3 Adelie    Biscoe             36.4          17.1               184        2850
 4 Adelie    Biscoe             34.5          18.1               187        2900
 5 Adelie    Dream              33.1          16.1               178        2900
 6 Adelie    Torgers…           38.6          17                 188        2900
 7 Chinstrap Dream              43.2          16.6               187        2900
 8 Adelie    Biscoe             37.9          18.6               193        2925
 9 Adelie    Dream              37            16.9               185        3000
10 Adelie    Dream              37.3          16.8               192        3000
# ℹ 323 more rows
# ℹ 2 more variables: sex <chr>, year <dbl>

Excel-Like Functions in dplyr

  • To sort in descending order, we wrap the column name inside desc()
## Show the penguins in descending order of weight
my_penguins |>
  arrange(desc(body_mass_g))
# A tibble: 333 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl>
 1 Gentoo  Biscoe           49.2          15.2               221        6300
 2 Gentoo  Biscoe           59.6          17                 230        6050
 3 Gentoo  Biscoe           51.1          16.3               220        6000
 4 Gentoo  Biscoe           48.8          16.2               222        6000
 5 Gentoo  Biscoe           45.2          16.4               223        5950
 6 Gentoo  Biscoe           49.8          15.9               229        5950
 7 Gentoo  Biscoe           48.4          14.6               213        5850
 8 Gentoo  Biscoe           49.3          15.7               217        5850
 9 Gentoo  Biscoe           55.1          16                 230        5850
10 Gentoo  Biscoe           49.5          16.2               229        5800
# ℹ 323 more rows
# ℹ 2 more variables: sex <chr>, year <dbl>

Applying a Filter

  • We can subset using filter()
## Show only the penguins heavier than 4kg
my_penguins |>
  filter(body_mass_g > 4000)
# A tibble: 167 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
 1 Adelie  Torgersen           39.2          19.6               195        4675
 2 Adelie  Torgersen           34.6          21.1               198        4400
 3 Adelie  Torgersen           42.5          20.7               197        4500
 4 Adelie  Torgersen           46            21.5               194        4200
 5 Adelie  Dream               39.2          21.1               196        4150
 6 Adelie  Dream               39.8          19.1               184        4650
 7 Adelie  Dream               44.1          19.7               196        4400
 8 Adelie  Dream               39.6          18.8               190        4600
 9 Adelie  Dream               42.3          21.2               191        4150
10 Adelie  Biscoe              40.1          18.9               188        4300
# ℹ 157 more rows
# ℹ 2 more variables: sex <chr>, year <dbl>

Applying a Filter

  • filter() relies on logical tests being performed on the column of interest
Symbol Meaning
== Is equal to
!= Is NOT equal to
>= / <= Is greater/less than or equal to
> / < Is greater/less than
%in% Is IN a set of values
is.na() Is a missing value (i.e. NA)
  • In most languages, ! means NOT

Applying a Filter

## Show only the Chinstrap penguins heavier than 4kg
my_penguins |>
  filter(species == "Chinstrap", body_mass_g > 4000)
# A tibble: 15 × 8
   species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <chr>     <chr>           <dbl>         <dbl>             <dbl>       <dbl>
 1 Chinstrap Dream            46            18.9               195        4150
 2 Chinstrap Dream            52            18.1               201        4050
 3 Chinstrap Dream            50.5          19.6               201        4050
 4 Chinstrap Dream            49.2          18.2               195        4400
 5 Chinstrap Dream            52            19                 197        4150
 6 Chinstrap Dream            52.8          20                 205        4550
 7 Chinstrap Dream            54.2          20.8               201        4300
 8 Chinstrap Dream            51            18.8               203        4100
 9 Chinstrap Dream            52            20.7               210        4800
10 Chinstrap Dream            53.5          19.9               205        4500
11 Chinstrap Dream            50.8          18.5               201        4450
12 Chinstrap Dream            49            19.6               212        4300
13 Chinstrap Dream            50.7          19.7               203        4050
14 Chinstrap Dream            49.3          19.9               203        4050
15 Chinstrap Dream            50.8          19                 210        4100
# ℹ 2 more variables: sex <chr>, year <dbl>

Applying a Filter

## Show the Adelie or Chinstrap penguins
my_penguins |>
  filter(species %in% c("Adelie", "Chinstrap"))
# A tibble: 214 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           36.7          19.3               193        3450
 5 Adelie  Torgersen           39.3          20.6               190        3650
 6 Adelie  Torgersen           38.9          17.8               181        3625
 7 Adelie  Torgersen           39.2          19.6               195        4675
 8 Adelie  Torgersen           41.1          17.6               182        3200
 9 Adelie  Torgersen           38.6          21.2               191        3800
10 Adelie  Torgersen           34.6          21.1               198        4400
# ℹ 204 more rows
# ℹ 2 more variables: sex <chr>, year <dbl>

Applying a Filter

## Alternatively, exclude the Gentoo penguins. This does the same as previously
my_penguins |>
  filter(species != "Gentoo")
# A tibble: 214 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           36.7          19.3               193        3450
 5 Adelie  Torgersen           39.3          20.6               190        3650
 6 Adelie  Torgersen           38.9          17.8               181        3625
 7 Adelie  Torgersen           39.2          19.6               195        4675
 8 Adelie  Torgersen           41.1          17.6               182        3200
 9 Adelie  Torgersen           38.6          21.2               191        3800
10 Adelie  Torgersen           34.6          21.1               198        4400
# ℹ 204 more rows
# ℹ 2 more variables: sex <chr>, year <dbl>

Applying a Filter

  • str_detect() can also play a role here
## Match the patterns Ade or Chin to show only the Adelie or Chinstrap penguins
my_penguins |>
  filter(str_detect(species, "Ade|Chin"))
# A tibble: 214 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           36.7          19.3               193        3450
 5 Adelie  Torgersen           39.3          20.6               190        3650
 6 Adelie  Torgersen           38.9          17.8               181        3625
 7 Adelie  Torgersen           39.2          19.6               195        4675
 8 Adelie  Torgersen           41.1          17.6               182        3200
 9 Adelie  Torgersen           38.6          21.2               191        3800
10 Adelie  Torgersen           34.6          21.1               198        4400
# ℹ 204 more rows
# ℹ 2 more variables: sex <chr>, year <dbl>

Combining Functions

## Match the patterns Ade or Chin to show only the Adelie or Chinstrap penguins
## Show the results in decreasing order of weight
my_penguins |>
  filter(str_detect(species, "Ade|Chin")) |>
  arrange(desc(body_mass_g))
# A tibble: 214 × 8
   species   island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <chr>     <chr>             <dbl>         <dbl>             <dbl>       <dbl>
 1 Chinstrap Dream              52            20.7               210        4800
 2 Adelie    Biscoe             43.2          19                 197        4775
 3 Adelie    Biscoe             41            20                 203        4725
 4 Adelie    Torgers…           42.9          17.6               196        4700
 5 Adelie    Torgers…           39.2          19.6               195        4675
 6 Adelie    Dream              39.8          19.1               184        4650
 7 Adelie    Dream              39.6          18.8               190        4600
 8 Adelie    Biscoe             45.6          20.3               191        4600
 9 Chinstrap Dream              52.8          20                 205        4550
10 Adelie    Torgers…           42.5          20.7               197        4500
# ℹ 204 more rows
# ℹ 2 more variables: sex <chr>, year <dbl>

Summarising Data

  • Creating tables of counts is simple \(\implies\) just pass the column names to count()
my_penguins |>
  count(species, island)
# A tibble: 5 × 3
  species   island        n
  <chr>     <chr>     <int>
1 Adelie    Biscoe       44
2 Adelie    Dream        55
3 Adelie    Torgersen    47
4 Chinstrap Dream        68
5 Gentoo    Biscoe      119

Creating Summary Plots

## Create a summary on the fly, followed 
## by a very barplot showing how penguins
## are distributed on the three islands
my_penguins |>
  count(species, island) |>
  ggplot(aes(species, n, fill = island)) +
  geom_col() +
  scale_fill_brewer(palette = "Set1") +
  theme_bw()

Summarising Data

  • Sometimes we need more sophisticated summaries or summary values
my_penguins |>
  summarise(
    mn_weight = mean(body_mass_g),
    sd_weight = sd(body_mass_g)
  )
  • To show these by any category we can use the .by = argument
## Find the mean & sd of body mass by species
my_penguins |>
  summarise(
    mn_weight = mean(body_mass_g),
    sd_weight = sd(body_mass_g),
    .by = species
  )

Summarising Data

## Create a barplot with error bars
my_penguins |>
  summarise(
    mn_weight = mean(body_mass_g),
    sd_weight = sd(body_mass_g),
    .by = species
  ) |>
  ggplot(
    aes(species, mn_weight, fill = species)
  ) +
  geom_col() +
  geom_errorbar(
    aes(
      ymin = mn_weight - sd_weight,
      ymax = mn_weight + sd_weight
    ),
    width = 0.3
  ) +
  scale_fill_brewer(palette = "Set1") +
  theme_bw()

Modifying Columns

  • So far we’ve only selected, arranged, filtered and summarised existing columns
  • mutate() can be used to:
    1. Modify existing columns, or
    2. Add a new column
  • Do we think any of these columns are categorical
    • We could turn these into factors!!!

Modifying Columns

my_penguins |>
  mutate(
    species = fct(species),
    island = fct(island),
    ## The default order for `fct()` will be in order of appearance
    ## Set levels manually to place female first
    sex = fct(sex, levels = c("female", "male")),
    ## `fct()` requires a character vector
    ## We can also pipe internally!!!
    year = year |> as.character() |> fct()
  ) 

Have we changed the original data?

No, we’ve just changed it on the fly as we inspect it

Modifying Columns

  • I will usually modify columns like this when I’m loading the data
  • This ensures the entire workflow uses the identical dataset from the beginning
my_penguins <- read_csv("data/my_penguins.csv") |>
  mutate(
    species = fct(species),
    island = fct(island),
    ## The default order for `fct()` will be in order of appearance
    ## Set levels manually to place female first
    sex = fct(sex, levels = c("female", "male")),
    ## `fct()` requires a character vector
    ## We can also pipe internally!!!
    year = year |> as.character() |> fct()
  )

Reshaping Data with tidyr

Using tidyr

  • The data structures so far have been great for plotting
  • Generally we’ve seen data in long form
    • Each value type has it’s own column
  • Summary tables can be a little different \(\implies\) tidyr
    • Pivoting Data with pivot_longer() and pivot_wider()
    • Merging/Separating Columns with unite() and separate()
  • Earlier we found the mean & sd of the weight by species and island
    • Let’s include sex then show results in seprate columns

Changing From Long To Wide Form

## Calculate the mean weight by species, island and sex
my_penguins |>
  summarise(
    mn_weight = mean(body_mass_g),
    .by = c(species, island, sex)
  )


## Calculate the mean weight by species, island and sex
## Then place values for male & female in separate columns
my_penguins |>
  summarise(
    mn_weight = mean(body_mass_g),
    .by = c(species, island, sex)
  ) |>
  pivot_wider(names_from = sex, values_from = mn_weight)
  • This is a great summary table already \(\implies\) how do we include SD?

Changing From Long To Wide Form

  • Pivot wider can move multiple columns into wide format
  • The tidyselect helper ends_with() we saw earlier will work here too
my_penguins |>
  summarise(
    mn_weight = mean(body_mass_g),
    ## This is a fitted value so standard error may be more appropriate
    se_weight = sd(body_mass_g) / sqrt(n()),
    .by = c(species, island, sex)
  ) |>
  ## Use `ends_with("weight")` to pivot the appropriate columns
  pivot_wider(names_from = sex, values_from = ends_with("weight"))

Combining Columns

  • Now we have the mean & standard error
  • How do we combine these for a nice printed table \(\implies\) unite()
  • The first argument is the name of the new column
  • Then use tidyselect helpers to select the columns to “unite”
my_penguins |>
  summarise(
    mn_weight = mean(body_mass_g),
    se_weight = sd(body_mass_g) / sqrt(n()),
    .by = c(species, island, sex)
  ) |>
  pivot_wider(names_from = sex, values_from = ends_with("weight")) |>
  ## Include the underscore for the male columns, but not for the female columns
  unite(male, ends_with("_male"), sep = " +/- ") |>
  unite(female, ends_with("female"), sep = " +/- ") 

Are we happy with this? No. We need to round the values

Applying Functions To Multiple Columns

  • Sometimes we need to apply the same function to multiple columns
  • The function across() is perfect for this situation
    • Can be applied inside mutate() or summarise()
  • First we specify the columns using the tidyselect helpers
  • Then we provide the function to apply
  • across(.cols, .fns)

Applying Functions To Multiple Columns

  • Here we’ll call across(ends_with("male"), round) within a mutate()
my_penguins |>
  summarise(
    mn_weight = mean(body_mass_g),
    ## This is a fitted value so standard error may be more appropriate
    se_weight = sd(body_mass_g) / sqrt(n()),
    .by = c(species, island, sex)
  ) |>
  ## Use `ends_with("weight")` to pivot the appropriate columns
  pivot_wider(names_from = sex, values_from = ends_with("weight")) |>
  ## Apply the function `round()` to all four columns ending with male
  ## This will default to zero digits, which is fine for this table
  mutate(across(ends_with("male"), round)) |>
  unite(male, ends_with("_male"), sep = " +/- ") |>
  unite(female, ends_with("female"), sep = " +/- ") 

Shifting From Wide to Long Form

  • The same principles apply when using pivot_longer()
    • Won’t explore in as much detail
## Place all measurements taken in 'mm' into the same column
my_penguins |>
  pivot_longer(ends_with("mm"), names_to = "measurement", values_to = "mm") 
# A tibble: 999 × 7
   species island    body_mass_g sex    year  measurement          mm
   <fct>   <fct>           <dbl> <fct>  <fct> <chr>             <dbl>
 1 Adelie  Torgersen        3750 male   2007  bill_length_mm     39.1
 2 Adelie  Torgersen        3750 male   2007  bill_depth_mm      18.7
 3 Adelie  Torgersen        3750 male   2007  flipper_length_mm 181  
 4 Adelie  Torgersen        3800 female 2007  bill_length_mm     39.5
 5 Adelie  Torgersen        3800 female 2007  bill_depth_mm      17.4
 6 Adelie  Torgersen        3800 female 2007  flipper_length_mm 186  
 7 Adelie  Torgersen        3250 female 2007  bill_length_mm     40.3
 8 Adelie  Torgersen        3250 female 2007  bill_depth_mm      18  
 9 Adelie  Torgersen        3250 female 2007  flipper_length_mm 195  
10 Adelie  Torgersen        3450 female 2007  bill_length_mm     36.7
# ℹ 989 more rows

Separating Values Into Multiple Columns

## Place all measurements taken in 'mm' into the same column
my_penguins |>
  pivot_longer(ends_with("mm"), names_to = "measurement", values_to = "mm") |>
  ## Now separate the values into body part & measurement type
  separate(measurement, c("body_part", "measurement_type", "junk"), sep = "_") |>
  ## We can get rid of the junk column by using the `-` symbol before the column name
  dplyr::select(-junk)
# A tibble: 999 × 8
   species island    body_mass_g sex    year  body_part measurement_type    mm
   <fct>   <fct>           <dbl> <fct>  <fct> <chr>     <chr>            <dbl>
 1 Adelie  Torgersen        3750 male   2007  bill      length            39.1
 2 Adelie  Torgersen        3750 male   2007  bill      depth             18.7
 3 Adelie  Torgersen        3750 male   2007  flipper   length           181  
 4 Adelie  Torgersen        3800 female 2007  bill      length            39.5
 5 Adelie  Torgersen        3800 female 2007  bill      depth             17.4
 6 Adelie  Torgersen        3800 female 2007  flipper   length           186  
 7 Adelie  Torgersen        3250 female 2007  bill      length            40.3
 8 Adelie  Torgersen        3250 female 2007  bill      depth             18  
 9 Adelie  Torgersen        3250 female 2007  flipper   length           195  
10 Adelie  Torgersen        3450 female 2007  bill      length            36.7
# ℹ 989 more rows

Closing Comments

The Palmer Penguins

  • The data we’ve been working with comes from the package palmerpenguins
    • https://allisonhorst.github.io/palmerpenguins/index.html
  • my_penguins had all NA values removed for convenience
library(palmerpenguins)
penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Function Conflicts

  • Functions in dplyr are often based on SQL functions
    • dplyr was first released in 2014 1
  • Some other packages already contained a select(), filter() etc
    • Especially those with a SQL back-end

Function Conflicts

  • If a dplyr function gives a strange error
    \(\implies\) probably loaded another package with a function of the same name
  • Common cases are select(), filter(), slice(), count(), n()
  • The solution is to call explicitly from dplyr using the namespace
    • e.g. dplyr::select(), dplyr::filter(), dplyr::count() etc
    • The namespace is the formal title for all function names within a package
  • The package conflicted also enables setting of which function to use