Reshaping And Tidying Data

RAdelaide 2025

Dr Stevie Pederson

Black Ochre Data Labs
The Kids Research Institute Australia

July 8, 2025

Manipulating Tibbles with tidyr

Motivation

  • In Excel many people use pivot tables
  • Sometimes we need similar operations in R
  • Data may need to be in different ‘shapes’ for plotting/tables/analysis
  • Often we refer to data as being in long form or in wide form
  • In wide form our response variable may be spread across multiple columns
    • Often multiple measures for one individual
  • In long form, the response variable is in a single column

The package tidyr

  • We’ll look at some key functions from tidyr
    • pivot_longer() and pivot_wider()
    • separate() and unite()
    • Many more beyond the scope of today
  • Add library(tidyr) to the beginning of the script
  • penguins has each measurement in long form
    \(\implies\) has multiple columns of measurement

From Long To Wide

  • We’ll first create some summaries by sex
penguins |>
  filter(
    !is.na(sex) # Exclude those where sex is missing
  ) |> 
  count(species, sex)

From Long To Wide

  • Let’s try and create separate columns for male & female penguins
  • This is where pivot_wider() can be helpful
penguins |>
  filter(
    !is.na(sex) # Exclude those where sex is missing
  ) |> 
  count(species, sex) |>
  pivot_wider(
    # Specify the column we wish to take the new names from
    # And the column with the values
    names_from = sex, values_from = n
  )

From Long To Wide

penguins |>
  filter(
    !is.na(sex) # Exclude those where sex is missing
  ) |> 
  count(species, year, sex) |>
  pivot_wider(
    # Multiple columns can also be chosen
    # The default separator will be "_"
    names_from = c(sex, year), values_from = n
  )

From Wide To Long

  • The reverse operation is pivot_longer()
  • This time we move values from multiple columns into a single column
    • Notice that the new column names need to be quoted here
penguins |>
  # Place all the length and depth measurements into a single column
  pivot_longer(
    cols = ends_with("mm"),
    names_to = "measurement",
    values_to = "mm"
  )

From Wide To Long

  • Again, summarisation shows this to be very helpful
penguins |>
  pivot_longer(
    cols = ends_with("mm"),
    names_to = "measurement",
    values_to = "mm"
  ) |>
  filter(!is.na(mm)) |>
  summarise(
    # Find the mean of each measurement type by year
    mean_mm = mean(mm), .by = c(year, measurement)
  ) 

Using Both Together

penguins |>
  pivot_longer(
    cols = ends_with("mm"),
    names_to = "measurement",
    values_to = "mm"
  ) |>
  filter(!is.na(mm)) |>
  summarise(
    mean_mm = mean(mm), .by = c(year, measurement)
  ) |>
  # Now form separate columns for each year
  pivot_wider(names_from = year, values_from = mean_mm)

Combining Columns

## We can combine multiple columns using unite
## By default, the separator will be '_'
penguins |>
  unite(new_column, species, sex)


## Original columns can be left if preferred
## An alternative separator can also be provided
penguins |>
  unite(new_column, species, sex, remove = FALSE, sep = ":")

Splitting Columns

  • The reverse operation is separate()
  • We’ll create our own tibble on the fly for this example
tibble(merged = c("A.1", "B.1", "C.2")) 
# A tibble: 3 × 1
  merged
  <chr> 
1 A.1   
2 B.1   
3 C.2   

Splitting Columns

  • We can now use the dot to separate
  • Number of expected columns is set by the into argument
tibble(merged = c("A.1", "B.1", "C.2"))  |>
  separate(merged, into = c("letter", "number"))
# A tibble: 3 × 2
  letter number
  <chr>  <chr> 
1 A      1     
2 B      1     
3 C      2     

Splitting Columns

Setting The Wrong Number of Columns

  • Try adding an additional expected column called “none”
  • This is a common error you may see
  • Also try setting “letter” as the only column

Splitting Columns

  • Check the help page ?separate
  • sep = "[^[:alnum:]]+" is a regular expression
  • Common syntax for working with flexible text strings
  • Easy to write \(\rightarrow\) horrible to read back
    • Explore tomorrow \(\implies\) essential for handling text/categorical data

Summary

  • Covered quite some turf
  • Switch from wide to long
  • Splitting/Merging Columns
  • Selecting/Renaming Columns
  • Adding Columns
  • Summarising Columns
  • Filtering/Slicing Rows

Summary

  • All important preparatory skills for:
    • Cleaning your data
    • Summary tables
    • Plotting Data
    • Running Statistical Tests
  • Still need:
    • Basic R Data Structures
    • Text manipulation (a key skill)

Challenges

Initial Challenges

  1. Count the number of penguins from each species on each island for each year. Try and make it look like the output below
    • Hint: use values_fill = 0 to get rid of missing values in the output
# A tibble: 9 × 5
  island     year Adelie Chinstrap Gentoo
  <fct>     <int>  <int>     <int>  <int>
1 Biscoe     2007     10         0     34
2 Biscoe     2008     18         0     46
3 Biscoe     2009     16         0     44
4 Dream      2007     20        26      0
5 Dream      2008     16        18      0
6 Dream      2009     20        24      0
7 Torgersen  2007     20         0      0
8 Torgersen  2008     16         0      0
9 Torgersen  2009     16         0      0
  1. Add a column called total which adds the values for all species columns
  2. Subset the table to only show values from 2009
  3. Combine the island and year to be a column in the form island:year

More Challenges

  • Create the following table showing the mean bill_length_mm
    • Decide how to handle missing values
# A tibble: 5 × 5
  species   island    `2007` `2008` `2009`
  <fct>     <fct>      <dbl>  <dbl>  <dbl>
1 Adelie    Torgersen   38.8   38.8   39.3
2 Adelie    Biscoe      38.3   38.7   39.7
3 Adelie    Dream       39.1   38.2   38.2
4 Gentoo    Biscoe      47.0   46.9   48.5
5 Chinstrap Dream       48.7   48.7   49.1
  • Add a column called overall_mean which averages the values for all years
    • Hint: The mean is the sum divided by the number of values