Reshaping And Tidying Data

RAdelaide 2025

Dr Stevie Pederson

Black Ochre Data Labs
The Kids Research Institute Australia

July 8, 2025

Manipulating Tibbles with `tidyr`

Motivation

In Excel many people use pivot tables
Sometimes we need similar operations in R
Data may need to be in different ‘shapes’ for plotting/tables/analysis

Often we refer to data as being in long form or in wide form
In wide form our response variable may be spread across multiple columns
- Often multiple measures for one individual
In long form, the response variable is in a single column

The package `tidyr`

We’ll look at some key functions from tidyr
- pivot_longer() and pivot_wider()
- separate() and unite()
- Many more beyond the scope of today

Add library(tidyr) to the beginning of the script
penguins has each measurement in long form
\(\implies\) has multiple columns of measurement

From Long To Wide

We’ll first create some summaries by sex

penguins |>
  filter(
    !is.na(sex) # Exclude those where sex is missing
  ) |> 
  count(species, sex)

From Long To Wide

Let’s try and create separate columns for male & female penguins
This is where pivot_wider() can be helpful

penguins |>
  filter(
    !is.na(sex) # Exclude those where sex is missing
  ) |> 
  count(species, sex) |>
  pivot_wider(
    # Specify the column we wish to take the new names from
    # And the column with the values
    names_from = sex, values_from = n
  )

From Long To Wide

penguins |>
  filter(
    !is.na(sex) # Exclude those where sex is missing
  ) |> 
  count(species, year, sex) |>
  pivot_wider(
    # Multiple columns can also be chosen
    # The default separator will be "_"
    names_from = c(sex, year), values_from = n
  )

From Wide To Long

The reverse operation is pivot_longer()
This time we move values from multiple columns into a single column
- Notice that the new column names need to be quoted here

penguins |>
  # Place all the length and depth measurements into a single column
  pivot_longer(
    cols = ends_with("mm"),
    names_to = "measurement",
    values_to = "mm"
  )

From Wide To Long

Again, summarisation shows this to be very helpful

penguins |>
  pivot_longer(
    cols = ends_with("mm"),
    names_to = "measurement",
    values_to = "mm"
  ) |>
  filter(!is.na(mm)) |>
  summarise(
    # Find the mean of each measurement type by year
    mean_mm = mean(mm), .by = c(year, measurement)
  )

Using Both Together

penguins |>
  pivot_longer(
    cols = ends_with("mm"),
    names_to = "measurement",
    values_to = "mm"
  ) |>
  filter(!is.na(mm)) |>
  summarise(
    mean_mm = mean(mm), .by = c(year, measurement)
  ) |>
  # Now form separate columns for each year
  pivot_wider(names_from = year, values_from = mean_mm)

Combining Columns

## We can combine multiple columns using unite
## By default, the separator will be '_'
penguins |>
  unite(new_column, species, sex)

## Original columns can be left if preferred
## An alternative separator can also be provided
penguins |>
  unite(new_column, species, sex, remove = FALSE, sep = ":")

Splitting Columns

The reverse operation is separate()
We’ll create our own tibble on the fly for this example

tibble(merged = c("A.1", "B.1", "C.2"))

# A tibble: 3 × 1
  merged
  <chr> 
1 A.1   
2 B.1   
3 C.2

Splitting Columns

We can now use the dot to separate
Number of expected columns is set by the into argument

tibble(merged = c("A.1", "B.1", "C.2"))  |>
  separate(merged, into = c("letter", "number"))

# A tibble: 3 × 2
  letter number
  <chr>  <chr> 
1 A      1     
2 B      1     
3 C      2

Splitting Columns

Setting The Wrong Number of Columns

Try adding an additional expected column called “none”
This is a common error you may see
Also try setting “letter” as the only column

Splitting Columns

Check the help page ?separate

sep = "[^[:alnum:]]+" is a regular expression
Common syntax for working with flexible text strings
Easy to write \(\rightarrow\) horrible to read back
- Explore tomorrow \(\implies\) essential for handling text/categorical data

Summary

Covered quite some turf
Switch from wide to long
Splitting/Merging Columns
Selecting/Renaming Columns
Adding Columns
Summarising Columns
Filtering/Slicing Rows

Summary

All important preparatory skills for:
- Cleaning your data
- Summary tables
- Plotting Data
- Running Statistical Tests

Still need:
- Basic R Data Structures
- Text manipulation (a key skill)

Challenges

Initial Challenges

Count the number of penguins from each species on each island for each year. Try and make it look like the output below
- Hint: use values_fill = 0 to get rid of missing values in the output

# A tibble: 9 × 5
  island     year Adelie Chinstrap Gentoo
  <fct>     <int>  <int>     <int>  <int>
1 Biscoe     2007     10         0     34
2 Biscoe     2008     18         0     46
3 Biscoe     2009     16         0     44
4 Dream      2007     20        26      0
5 Dream      2008     16        18      0
6 Dream      2009     20        24      0
7 Torgersen  2007     20         0      0
8 Torgersen  2008     16         0      0
9 Torgersen  2009     16         0      0

Add a column called total which adds the values for all species columns
Subset the table to only show values from 2009
Combine the island and year to be a column in the form island:year

More Challenges

Create the following table showing the mean bill_length_mm
- Decide how to handle missing values

# A tibble: 5 × 5
  species   island    `2007` `2008` `2009`
  <fct>     <fct>      <dbl>  <dbl>  <dbl>
1 Adelie    Torgersen   38.8   38.8   39.3
2 Adelie    Biscoe      38.3   38.7   39.7
3 Adelie    Dream       39.1   38.2   38.2
4 Gentoo    Biscoe      47.0   46.9   48.5
5 Chinstrap Dream       48.7   48.7   49.1

Add a column called overall_mean which averages the values for all years
- Hint: The mean is the sum divided by the number of values

Reshaping And Tidying Data

Manipulating Tibbles with tidyr

Motivation

The package tidyr

From Long To Wide

From Long To Wide

From Long To Wide

From Wide To Long

From Wide To Long

Using Both Together

Combining Columns

Splitting Columns

Splitting Columns

Splitting Columns

Splitting Columns

Summary

Summary

Challenges

Initial Challenges

More Challenges

Manipulating Tibbles with `tidyr`

The package `tidyr`