Extending Data Frame Manipulation

RAdelaide 2024

Author
Affiliation

Dr Stevie Pederson

Black Ochre Data Labs
Telethon Kids Institute

Published

July 9, 2024

Using tidyr

The package tidyr

  • Is the next logical step from dplyr
  • Instead of working with data within columns
    \(\implies\) manipulate entire columns
  • Analogous to some pivot table operations in Excel
  • Also a core tidyverse package

. . .

  • Start a new script: TidyR.R

A New Dataset

  • For this we’ll use the PCR data we glimpsed earlier 1
library(tidyverse)
library(readxl)
pcr <- read_excel("data/RealTimeData.xlsx") %>% 
  rename(sample = `...1`)
  • rename is another dplyr function
  • The default empty column name ..1 is a bit weird
    • Starting ... makes it a bit special
    • Needs to be enclosed within backticks

A New Dataset

  • These values represent Ct values from a RT-PCR analysis
    • Ct is the PCR cycle at which the machine detects ‘signal’
    • Low numbers mean more of the RNA molecule
  • Two genes: SLAM & GAPDH
    • GAPDH is a housekeeper gene
    • Should be consistent across treatments, conditions etc

A New Dataset

  • The sample column also contains two pieces of information
    • C/INF probably mean control or infected cells
    • Timepoints are 24h, 72h or 120h
  • Looks like some data entry errors

. . .

  • How to separate the cell group (C/INF) from the timepoint?
  • If wanting to make boxplots, how would we fill by gene
    • Should have a column with gene names
    • Another with Ct values

Cleaning Data Entry Errors

  • There are multiple ways & steps to clean-up the sample column
    • Need to add spaces where they are missing
    • Place the hour annotations in a consistent format

. . .

  • My strategy would be to use str_replace_all()
    1. Add the missing spaces
    2. Change the hour format
  • We could use two captures for this

Combining Patterns

  • The key will be what we place between the captures
  • The pattern I chose was " *" \(\implies\) match zero or more spaces
pcr %>% 
  mutate(
    sample = str_replace_all(
      sample, "^([CINF]+) *([0-9]+)[Hh]", "\\1 \\2hr"
    )
  )
# A tibble: 18 × 3
   sample     SLAM GAPDH
   <chr>     <dbl> <dbl>
 1 C 24hr     23.9  17.7
 2 C 24hr     23.5  17.4
 3 C 24hr     23.7  17.8
 4 INF 24hr   23.4  15.7
 5 INF 24hr   23.3  15.4
 6 INF 24hr   23.4  15.8
 7 C 72hr     24.7  18.4
 8 C 72hr     24.3  18.6
 9 C 72hr     24.5  18.7
10 INF 72hr   23.4  15.9
11 INF 72hr   23.3  15.8
12 INF 72hr   23.2  15.4
13 C 120hr    24.4  18.5
14 C 120hr    24.7  18.2
15 C 120hr    25.1  18.3
16 INF 120hr  22.8  16.7
17 INF 120hr  22.8  17.0
18 INF 120hr  22.5  16.5
  • Notice we didn’t capture the optional space

Separating Columns

  • Now we’ve cleaned up the data-entry erors
    \(\implies\) use separate() to split into two columns
    • Will automatically separate using any non-alpha-numeric characters
pcr %>% 
  mutate(
    sample = str_replace_all(
      sample, "^([CINF]+) *([0-9]+)[Hh]", "\\1 \\2hr"
    )
  ) %>% 
  separate(sample, into = c("treatment", "timepoint"))
# A tibble: 18 × 4
   treatment timepoint  SLAM GAPDH
   <chr>     <chr>     <dbl> <dbl>
 1 C         24hr       23.9  17.7
 2 C         24hr       23.5  17.4
 3 C         24hr       23.7  17.8
 4 INF       24hr       23.4  15.7
 5 INF       24hr       23.3  15.4
 6 INF       24hr       23.4  15.8
 7 C         72hr       24.7  18.4
 8 C         72hr       24.3  18.6
 9 C         72hr       24.5  18.7
10 INF       72hr       23.4  15.9
11 INF       72hr       23.3  15.8
12 INF       72hr       23.2  15.4
13 C         120hr      24.4  18.5
14 C         120hr      24.7  18.2
15 C         120hr      25.1  18.3
16 INF       120hr      22.8  16.7
17 INF       120hr      22.8  17.0
18 INF       120hr      22.5  16.5

Notice that we used a character vector to provide the new colnames

Separating Columns

  • These look like categorical values \(\implies\) factor()
pcr %>% 
  mutate(
    sample = str_replace_all(sample, "^([CINF]+) *([0-9]+)[Hh]", "\\1 \\2hr")
  ) %>% 
  separate(sample, into = c("treatment", "timepoint")) %>% 
  mutate(
    timepoint = factor(timepoint, levels = c("24hr", "72hr", "120hr")),
    treatment = factor(treatment, levels = c("C", "INF"))
  )

Adding Replicate Information

  • We don’t really know which replicates may be related but
    • using the .by argument nests the data invisibly
    • This is why it was done in a separate call to mutate()
pcr %>% 
  mutate(
    sample = str_replace_all(sample, "^([CINF]+) *([0-9]+)[Hh]", "\\1 \\2hr")
  ) %>% 
  separate(sample, into = c("treatment", "timepoint")) %>% 
  mutate(
    timepoint = factor(timepoint, levels = c("24hr", "72hr", "120hr")),
    treatment = factor(treatment, levels = c("C", "INF")) 
  ) %>% 
  mutate(
    replicate = seq_along(SLAM), .by = c(treatment, timepoint)
  )

. . .


I would usually do this when I load the data

Rewrite the Loading Steps

  • Re-load the data to make our life easier from here on
pcr <- read_excel("data/RealTimeData.xlsx") %>% 
  rename(sample = `...1`) %>% 
  mutate(
    sample = str_replace_all(sample, "^([CINF]+) *([0-9]+)[Hh]", "\\1 \\2hr")
  ) %>% 
  separate(sample, into = c("treatment", "timepoint")) %>% 
  mutate(
    timepoint = factor(timepoint, levels = c("24hr", "72hr", "120hr")),
    treatment = factor(treatment, levels = c("C", "INF")) 
  ) %>% 
  mutate(
    replicate = seq_along(SLAM), .by = c(treatment, timepoint)
  )

Merging Columns

  • The converse to separate() is unite()
    • The below shows manually setting the separator
    • Also tells unite() not to remove the original columns
pcr %>% 
  unite(
    col = "merged", all_of(c("treatment","timepoint", "replicate")),
    sep = ":", remove = FALSE
  ) %>% 
  head()
# A tibble: 6 × 6
  merged     treatment timepoint  SLAM GAPDH replicate
  <chr>      <fct>     <fct>     <dbl> <dbl>     <int>
1 C:24hr:1   C         24hr       23.9  17.7         1
2 C:24hr:2   C         24hr       23.5  17.4         2
3 C:24hr:3   C         24hr       23.7  17.8         3
4 INF:24hr:1 INF       24hr       23.4  15.7         1
5 INF:24hr:2 INF       24hr       23.3  15.4         2
6 INF:24hr:3 INF       24hr       23.4  15.8         3
  • all_of() is another helper related to any_of()
  • Takes a character vector

Long Form

  • If we want to plot the values of SLAM & GAPDH \(\implies\) pivot_longer()
    • Colloquially known as long form
    • Very helpful for regression models & visualisation in R
pcr %>% 
  pivot_longer(
    cols = all_of(c("SLAM", "GAPDH")), names_to = "gene", values_to = "Ct"
  )
# A tibble: 36 × 5
   treatment timepoint replicate gene     Ct
   <fct>     <fct>         <int> <chr> <dbl>
 1 C         24hr              1 SLAM   23.9
 2 C         24hr              1 GAPDH  17.7
 3 C         24hr              2 SLAM   23.5
 4 C         24hr              2 GAPDH  17.4
 5 C         24hr              3 SLAM   23.7
 6 C         24hr              3 GAPDH  17.8
 7 INF       24hr              1 SLAM   23.4
 8 INF       24hr              1 GAPDH  15.7
 9 INF       24hr              2 SLAM   23.3
10 INF       24hr              2 GAPDH  15.4
# ℹ 26 more rows

Visualising The Data

  • Now we can plot each gene by timepoint and treatment
    • Faceting by gene
pcr %>% 
  pivot_longer(
    cols = all_of(c("SLAM", "GAPDH")), names_to = "gene", values_to = "Ct"
  ) %>% 
  ggplot(aes(timepoint, Ct, fill = treatment)) +
  geom_boxplot() +
  facet_wrap(~gene, scales = "free_y")

From Long To Wide

  • The alternative is pivot_wider()
    • Not so great for plotting
    • Can be useful for tables
pcr %>% 
  mutate(dCt = SLAM - GAPDH) %>% 
  pivot_wider(
    names_from = "replicate", values_from = "dCt",
    id_cols = starts_with("t"), names_prefix = "rep"
  ) %>% 
  arrange(treatment, timepoint)
# A tibble: 6 × 5
  treatment timepoint  rep1  rep2  rep3
  <fct>     <fct>     <dbl> <dbl> <dbl>
1 C         24hr       6.17  6.04  5.92
2 C         72hr       6.35  5.74  5.83
3 C         120hr      5.88  6.46  6.79
4 INF       24hr       7.72  7.9   7.6 
5 INF       72hr       7.5   7.55  7.79
6 INF       120hr      6.08  5.79  6.02

Closing Comments

  • The tidyverse is under constant development
    • gather() \(\implies\) pivot_longer()
    • spread() \(\implies\) pivot_wider()

. . .

  • separate() is now superseded
    • separate_wider_delim()

Footnotes

  1. source unknown↩︎