Extending Data Frame Manipulation

RAdelaide 2024

Author

Affiliation

Dr Stevie Pederson

Black Ochre Data Labs
Telethon Kids Institute

Published

July 9, 2024

Using `tidyr`

The package `tidyr`

Is the next logical step from dplyr
Instead of working with data within columns
\(\implies\) manipulate entire columns
Analogous to some pivot table operations in Excel
Also a core tidyverse package

. . .

Start a new script: TidyR.R

A New Dataset

For this we’ll use the PCR data we glimpsed earlier ¹

library(tidyverse)
library(readxl)
pcr <- read_excel("data/RealTimeData.xlsx") %>% 
  rename(sample = `...1`)

rename is another dplyr function
The default empty column name ..1 is a bit weird
- Starting ... makes it a bit special
- Needs to be enclosed within backticks

A New Dataset

These values represent Ct values from a RT-PCR analysis
- Ct is the PCR cycle at which the machine detects ‘signal’
- Low numbers mean more of the RNA molecule
Two genes: SLAM & GAPDH
- GAPDH is a housekeeper gene
- Should be consistent across treatments, conditions etc

A New Dataset

The sample column also contains two pieces of information
- C/INF probably mean control or infected cells
- Timepoints are 24h, 72h or 120h
Looks like some data entry errors

. . .

How to separate the cell group (C/INF) from the timepoint?
If wanting to make boxplots, how would we fill by gene
- Should have a column with gene names
- Another with Ct values

Cleaning Data Entry Errors

There are multiple ways & steps to clean-up the sample column
- Need to add spaces where they are missing
- Place the hour annotations in a consistent format

. . .

My strategy would be to use str_replace_all()
1. Add the missing spaces
2. Change the hour format
We could use two captures for this

Combining Patterns

The key will be what we place between the captures
The pattern I chose was " *" \(\implies\) match zero or more spaces

pcr %>% 
  mutate(
    sample = str_replace_all(
      sample, "^([CINF]+) *([0-9]+)[Hh]", "\\1 \\2hr"
    )
  )

# A tibble: 18 × 3
   sample     SLAM GAPDH
   <chr>     <dbl> <dbl>
 1 C 24hr     23.9  17.7
 2 C 24hr     23.5  17.4
 3 C 24hr     23.7  17.8
 4 INF 24hr   23.4  15.7
 5 INF 24hr   23.3  15.4
 6 INF 24hr   23.4  15.8
 7 C 72hr     24.7  18.4
 8 C 72hr     24.3  18.6
 9 C 72hr     24.5  18.7
10 INF 72hr   23.4  15.9
11 INF 72hr   23.3  15.8
12 INF 72hr   23.2  15.4
13 C 120hr    24.4  18.5
14 C 120hr    24.7  18.2
15 C 120hr    25.1  18.3
16 INF 120hr  22.8  16.7
17 INF 120hr  22.8  17.0
18 INF 120hr  22.5  16.5

Notice we didn’t capture the optional space

Separating Columns

Now we’ve cleaned up the data-entry erors
\(\implies\) use separate() to split into two columns
- Will automatically separate using any non-alpha-numeric characters

pcr %>% 
  mutate(
    sample = str_replace_all(
      sample, "^([CINF]+) *([0-9]+)[Hh]", "\\1 \\2hr"
    )
  ) %>% 
  separate(sample, into = c("treatment", "timepoint"))

# A tibble: 18 × 4
   treatment timepoint  SLAM GAPDH
   <chr>     <chr>     <dbl> <dbl>
 1 C         24hr       23.9  17.7
 2 C         24hr       23.5  17.4
 3 C         24hr       23.7  17.8
 4 INF       24hr       23.4  15.7
 5 INF       24hr       23.3  15.4
 6 INF       24hr       23.4  15.8
 7 C         72hr       24.7  18.4
 8 C         72hr       24.3  18.6
 9 C         72hr       24.5  18.7
10 INF       72hr       23.4  15.9
11 INF       72hr       23.3  15.8
12 INF       72hr       23.2  15.4
13 C         120hr      24.4  18.5
14 C         120hr      24.7  18.2
15 C         120hr      25.1  18.3
16 INF       120hr      22.8  16.7
17 INF       120hr      22.8  17.0
18 INF       120hr      22.5  16.5

Notice that we used a character vector to provide the new colnames

Separating Columns

These look like categorical values \(\implies\) factor()

pcr %>% 
  mutate(
    sample = str_replace_all(sample, "^([CINF]+) *([0-9]+)[Hh]", "\\1 \\2hr")
  ) %>% 
  separate(sample, into = c("treatment", "timepoint")) %>% 
  mutate(
    timepoint = factor(timepoint, levels = c("24hr", "72hr", "120hr")),
    treatment = factor(treatment, levels = c("C", "INF"))
  )

Adding Replicate Information

We don’t really know which replicates may be related but
- using the .by argument nests the data invisibly
- This is why it was done in a separate call to mutate()

pcr %>% 
  mutate(
    sample = str_replace_all(sample, "^([CINF]+) *([0-9]+)[Hh]", "\\1 \\2hr")
  ) %>% 
  separate(sample, into = c("treatment", "timepoint")) %>% 
  mutate(
    timepoint = factor(timepoint, levels = c("24hr", "72hr", "120hr")),
    treatment = factor(treatment, levels = c("C", "INF")) 
  ) %>% 
  mutate(
    replicate = seq_along(SLAM), .by = c(treatment, timepoint)
  )

. . .

I would usually do this when I load the data

Rewrite the Loading Steps

Re-load the data to make our life easier from here on

pcr <- read_excel("data/RealTimeData.xlsx") %>% 
  rename(sample = `...1`) %>% 
  mutate(
    sample = str_replace_all(sample, "^([CINF]+) *([0-9]+)[Hh]", "\\1 \\2hr")
  ) %>% 
  separate(sample, into = c("treatment", "timepoint")) %>% 
  mutate(
    timepoint = factor(timepoint, levels = c("24hr", "72hr", "120hr")),
    treatment = factor(treatment, levels = c("C", "INF")) 
  ) %>% 
  mutate(
    replicate = seq_along(SLAM), .by = c(treatment, timepoint)
  )

Merging Columns

The converse to separate() is unite()
- The below shows manually setting the separator
- Also tells unite() not to remove the original columns

pcr %>% 
  unite(
    col = "merged", all_of(c("treatment","timepoint", "replicate")),
    sep = ":", remove = FALSE
  ) %>% 
  head()

# A tibble: 6 × 6
  merged     treatment timepoint  SLAM GAPDH replicate
  <chr>      <fct>     <fct>     <dbl> <dbl>     <int>
1 C:24hr:1   C         24hr       23.9  17.7         1
2 C:24hr:2   C         24hr       23.5  17.4         2
3 C:24hr:3   C         24hr       23.7  17.8         3
4 INF:24hr:1 INF       24hr       23.4  15.7         1
5 INF:24hr:2 INF       24hr       23.3  15.4         2
6 INF:24hr:3 INF       24hr       23.4  15.8         3

all_of() is another helper related to any_of()
Takes a character vector

Long Form

If we want to plot the values of SLAM & GAPDH \(\implies\) pivot_longer()
- Colloquially known as long form
- Very helpful for regression models & visualisation in R

pcr %>% 
  pivot_longer(
    cols = all_of(c("SLAM", "GAPDH")), names_to = "gene", values_to = "Ct"
  )

# A tibble: 36 × 5
   treatment timepoint replicate gene     Ct
   <fct>     <fct>         <int> <chr> <dbl>
 1 C         24hr              1 SLAM   23.9
 2 C         24hr              1 GAPDH  17.7
 3 C         24hr              2 SLAM   23.5
 4 C         24hr              2 GAPDH  17.4
 5 C         24hr              3 SLAM   23.7
 6 C         24hr              3 GAPDH  17.8
 7 INF       24hr              1 SLAM   23.4
 8 INF       24hr              1 GAPDH  15.7
 9 INF       24hr              2 SLAM   23.3
10 INF       24hr              2 GAPDH  15.4
# ℹ 26 more rows

Visualising The Data

Now we can plot each gene by timepoint and treatment
- Faceting by gene

pcr %>% 
  pivot_longer(
    cols = all_of(c("SLAM", "GAPDH")), names_to = "gene", values_to = "Ct"
  ) %>% 
  ggplot(aes(timepoint, Ct, fill = treatment)) +
  geom_boxplot() +
  facet_wrap(~gene, scales = "free_y")

From Long To Wide

The alternative is pivot_wider()
- Not so great for plotting
- Can be useful for tables

pcr %>% 
  mutate(dCt = SLAM - GAPDH) %>% 
  pivot_wider(
    names_from = "replicate", values_from = "dCt",
    id_cols = starts_with("t"), names_prefix = "rep"
  ) %>% 
  arrange(treatment, timepoint)

# A tibble: 6 × 5
  treatment timepoint  rep1  rep2  rep3
  <fct>     <fct>     <dbl> <dbl> <dbl>
1 C         24hr       6.17  6.04  5.92
2 C         72hr       6.35  5.74  5.83
3 C         120hr      5.88  6.46  6.79
4 INF       24hr       7.72  7.9   7.6 
5 INF       72hr       7.5   7.55  7.79
6 INF       120hr      6.08  5.79  6.02

Closing Comments

The tidyverse is under constant development
- gather() \(\implies\) pivot_longer()
- spread() \(\implies\) pivot_wider()

. . .

separate() is now superseded
- separate_wider_delim()

Footnotes

source unknown↩︎

Using tidyr

The package tidyr

A New Dataset

A New Dataset

A New Dataset

Cleaning Data Entry Errors

Combining Patterns

Separating Columns

Separating Columns

Adding Replicate Information

Rewrite the Loading Steps

Merging Columns

Long Form

Visualising The Data

From Long To Wide

Closing Comments

Footnotes

Using `tidyr`

The package `tidyr`