library(tidyverse)
library(readxl)
<- read_excel("data/RealTimeData.xlsx") %>%
pcr rename(sample = `...1`)
Extending Data Frame Manipulation
RAdelaide 2024
Using tidyr
The package tidyr
- Is the next logical step from
dplyr
- Instead of working with data within columns
\(\implies\) manipulate entire columns - Analogous to some pivot table operations in Excel
- Also a core
tidyverse
package
. . .
- Start a new script:
TidyR.R
A New Dataset
- For this we’ll use the PCR data we glimpsed earlier 1
rename
is anotherdplyr
function- The default empty column name
..1
is a bit weird- Starting
...
makes it a bit special - Needs to be enclosed within backticks
- Starting
A New Dataset
- These values represent Ct values from a RT-PCR analysis
- Ct is the PCR cycle at which the machine detects ‘signal’
- Low numbers mean more of the RNA molecule
- Two genes: SLAM & GAPDH
- GAPDH is a housekeeper gene
- Should be consistent across treatments, conditions etc
A New Dataset
- The sample column also contains two pieces of information
C/INF
probably mean control or infected cells- Timepoints are 24h, 72h or 120h
- Looks like some data entry errors
. . .
- How to separate the cell group (C/INF) from the timepoint?
- If wanting to make boxplots, how would we fill by gene
- Should have a column with gene names
- Another with Ct values
Cleaning Data Entry Errors
- There are multiple ways & steps to clean-up the
sample
column- Need to add spaces where they are missing
- Place the hour annotations in a consistent format
. . .
- My strategy would be to use
str_replace_all()
- Add the missing spaces
- Change the hour format
- We could use two captures for this
Combining Patterns
- The key will be what we place between the captures
- The pattern I chose was
" *"
\(\implies\) match zero or more spaces
%>%
pcr mutate(
sample = str_replace_all(
"^([CINF]+) *([0-9]+)[Hh]", "\\1 \\2hr"
sample,
) )
# A tibble: 18 × 3
sample SLAM GAPDH
<chr> <dbl> <dbl>
1 C 24hr 23.9 17.7
2 C 24hr 23.5 17.4
3 C 24hr 23.7 17.8
4 INF 24hr 23.4 15.7
5 INF 24hr 23.3 15.4
6 INF 24hr 23.4 15.8
7 C 72hr 24.7 18.4
8 C 72hr 24.3 18.6
9 C 72hr 24.5 18.7
10 INF 72hr 23.4 15.9
11 INF 72hr 23.3 15.8
12 INF 72hr 23.2 15.4
13 C 120hr 24.4 18.5
14 C 120hr 24.7 18.2
15 C 120hr 25.1 18.3
16 INF 120hr 22.8 16.7
17 INF 120hr 22.8 17.0
18 INF 120hr 22.5 16.5
- Notice we didn’t capture the optional space
Separating Columns
- Now we’ve cleaned up the data-entry erors
\(\implies\) useseparate()
to split into two columns- Will automatically separate using any non-alpha-numeric characters
%>%
pcr mutate(
sample = str_replace_all(
"^([CINF]+) *([0-9]+)[Hh]", "\\1 \\2hr"
sample,
)%>%
) separate(sample, into = c("treatment", "timepoint"))
# A tibble: 18 × 4
treatment timepoint SLAM GAPDH
<chr> <chr> <dbl> <dbl>
1 C 24hr 23.9 17.7
2 C 24hr 23.5 17.4
3 C 24hr 23.7 17.8
4 INF 24hr 23.4 15.7
5 INF 24hr 23.3 15.4
6 INF 24hr 23.4 15.8
7 C 72hr 24.7 18.4
8 C 72hr 24.3 18.6
9 C 72hr 24.5 18.7
10 INF 72hr 23.4 15.9
11 INF 72hr 23.3 15.8
12 INF 72hr 23.2 15.4
13 C 120hr 24.4 18.5
14 C 120hr 24.7 18.2
15 C 120hr 25.1 18.3
16 INF 120hr 22.8 16.7
17 INF 120hr 22.8 17.0
18 INF 120hr 22.5 16.5
Notice that we used a character vector to provide the new colnames
Separating Columns
- These look like categorical values \(\implies\)
factor()
%>%
pcr mutate(
sample = str_replace_all(sample, "^([CINF]+) *([0-9]+)[Hh]", "\\1 \\2hr")
%>%
) separate(sample, into = c("treatment", "timepoint")) %>%
mutate(
timepoint = factor(timepoint, levels = c("24hr", "72hr", "120hr")),
treatment = factor(treatment, levels = c("C", "INF"))
)
Adding Replicate Information
- We don’t really know which replicates may be related but
- using the
.by
argument nests the data invisibly - This is why it was done in a separate call to
mutate()
- using the
%>%
pcr mutate(
sample = str_replace_all(sample, "^([CINF]+) *([0-9]+)[Hh]", "\\1 \\2hr")
%>%
) separate(sample, into = c("treatment", "timepoint")) %>%
mutate(
timepoint = factor(timepoint, levels = c("24hr", "72hr", "120hr")),
treatment = factor(treatment, levels = c("C", "INF"))
%>%
) mutate(
replicate = seq_along(SLAM), .by = c(treatment, timepoint)
)
. . .
I would usually do this when I load the data
Rewrite the Loading Steps
- Re-load the data to make our life easier from here on
<- read_excel("data/RealTimeData.xlsx") %>%
pcr rename(sample = `...1`) %>%
mutate(
sample = str_replace_all(sample, "^([CINF]+) *([0-9]+)[Hh]", "\\1 \\2hr")
%>%
) separate(sample, into = c("treatment", "timepoint")) %>%
mutate(
timepoint = factor(timepoint, levels = c("24hr", "72hr", "120hr")),
treatment = factor(treatment, levels = c("C", "INF"))
%>%
) mutate(
replicate = seq_along(SLAM), .by = c(treatment, timepoint)
)
Merging Columns
- The converse to
separate()
isunite()
- The below shows manually setting the separator
- Also tells
unite()
not to remove the original columns
%>%
pcr unite(
col = "merged", all_of(c("treatment","timepoint", "replicate")),
sep = ":", remove = FALSE
%>%
) head()
# A tibble: 6 × 6
merged treatment timepoint SLAM GAPDH replicate
<chr> <fct> <fct> <dbl> <dbl> <int>
1 C:24hr:1 C 24hr 23.9 17.7 1
2 C:24hr:2 C 24hr 23.5 17.4 2
3 C:24hr:3 C 24hr 23.7 17.8 3
4 INF:24hr:1 INF 24hr 23.4 15.7 1
5 INF:24hr:2 INF 24hr 23.3 15.4 2
6 INF:24hr:3 INF 24hr 23.4 15.8 3
all_of()
is another helper related toany_of()
- Takes a character vector
Long Form
- If we want to plot the values of SLAM & GAPDH \(\implies\)
pivot_longer()
- Colloquially known as long form
- Very helpful for regression models & visualisation in
R
%>%
pcr pivot_longer(
cols = all_of(c("SLAM", "GAPDH")), names_to = "gene", values_to = "Ct"
)
# A tibble: 36 × 5
treatment timepoint replicate gene Ct
<fct> <fct> <int> <chr> <dbl>
1 C 24hr 1 SLAM 23.9
2 C 24hr 1 GAPDH 17.7
3 C 24hr 2 SLAM 23.5
4 C 24hr 2 GAPDH 17.4
5 C 24hr 3 SLAM 23.7
6 C 24hr 3 GAPDH 17.8
7 INF 24hr 1 SLAM 23.4
8 INF 24hr 1 GAPDH 15.7
9 INF 24hr 2 SLAM 23.3
10 INF 24hr 2 GAPDH 15.4
# ℹ 26 more rows
Visualising The Data
- Now we can plot each gene by timepoint and treatment
- Faceting by gene
%>%
pcr pivot_longer(
cols = all_of(c("SLAM", "GAPDH")), names_to = "gene", values_to = "Ct"
%>%
) ggplot(aes(timepoint, Ct, fill = treatment)) +
geom_boxplot() +
facet_wrap(~gene, scales = "free_y")
From Long To Wide
- The alternative is
pivot_wider()
- Not so great for plotting
- Can be useful for tables
%>%
pcr mutate(dCt = SLAM - GAPDH) %>%
pivot_wider(
names_from = "replicate", values_from = "dCt",
id_cols = starts_with("t"), names_prefix = "rep"
%>%
) arrange(treatment, timepoint)
# A tibble: 6 × 5
treatment timepoint rep1 rep2 rep3
<fct> <fct> <dbl> <dbl> <dbl>
1 C 24hr 6.17 6.04 5.92
2 C 72hr 6.35 5.74 5.83
3 C 120hr 5.88 6.46 6.79
4 INF 24hr 7.72 7.9 7.6
5 INF 72hr 7.5 7.55 7.79
6 INF 120hr 6.08 5.79 6.02
Closing Comments
- The
tidyverse
is under constant developmentgather()
\(\implies\)pivot_longer()
spread
() \(\implies\)pivot_wider()
. . .
separate()
is now supersededseparate_wider_delim()
Footnotes
source unknown↩︎