library(tidyverse)
Functions and Iteration
RAdelaide 2024
Functions
Functions
- Now familiar with using functions
- Writing our own functions is an everyday skill in
R
- Sometimes complex \(\implies\) usually very simple
- Mostly “inline” functions for simple data manipulation
- Very common for axis labels in
ggplot()
- Very common for axis labels in
A Quick Example
%>%
Orange ggplot(
aes(
age, circumference, colour = Tree
)+
) geom_point() +
scale_colour_brewer(
palette = "Set1"
)
. . .
- Let’s say that we wish to add the prefix ‘Tree’ to the legend
A Quick Example
Orange %>%
ggplot(
aes(
age, circumference,
colour = Tree
)
) +
geom_point() +
scale_colour_brewer(
palette = "Set1",
labels = \(x) paste("Tree", x)
)
. . .
\(x)
is shorthand forfunction(x)
(since R v4.1)- All labels are passed to the function as
x
- New function notation was introduced in R v4.1 (2021)
Inline Functions
- This is often referred to as an inline function
- Usually very simple, single line functions
- Often
\(x)
usingx
as the underlying value
- Often
- We could’ve modified the underlying data (but didn’t)
- Also very useful when using
mutate()
to modify columns
Inline Functions
- A common step I use when modifying labels might be
<- c("properly_paired_reads", "unique_alignments")
flagstats %>%
flagstats str_replace_all("_", " ") %>% # Add spaces
str_to_title() %>% # Capitalise the 1st letter
str_wrap(12) # Add line breaks (\n) to a given width
[1] "Properly\nPaired Reads" "Unique\nAlignments"
. . .
- When modifying x-axis labels this might lead to:
scale_x_discrete(
labels = \(x) x %>% str_replace_all("_", " ") %>% str_to_title() %>% str_wrap(12)
)
Piping like this is OK but up to you
Understanding Functions
A function really has multiple aspects
- The
formals()
\(\implies\) the arguments we pass - The
body()
\(\implies\) the code that does stuff - The
environment()
where calculations take place
Let’s look through sd()
starting at the help page ?sd
Understanding Functions
formals(sd)
$x
$na.rm
[1] FALSE
. . .
- These are the arguments (or formals) required by the function
na.rm
has a default value (FALSE
)
Understanding Functions
body(sd)
sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
na.rm = na.rm))
. . .
Re-formatted that might be
sqrt(
var(
if (is.vector(x) || is.factor(x)) x else as.double(x),
na.rm = na.rm
) )
Understanding Functions
So the full function might be written as:
<- function(x, na.rm = FALSE) {
sd sqrt(
var(
if (is.vector(x) || is.factor(x)) x else as.double(x),
na.rm = na.rm
)
) }
No need to write this function
Understanding Functions
To make this more understandable
<- function(x, na.rm = FALSE) {
sd ## Coerce to a double, only if needed
if (!is.vector(x) || is.factor(x)) x <- as.double(x)
## Calculate the variance
<- var(x, na.rm = na.rm)
var_x ## Return the square root of the variance
sqrt(var_x)
}
Writing Our Function
- Let’s write that function for modifying labels
- Start by deciding what the function might be called
- Also what arguments we need
<- function(x) {
modify_labels
}
Writing Our Function
- The first step is to change
"_"
to spaces- The last line will be returned by default
<- function(x) {
modify_labels str_replace_all(x, "_", " ")
}modify_labels(flagstats)
[1] "properly paired reads" "unique alignments"
Writing Our Function
- We’re going to modify that again \(\implies\) let’s form an object
- Then return the new object
<- function(x) {
modify_labels <- str_replace_all(x, "_", " ")
new_x
new_x
}modify_labels(flagstats)
[1] "properly paired reads" "unique alignments"
Looking Inside the Function
- Why are we referring to
flagstats
asx
?
. . .
- When we pass it to the function is temporarily renamed
x
\(\implies\) But where is it called x?
. . .
- Each function has it’s own environment
- Within the
GlobalEnvironment
but separate
- Within the
Looking Inside the Function
- If we add a
browser()
command, we can enter the function at that exact point
<- function(x) {
modify_labels browser()
<- str_replace_all(x, "_", " ")
new_x
new_x
}modify_labels(flagstats)
- You should have been jumped to the
Environnment
Tab - The Console should also say
Browse[1]>
This played up when testing both locally & on posit cloud
Looking Inside the Function
- We’re now looking inside the environment within the function
- By passing
flagstats
asx
\(\implies\) this is it’s name within this environment - Modifying within the function will not change the object in the Global Environment
. . .
- Within the browser execute
new_x <- str_replace_all(x, "_", " ")
. . .
- What happened?
- Should’ve seen the new object created in the function environment
Writing Our Function
- Exit the
browser()
by typingQ
- Remove the
browser()
line by adding a comment (#
) to the start
. . .
- Now we can finish the function
<- function(x) {
modify_labels # browser()
<- str_replace_all(x, "_", " ")
new_x <- str_to_title(new_x)
new_x <- str_wrap(new_x, 12)
new_x
new_x
}modify_labels(flagstats)
[1] "Properly\nPaired Reads" "Unique\nAlignments"
. . .
Why didn’t I use a pipe?
Extending Our Function
- Can we also control the width at which the text wraps
- Hard-wired to
12
internally
- Hard-wired to
. . .
- Add an extra argument called
width
with default value of 12- Now this can be changed any time we call the function
<- function(x, width = 12) {
modify_labels # browser()
<- str_replace_all(x, "_", " ")
new_x <- str_to_title(new_x)
new_x <- str_wrap(new_x, width)
new_x
new_x
}modify_labels(flagstats)
[1] "Properly\nPaired Reads" "Unique\nAlignments"
modify_labels(flagstats, 80)
[1] "Properly Paired Reads" "Unique Alignments"
Extending Our Function
- In many help pages you may see
...
as a function argument - This allows for passing arguments to internal functions
- Are not required to be set specifically
- Check the help page
?str_wrap
. . .
- Notice there are four additional arguments:
width
,indent
,exdent
andwhitespace_only
Extending Our Function
- Let’s remove width from our list of formal arguments
- Replace with
...
- Pass
...
insidestr_wrap
<- function(x, ...) {
modify_labels # browser()
<- str_replace_all(x, "_", " ")
new_x <- str_to_title(new_x)
new_x <- str_wrap(new_x, ...)
new_x
new_x
}modify_labels(flagstats)
[1] "Properly Paired Reads" "Unique Alignments"
modify_labels(flagstats, width = 12)
[1] "Properly\nPaired Reads" "Unique\nAlignments"
Extending Our Function
- The default values will now be applied by
str_wrap()
unless we change them - The following would indent the first line by 2 spaces
- May be a little pointless …
modify_labels(flagstats, width = 12, indent = 2)
[1] " Properly\nPaired Reads" " Unique\nAlignments"
. . .
- When writing functions need to choose target function carefully
Finishing Our Function
- This is a more formal implementation of the original process
- These two are functionally identical
- One more suited for inline use as a one-off
- The other may be more useful if using repeatedly
<- \(x, ...) x %>% str_replace_all("_", " ") %>% str_to_title() %>% str_wrap(...)
one_liner one_liner(flagstats, width = 12)
[1] "Properly\nPaired Reads" "Unique\nAlignments"
modify_labels(flagstats, width = 12)
[1] "Properly\nPaired Reads" "Unique\nAlignments"
. . .
What else could we have done?
- Added comments to explain each line
- Added checks like
stopifnot(is.character(x))
Iteration
Iteration
R
sees everything as vectors- We didn’t need to modify each value of
flagstats
- Not the case for most languages
python
,C
,C++
.perl
etc step through vectors
\(\implies\)process one value at a time
Iteration
- A python-like process might’ve been
<- c()
new_x for (x in flagstats) {
<- c(new_x, modify_labels(x, width = 12))
new_x
} new_x
[1] "Properly\nPaired Reads" "Unique\nAlignments"
. . .
- We formed an empty object
- Stepped through every value in flagstats
- Modified values one at a time & added to
new_x
. . .
Stupidly slow in R
Iteration
- Each value was called
x
as we stepped through it x
is just a convention \(\implies\) can be anything (i
,bob
etc)
for (x in flagstats) {
print(x)
}
[1] "properly_paired_reads"
[1] "unique_alignments"
for (i in flagstats) {
print(i)
}
[1] "properly_paired_reads"
[1] "unique_alignments"
for (bob in flagstats) {
print(bob)
}
[1] "properly_paired_reads"
[1] "unique_alignments"
Iteration
- Because
R
works on vectors \(\implies\) rarely need to iterate on vectors
. . .
- Lists however…
. . .
- How would we get the length for each list element?
<- list(letters = letters, num = rnorm(1000)) vals
. . .
Iteration is probably our first, best guess…
Iteration
<- list()
len for (x in vals) {
<- c(len, length(x))
len
}names(len) <- names(vals)
len
$letters
[1] 26
$num
[1] 1000
The above
- Initialises an empty list
len
- Steps through each element calling it
x
- Finds the length of
x
and extendslen
- Adds the names so we know which value goes with which element
The R
Way to Iterate
- Conventional iteration is very slow in
R
- Provides the function
lapply
- Stands for list apply
- Applies a function to each element of a list
. . .
- Syntax is
lapply(list, function, ...)
. . .
lapply(vals, length)
$letters
[1] 26
$num
[1] 1000
- No
...
arguments needed here
Much simpler way to work & is faster
The R
Way to Iterate
A quick example using the ...
lapply(vals, head, n = 4)
$letters
[1] "a" "b" "c" "d"
$num
[1] -0.06136509 1.17332403 -1.37271761 -0.97836301
The R
Way to Iterate
lapply()
will always return a list- Safest option
- Calling
head
gave two elements of different types - Calling
length
gave twointeger
elements
- Calling
. . .
- We could’ve coerced the lengths into a vector
lapply(vals, length) %>% unlist()
letters num
26 1000
. . .
- Only useful if returning a common type
The R
Way to Iterate
- If we know what we’ll get\(\implies\)
map_*()
functions - Part of
purrr
\(\implies\) coretidyverse
package
map_int(vals, length)
letters num
26 1000
. . .
- Alternatives are
map_chr()
,map_lgl()
,map_dbl()
- Will error if setting the wrong type
- Only used when single values are returned
Getting Real
SNP Data
- For the rest of this session we’ll look at some genotype data
- Will put all the day’s material into practice
- Everyone will have very different applications
- Hopefully will help you figure out best approach for your data
. . .
- Simulated data
- Based on surviving moths after exposure to freezing temperature
SNP Data
<- read_csv("data/snps.csv")
snps dim(snps)
[1] 104 2001
1:6, 1:10] snps[
# A tibble: 6 × 10
Population SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Control AB BB AB BB BB BB BB AB AB
2 Control AB AB AB AB BB <NA> AA AA AB
3 Control BB AB AA AA AA AB AB BB AB
4 Control AB AB BB AB AB AB AA AA BB
5 Control BB BB AB BB AB AA AA AB AB
6 Control AB AB AA AA AB <NA> AB BB AB
. . .
- Each row represents a surviving moth
- We have 104 moths with 2000 SNP genotypes
SNP Data
Our task is to:
- Perform Fisher’s Exact Test on each SNP locus
- Tests for association between genotype and survival
- Could be allele count or genotypes
- Dominant or recessive model
- Decide which values to return
- Probably a p-value
- Do we want genotype counts? Odds Ratios?
. . .
lapply()
and functions will be our friends
SNP Data
- First let’s check our population sizes
%>% summarise(n = dplyr::n(), .by = Population) snps
# A tibble: 2 × 2
Population n
<chr> <int>
1 Control 53
2 Treat 51
Missing Genotypes
- Check the missing genotypes
. . .
- Know we know functions \(\implies\)
across()
- Is a
dplyr
function- Enables us to apply a function across zero or more columns
- Uses
tidyselect
helpers
- The help page has lots of information
- Lets use it first
Missing Genotypes
- Apply the function
is.na()
across all columns that start_with “SNP”
%>%
snps mutate(across(starts_with("SNP"), is.na)) %>%
::select(1:10) dplyr
# A tibble: 104 × 10
Population SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9
<chr> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
1 Control FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
2 Control FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
3 Control FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
4 Control FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
5 Control FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
6 Control FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
7 Control FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
8 Control FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
9 Control FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
10 Control FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# ℹ 94 more rows
Missing Genotypes
- If we pass to
summarise()
we can count these accross all SNPs- Perfect opportunity for an inline function
%>%
snps summarise(
across(starts_with("SNP"), \(x) sum(is.na(x)))
%>%
) ::select(1:10) dplyr
# A tibble: 1 × 10
SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 3 0 1 0 2 3 1 1 1 1
Missing Genotypes
- This gives the missing count for all 2000 loci
- Maybe
pivot_longer()
might help
%>%
snps summarise(
across(starts_with("SNP"), \(x) sum(is.na(x)))
%>%
) pivot_longer(everything(), names_to = "locus", values_to = "missing")
# A tibble: 2,000 × 2
locus missing
<chr> <int>
1 SNP1 3
2 SNP2 0
3 SNP3 1
4 SNP4 0
5 SNP5 2
6 SNP6 3
7 SNP7 1
8 SNP8 1
9 SNP9 1
10 SNP10 1
# ℹ 1,990 more rows
Missing Genotypes
- Now we can summarise again
- Will make a nice descriptive table in our
rmarkdown
report
%>%
snps summarise(
across(starts_with("SNP"), \(x) sum(is.na(x)))
%>%
) pivot_longer(everything(), names_to = "locus", values_to = "missing") %>%
summarise(n = dplyr::n(), .by = missing) %>%
arrange(desc(n))
# A tibble: 7 × 2
missing n
<int> <int>
1 1 721
2 0 706
3 2 361
4 3 164
5 4 39
6 5 8
7 6 1
Performing an Analysis
- Let’s see if the
A
allele acts in a dominant manner - Compare the numbers with A alleles across populations
. . .
- Classic Fisher’s Exact Test using a 2x2 table
A_TRUE | A_FALSE | |
---|---|---|
Control | a | b |
Treat | c | d |
. . .
- No right or wrong strategy
Performing an Analysis
- Start by converting to long form
%>%
snps pivot_longer(starts_with("SNP"), names_to = "locus", values_to = "genotype")
. . .
- Can easily remove the missing genotypes
%>%
snps pivot_longer(starts_with("SNP"), names_to = "locus", values_to = "genotype") %>%
::filter(!is.na(genotype)) dplyr
Performing an Analysis
- Check for the presence of an
A
allele
%>%
snps pivot_longer(starts_with("SNP"), names_to = "locus", values_to = "genotype") %>%
::filter(!is.na(genotype)) %>%
dplyrmutate(A = str_detect(genotype, "A"))
Performing an Analysis
- Now count by presence of A
- Set the grouping to be by Population, locus &
A
status
- Set the grouping to be by Population, locus &
%>%
snps pivot_longer(starts_with("SNP"), names_to = "locus", values_to = "genotype") %>%
::filter(!is.na(genotype)) %>%
dplyrmutate(A = str_detect(genotype, "A")) %>%
summarise(n = dplyr::n(), .by = c(Population, locus, A))
Performing an Analysis
- Move the counts into
TRUE/FALSE
columns- The 2x2 tables now start to appear
%>%
snps pivot_longer(starts_with("SNP"), names_to = "locus", values_to = "genotype") %>%
::filter(!is.na(genotype)) %>%
dplyrmutate(A = str_detect(genotype, "A")) %>%
summarise(n = dplyr::n(), .by = c(Population, locus, A)) %>%
pivot_wider(
names_from = "A", values_from = "n", values_fill = 0, names_prefix = "A_"
%>%
) arrange(locus)
Performing an Analysis
- We can form a nested
tibble
for each locus
%>%
snps pivot_longer(starts_with("SNP"), names_to = "locus", values_to = "genotype") %>%
::filter(!is.na(genotype)) %>%
dplyrmutate(A = str_detect(genotype, "A")) %>%
summarise(n = dplyr::n(), .by = c(Population, locus, A)) %>%
pivot_wider(
names_from = "A", values_from = "n", values_fill = 0, names_prefix = "A_"
%>%
) nest(df = c(Population, starts_with("A_")))
Nesting Columns
- This is a new idea \(\implies\) we now have a
list
column - Look at the first one to see what the elements look like
- Not part of the analysis
%>%
snps pivot_longer(starts_with("SNP"), names_to = "locus", values_to = "genotype") %>%
::filter(!is.na(genotype)) %>%
dplyrmutate(A = str_detect(genotype, "A")) %>%
summarise(n = dplyr::n(), .by = c(Population, locus, A)) %>%
pivot_wider(
names_from = "A", values_from = "n", values_fill = 0, names_prefix = "A_"
%>%
) nest(df = c(Population, starts_with("A_"))) %>%
slice(1) %>% pull(df)
Using lapply()
On Nested Columns
- We can use
lapply()
to perform an analysis on every nested df
%>%
snps pivot_longer(starts_with("SNP"), names_to = "locus", values_to = "genotype") %>%
::filter(!is.na(genotype)) %>%
dplyrmutate(A = str_detect(genotype, "A")) %>%
summarise(n = dplyr::n(), .by = c(Population, locus, A)) %>%
pivot_wider(
names_from = "A", values_from = "n", values_fill = 0, names_prefix = "A_"
%>%
) nest(df = c(Population, starts_with("A_"))) %>%
mutate(
ft = lapply(df, \(x) fisher.test(x[, c("A_TRUE", "A_FALSE")]))
)
Using lapply()
On Nested Columns
- We now have a new list column with a list of results from each test
- Objects of class
htest
- Will have an element called
p.value
- This is a
double
(i.e.numeric
)
. . .
- We can use
map_dbl()
to grab these values
Using lapply()
On Nested Columns
%>%
snps pivot_longer(starts_with("SNP"), names_to = "locus", values_to = "genotype") %>%
::filter(!is.na(genotype)) %>%
dplyrmutate(A = str_detect(genotype, "A")) %>%
summarise(n = dplyr::n(), .by = c(Population, locus, A)) %>%
pivot_wider(
names_from = "A", values_from = "n", values_fill = 0, names_prefix = "A_"
%>%
) nest(df = c(Population, starts_with("A_"))) %>%
mutate(
ft = lapply(df, \(x) fisher.test(x[, c("A_TRUE", "A_FALSE")])),
p = map_dbl(ft, \(x) x$p.value),
)
Using lapply()
On Nested Columns
- How about an Odds Ratio?
- The OR is in an element called
estimate
- The OR is in an element called
. . .
%>%
snps pivot_longer(starts_with("SNP"), names_to = "locus", values_to = "genotype") %>%
::filter(!is.na(genotype)) %>%
dplyrmutate(A = str_detect(genotype, "A")) %>%
summarise(n = dplyr::n(), .by = c(Population, locus, A)) %>%
pivot_wider(
names_from = "A", values_from = "n", values_fill = 0, names_prefix = "A_"
%>%
) nest(df = c(Population, starts_with("A_"))) %>%
mutate(
ft = lapply(df, \(x) fisher.test(x[, c("A_TRUE", "A_FALSE")])),
OR = map_dbl(ft, \(x) x$estimate),
p = map_dbl(ft, \(x) x$p.value),
)
Using lapply()
On Nested Columns
- Getting counts will require using
df
again
. . .
%>%
snps pivot_longer(starts_with("SNP"), names_to = "locus", values_to = "genotype") %>%
::filter(!is.na(genotype)) %>%
dplyrmutate(A = str_detect(genotype, "A")) %>%
summarise(n = dplyr::n(), .by = c(Population, locus, A)) %>%
pivot_wider(
names_from = "A", values_from = "n", values_fill = 0, names_prefix = "A_"
%>%
) nest(df = c(Population, starts_with("A_"))) %>%
mutate(
ft = lapply(df, \(x) fisher.test(x[, c("A_TRUE", "A_FALSE")])),
Control = map_int(df, \(x) dplyr::filter(x, Population == "Control")[["A_TRUE"]]),
Treat = map_int(df, \(x) dplyr::filter(x, Population == "Treat")[["A_TRUE"]]),
OR = map_dbl(ft, \(x) x$estimate),
p = map_dbl(ft, \(x) x$p.value),
)
The Final Analysis
%>%
snps pivot_longer(starts_with("SNP"), names_to = "locus", values_to = "genotype") %>%
::filter(!is.na(genotype)) %>%
dplyrmutate(A = str_detect(genotype, "A")) %>%
summarise(n = dplyr::n(), .by = c(Population, locus, A)) %>%
pivot_wider(
names_from = "A", values_from = "n", values_fill = 0, names_prefix = "A_"
%>%
) nest(df = c(Population, starts_with("A_"))) %>%
mutate(
ft = lapply(df, \(x) fisher.test(x[, c("A_TRUE", "A_FALSE")])),
Control = map_int(df, \(x) dplyr::filter(x, Population == "Control")[["A_TRUE"]]),
Treat = map_int(df, \(x) dplyr::filter(x, Population == "Treat")[["A_TRUE"]]),
OR = map_dbl(ft, \(x) x$estimate),
p = map_dbl(ft, \(x) x$p.value),
adj_p = p.adjust(p, "bonferroni")
%>%
) arrange(p)
# A tibble: 2,000 × 8
locus df ft Control Treat OR p adj_p
<chr> <list> <list> <int> <int> <dbl> <dbl> <dbl>
1 SNP1716 <tibble [2 × 3]> <htest> 47 8 38.7 2.17e-14 4.35e-11
2 SNP1236 <tibble [2 × 3]> <htest> 46 11 22.9 1.29e-11 2.57e- 8
3 SNP1618 <tibble [2 × 3]> <htest> 45 12 22.7 3.00e-11 6.00e- 8
4 SNP248 <tibble [2 × 3]> <htest> 44 10 18.8 7.91e-11 1.58e- 7
5 SNP1730 <tibble [2 × 3]> <htest> 43 10 17.0 3.27e-10 6.54e- 7
6 SNP311 <tibble [2 × 3]> <htest> 41 10 13.5 2.52e- 9 5.04e- 6
7 SNP1385 <tibble [2 × 3]> <htest> 45 14 14.0 3.70e- 9 7.40e- 6
8 SNP1647 <tibble [2 × 3]> <htest> 43 13 13.5 4.28e- 9 8.57e- 6
9 SNP8 <tibble [2 × 3]> <htest> 46 16 13.5 9.41e- 9 1.88e- 5
10 SNP1993 <tibble [2 × 3]> <htest> 40 11 11.7 1.74e- 8 3.47e- 5
# ℹ 1,990 more rows
For those with an OR > 1 the A allele might be a dominant susceptibility locus
Summary
- We could save this as a final object
- Select our important columns and prepare a table
Summary
For this we needed to understand
- When to use
pivot_longer()
andpivot_wider()
- What is a
list
,vector
anddata.frame
? - Difference between
integer
anddouble
values tidyselect
helper functions +dplyr
- How to use
lapply()
with inline functions - Extending
lapply()
usingmap_*()
to produce vector output
Summary
- Alternatives to
map_*()
aresapply()
andvapply()
sapply()
is slightly unpredictablevapply()
is a bit more clunky but powerful
. . .
- Could’ve use
unlist(lapply(...))
Why didn’t we?
Specifying the exact type will error if we’re wrong. This is good