Functions and Iteration

RAdelaide 2024

Dr Stevie Pederson

Black Ochre Data Labs
Telethon Kids Institute

July 10, 2024

Functions

Now familiar with using functions
Writing our own functions is an everyday skill in R
Sometimes complex \(\implies\) usually very simple
Mostly “inline” functions for simple data manipulation
- Very common for axis labels in ggplot()

library(tidyverse)

A Quick Example

Orange %>% 
  ggplot(
    aes(
      age, circumference, 
      colour = Tree
    )
  ) +
  geom_point() +
  scale_colour_brewer(
    palette = "Set1"
  )

Let’s say that we wish to add the prefix ‘Tree’ to the legend

A Quick Example

Orange %>% 
  ggplot(
    aes(
      age, circumference,
      colour = Tree
    )
  ) +
  geom_point() +
  scale_colour_brewer(
    palette = "Set1",
    labels = \(x) paste("Tree", x)
  )

\(x) is shorthand for function(x) (since R v4.1)
All labels are passed to the function as x

Inline Functions

This is often referred to as an inline function
Usually very simple, single line functions
- Often \(x) using x as the underlying value
We could’ve modified the underlying data (but didn’t)
Also very useful when using mutate() to modify columns

Inline Functions

A common step I use when modifying labels might be

flagstats <- c("properly_paired_reads", "unique_alignments")
flagstats %>% 
  str_replace_all("_", " ") %>% # Add spaces
  str_to_title() %>% # Capitalise the 1st letter
  str_wrap(12) # Add line breaks (\n) to a given width

[1] "Properly\nPaired Reads" "Unique\nAlignments"

When modifying x-axis labels this might lead to:

scale_x_discrete(
  labels = \(x) x %>% str_replace_all("_", " ") %>% str_to_title() %>% str_wrap(12)
)

Understanding Functions

A function really has multiple aspects

The formals() \(\implies\) the arguments we pass
The body() \(\implies\) the code that does stuff
The environment() where calculations take place

Let’s look through sd() starting at the help page ?sd

Understanding Functions

formals(sd)

$x


$na.rm
[1] FALSE

These are the arguments (or formals) required by the function
na.rm has a default value (FALSE)

Understanding Functions

body(sd)

sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), 
    na.rm = na.rm))

To make this more understandable

sd <- function(x, na.rm = FALSE) {
  ## Coerce to a double, only if needed
  if (!is.vector(x) || is.factor(x)) x <- as.double(x)
  ## Calculate the variance
  var_x <- var(x, na.rm = na.rm)
  ## Return the square root of the variance
  sqrt(var_x)
}

Writing Our Function

Let’s write that function for modifying labels
Start by deciding what the function might be called
- Also what arguments we need

modify_labels <- function(x) {
  
}

Writing Our Function

The first step is to change "_" to spaces
- The last line of a function will be returned by default

modify_labels <- function(x) {
  str_replace_all(x, "_", " ") # Replace all '_' with spaces
}
modify_labels(flagstats)

[1] "properly paired reads" "unique alignments"

Writing Our Function

We’re going to modify that again \(\implies\) let’s form an object
- Then return the new object

modify_labels <- function(x) {
  new_x <- str_replace_all(x, "_", " ") # Replace all '_' with spaces
  new_x # Return our final object
}
modify_labels(flagstats)

[1] "properly paired reads" "unique alignments"

Writing Our Function

Why are we referring to flagstats as x?

When we pass it to the function is temporarily renamed x
\(\implies\) But where is it called x?

Each function has it’s own internal environment
- Nested within the GlobalEnvironment but like “a separate bubble”

Writing Our Function

To complete the function

modify_labels <- function(x) {
  new_x <- str_replace_all(x, "_", " ") # Replace all '_' with spaces
  new_x <- str_to_title(new_x) # Start each word with an uppercase letter
  new_x <- str_wrap(new_x, width = 12) # Add line breaks after 12 characters
  new_x # Return our final object
}
modify_labels(flagstats)

[1] "Properly\nPaired Reads" "Unique\nAlignments"

Extending Our Function

Can we also control the width at which the text wraps
- Hard-wired to 12 internally

Add an extra argument called width with default value of 12
- Now this can be changed any time we call the function

modify_labels <- function(x, width = 12) {
  new_x <- str_replace_all(x, "_", " ") # Replace all '_' with spaces
  new_x <- str_to_title(new_x) # Start each word with an uppercase letter
  new_x <- str_wrap(new_x, width = width) # Add line breaks where requested
  new_x # Return our final object
}
modify_labels(flagstats)

[1] "Properly\nPaired Reads" "Unique\nAlignments"

modify_labels(flagstats, 80)

[1] "Properly Paired Reads" "Unique Alignments"

Extending Our Function

In many help pages \(\implies\) ... as a function argument
This allows for passing arguments to internal function calls
- Are not required to be set specifically
Check the help page ?str_wrap

Notice there are four additional arguments:
- width, indent, exdent and whitespace_only

Extending Our Function

Let’s remove width from our list of formal arguments
Replace with ...
Pass ... inside str_wrap

modify_labels <- function(x, ...) {
  new_x <- str_replace_all(x, "_", " ") # Replace all '_' with spaces
  new_x <- str_to_title(new_x) # Start each word with an upper-case letter
  new_x <- str_wrap(new_x, ...) # Add line breaks where requested
  new_x # Return our final object
}
modify_labels(flagstats)
modify_labels(flagstats, width = 12)
modify_labels(flagstats, width = 12, indent = 5)

Iteration

R sees everything as vectors
We didn’t need to modify each value of flagstats
- Not the case for most languages
- python, C, C++, perl etc step through vectors
  \(\implies\)process one value at a time

Iteration

# No need to type this code. Just look at my dumb examples
for (x in flagstats) {
  print(x) 
}

[1] "properly_paired_reads"
[1] "unique_alignments"

Each value was called x as we stepped through it
x is just a convention \(\implies\) can be anything (i, bob etc)

# More dumb examples
for (i in flagstats) {
  print(i)
}

[1] "properly_paired_reads"
[1] "unique_alignments"

for (bob in flagstats) {
  print(bob)
}

[1] "properly_paired_reads"
[1] "unique_alignments"

Iteration

Because R works on vectors
\(\implies\) almost never need to iterate on vectors

Lists however…

How would we get the length for each list element?

vals <- list(letters = letters, num = rnorm(1000))

Iteration is probably our first, best guess…

Iteration

# Never do this. It's just an example...
len <- c() # Initialise an empty object
for (x in vals) { # Step through 'vals' calling each element 'x'
  len <- c(len, length(x)) # Add the values as we step through
}
len

[1]   26 1000

The above:

Initialises an empty vector len
Steps through each element calling it x
Finds the length of x and extends len

The `R` Way to Iterate

Conventional iteration is very slow in R
Provides the function lapply
- Stands for list apply
- Applies a function to each element of a list

Basic syntax is lapply(list, function)

lapply(vals, length)

$letters
[1] 26

$num
[1] 1000

The `R` Way to Iterate

Additional arguments can also be passed
The full syntax is lapply(list, function, ...)

lapply(vals, head, n = 4)

$letters
[1] "a" "b" "c" "d"

$num
[1] 0.04135964 1.42256234 0.17197915 1.60996354

The `R` Way to Iterate

lapply() will always return a list
Safest option
- Calling head gave two elements of different types
- Calling length gave two integer elements

We could’ve coerced the lengths into a vector

lapply(vals, length) %>% unlist()

letters     num 
     26    1000

Only useful if returning a common type

The `R` Way to Iterate

If we know what we’ll get\(\implies\) map_*() functions
Part of purrr \(\implies\) core tidyverse package

map_int(vals, length)

letters     num 
     26    1000

Alternatives are map_chr(), map_lgl(), map_dbl()
Will error if setting the wrong type
Only used when single values are returned

Getting Real

SNP Data

For the rest of this session we’ll look at some genotype data
Will put all the day’s material into practice
Everyone will have very different applications
Hopefully will help you figure out best approach for your data

Simulated data
Based on surviving moths after exposure to freezing temperature

SNP Data

snps <- read_csv("data/snps.csv")
dim(snps)

[1]  104 2001

snps[1:6, 1:10]

# A tibble: 6 × 10
  Population SNP1  SNP2  SNP3  SNP4  SNP5  SNP6  SNP7  SNP8  SNP9 
  <chr>      <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Control    AB    BB    AB    BB    BB    BB    BB    AB    AB   
2 Control    AB    AB    AB    AB    BB    <NA>  AA    AA    AB   
3 Control    BB    AB    AA    AA    AA    AB    AB    BB    AB   
4 Control    AB    AB    BB    AB    AB    AB    AA    AA    BB   
5 Control    BB    BB    AB    BB    AB    AA    AA    AB    AB   
6 Control    AB    AB    AA    AA    AB    <NA>  AB    BB    AB

Each row represents a surviving moth
We have 104 moths with 2000 SNP genotypes

SNP Data

Our task is to:

Perform Fisher’s Exact Test on each SNP locus
- Tests for association between genotype and survival
- Could be allele count or genotypes
- Dominant or recessive model
Decide which values to return
- Probably a p-value
- Do we want genotype counts? Odds Ratios?

lapply() and functions will be our friends

SNP Data

First let’s check our population sizes

snps %>% summarise(n = dplyr::n(), .by = Population)

# A tibble: 2 × 2
  Population     n
  <chr>      <int>
1 Control       53
2 Treat         51

Missing Genotypes

Check the missing genotypes

Know we know functions \(\implies\) across()
Is a dplyr function
- Enables us to apply a function across zero or more columns
- Uses tidyselect helpers
The help page has lots of information
- Lets use it first

Missing Genotypes

Apply the function is.na() across all columns that start_with “SNP”

snps %>% 
  mutate(across(starts_with("SNP"), is.na)) %>% 
  dplyr::select(1:10)

# A tibble: 104 × 10
   Population SNP1  SNP2  SNP3  SNP4  SNP5  SNP6  SNP7  SNP8  SNP9 
   <chr>      <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
 1 Control    FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 2 Control    FALSE FALSE FALSE FALSE FALSE TRUE  FALSE FALSE FALSE
 3 Control    FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 4 Control    FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 5 Control    FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 6 Control    FALSE FALSE FALSE FALSE FALSE TRUE  FALSE FALSE FALSE
 7 Control    FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 8 Control    FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 9 Control    FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
10 Control    FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# ℹ 94 more rows

Missing Genotypes

If we pass to summarise() we can count these accross all SNPs
- Perfect opportunity for an inline function

snps %>% 
  summarise(
    across(starts_with("SNP"), \(x) sum(is.na(x)))
  ) %>% 
  dplyr::select(1:10)

# A tibble: 1 × 10
   SNP1  SNP2  SNP3  SNP4  SNP5  SNP6  SNP7  SNP8  SNP9 SNP10
  <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1     3     0     1     0     2     3     1     1     1     1

Missing Genotypes

This gives the missing count for all 2000 loci
Maybe pivot_longer() might help

snps %>% 
  summarise(
    across(starts_with("SNP"), \(x) sum(is.na(x)))
  ) %>% 
  pivot_longer(everything(), names_to = "locus", values_to = "missing")

# A tibble: 2,000 × 2
   locus missing
   <chr>   <int>
 1 SNP1        3
 2 SNP2        0
 3 SNP3        1
 4 SNP4        0
 5 SNP5        2
 6 SNP6        3
 7 SNP7        1
 8 SNP8        1
 9 SNP9        1
10 SNP10       1
# ℹ 1,990 more rows

Missing Genotypes

Now we can summarise again
Will make a nice descriptive table in our rmarkdown report

snps %>% 
  summarise(
    across(starts_with("SNP"), \(x) sum(is.na(x)))
  ) %>% 
  pivot_longer(everything(), names_to = "locus", values_to = "missing") %>% 
  summarise(n = dplyr::n(), .by = missing) %>% 
  arrange(desc(n))

# A tibble: 7 × 2
  missing     n
    <int> <int>
1       1   721
2       0   706
3       2   361
4       3   164
5       4    39
6       5     8
7       6     1

Performing an Analysis

Let’s see if the A allele acts in a dominant manner
Compare the numbers with A alleles across populations

Classic Fisher’s Exact Test using a 2x2 table

	A_TRUE	A_FALSE
Control	a	b
Treat	c	d

No right or wrong strategy

Performing an Analysis

Start by converting to long form

snps %>% 
  pivot_longer(starts_with("SNP"), names_to = "locus", values_to = "genotype")

Can easily remove the missing genotypes

snps %>% 
  pivot_longer(starts_with("SNP"), names_to = "locus", values_to = "genotype") %>% 
  dplyr::filter(!is.na(genotype))

Performing an Analysis

Check for the presence of an A allele

snps %>% 
  pivot_longer(starts_with("SNP"), names_to = "locus", values_to = "genotype") %>% 
  dplyr::filter(!is.na(genotype)) %>% 
  mutate(A = str_detect(genotype, "A"))

Performing an Analysis

Now count by presence of A
- Set the grouping to be by Population, locus & A status

snps %>% 
  pivot_longer(starts_with("SNP"), names_to = "locus", values_to = "genotype") %>% 
  dplyr::filter(!is.na(genotype)) %>% 
  mutate(A = str_detect(genotype, "A")) %>% 
  summarise(n = dplyr::n(), .by = c(Population, locus, A))

Performing an Analysis

Move the counts into TRUE/FALSE columns
- The 2x2 tables now start to appear

snps %>% 
  pivot_longer(starts_with("SNP"), names_to = "locus", values_to = "genotype") %>% 
  dplyr::filter(!is.na(genotype)) %>% 
  mutate(A = str_detect(genotype, "A")) %>% 
  summarise(n = dplyr::n(), .by = c(Population, locus, A)) %>% 
  pivot_wider(
    names_from = "A", values_from = "n", values_fill = 0, names_prefix = "A_"
  ) %>% 
  arrange(locus)

Performing an Analysis

We can form a nested tibble for each locus

snps %>% 
  pivot_longer(starts_with("SNP"), names_to = "locus", values_to = "genotype") %>% 
  dplyr::filter(!is.na(genotype)) %>% 
  mutate(A = str_detect(genotype, "A")) %>% 
  summarise(n = dplyr::n(), .by = c(Population, locus, A)) %>% 
  pivot_wider(
    names_from = "A", values_from = "n", values_fill = 0, names_prefix = "A_"
  ) %>% 
  nest(df = c(Population, starts_with("A_")))

Nesting Columns

This is a new idea \(\implies\) we now have a list column
Look at the first one to see what the elements look like
- Not part of the analysis

snps %>% 
  pivot_longer(starts_with("SNP"), names_to = "locus", values_to = "genotype") %>% 
  dplyr::filter(!is.na(genotype)) %>% 
  mutate(A = str_detect(genotype, "A")) %>% 
  summarise(n = dplyr::n(), .by = c(Population, locus, A)) %>% 
  pivot_wider(
    names_from = "A", values_from = "n", values_fill = 0, names_prefix = "A_"
  ) %>% 
  nest(df = c(Population, starts_with("A_"))) %>% 
  slice(1) %>% pull(df)

Using `lapply()` On Nested Columns

We can use lapply() to perform an analysis on every nested df

snps %>% 
  pivot_longer(starts_with("SNP"), names_to = "locus", values_to = "genotype") %>% 
  dplyr::filter(!is.na(genotype)) %>% 
  mutate(A = str_detect(genotype, "A")) %>% 
  summarise(n = dplyr::n(), .by = c(Population, locus, A)) %>% 
  pivot_wider(
    names_from = "A", values_from = "n", values_fill = 0, names_prefix = "A_"
  ) %>% 
  nest(df = c(Population, starts_with("A_"))) %>% 
  mutate(
    ft = lapply(df, \(x) fisher.test(x[, c("A_TRUE", "A_FALSE")]))
  )

Using `lapply()` On Nested Columns

We now have a new list column with a list of results from each test
Objects of class htest
Will have an element called p.value
This is a double (i.e. numeric)

We can use map_dbl() to grab these values

Using `lapply()` On Nested Columns

snps %>% 
  pivot_longer(starts_with("SNP"), names_to = "locus", values_to = "genotype") %>% 
  dplyr::filter(!is.na(genotype)) %>% 
  mutate(A = str_detect(genotype, "A")) %>% 
  summarise(n = dplyr::n(), .by = c(Population, locus, A)) %>% 
  pivot_wider(
    names_from = "A", values_from = "n", values_fill = 0, names_prefix = "A_"
  ) %>% 
  nest(df = c(Population, starts_with("A_"))) %>% 
  mutate(
    ft = lapply(df, \(x) fisher.test(x[, c("A_TRUE", "A_FALSE")])),
    p = map_dbl(ft, \(x) x$p.value),
  )

Using `lapply()` On Nested Columns

How about an Odds Ratio?
- The OR is in an element called estimate

snps %>% 
  pivot_longer(starts_with("SNP"), names_to = "locus", values_to = "genotype") %>% 
  dplyr::filter(!is.na(genotype)) %>% 
  mutate(A = str_detect(genotype, "A")) %>% 
  summarise(n = dplyr::n(), .by = c(Population, locus, A)) %>% 
  pivot_wider(
    names_from = "A", values_from = "n", values_fill = 0, names_prefix = "A_"
  ) %>% 
  nest(df = c(Population, starts_with("A_"))) %>% 
  mutate(
    ft = lapply(df, \(x) fisher.test(x[, c("A_TRUE", "A_FALSE")])),
    OR = map_dbl(ft, \(x) x$estimate),
    p = map_dbl(ft, \(x) x$p.value),
  )

Using `lapply()` On Nested Columns

Getting counts will require using df again

snps %>% 
  pivot_longer(starts_with("SNP"), names_to = "locus", values_to = "genotype") %>% 
  dplyr::filter(!is.na(genotype)) %>% 
  mutate(A = str_detect(genotype, "A")) %>% 
  summarise(n = dplyr::n(), .by = c(Population, locus, A)) %>% 
  pivot_wider(
    names_from = "A", values_from = "n", values_fill = 0, names_prefix = "A_"
  ) %>% 
  nest(df = c(Population, starts_with("A_"))) %>% 
  mutate(
    ft = lapply(df, \(x) fisher.test(x[, c("A_TRUE", "A_FALSE")])),
    Control = map_int(df, \(x) dplyr::filter(x, Population == "Control")[["A_TRUE"]]),
    Treat = map_int(df, \(x) dplyr::filter(x, Population == "Treat")[["A_TRUE"]]),
    OR = map_dbl(ft, \(x) x$estimate),
    p = map_dbl(ft, \(x) x$p.value),
  )

The Final Analysis

snps %>% 
  pivot_longer(starts_with("SNP"), names_to = "locus", values_to = "genotype") %>% 
  dplyr::filter(!is.na(genotype)) %>% 
  mutate(A = str_detect(genotype, "A")) %>% 
  summarise(n = dplyr::n(), .by = c(Population, locus, A)) %>% 
  pivot_wider(
    names_from = "A", values_from = "n", values_fill = 0, names_prefix = "A_"
  ) %>% 
  nest(df = c(Population, starts_with("A_"))) %>% 
  mutate(
    ft = lapply(df, \(x) fisher.test(x[, c("A_TRUE", "A_FALSE")])),
    Control = map_int(df, \(x) dplyr::filter(x, Population == "Control")[["A_TRUE"]]),
    Treat = map_int(df, \(x) dplyr::filter(x, Population == "Treat")[["A_TRUE"]]),
    OR = map_dbl(ft, \(x) x$estimate),
    p = map_dbl(ft, \(x) x$p.value),
    adj_p = p.adjust(p, "bonferroni")
  ) %>% 
  arrange(p)

The Final Analysis

# A tibble: 2,000 × 8
   locus   df               ft      Control Treat    OR        p    adj_p
   <chr>   <list>           <list>    <int> <int> <dbl>    <dbl>    <dbl>
 1 SNP1716 <tibble [2 × 3]> <htest>      47     8  38.7 2.17e-14 4.35e-11
 2 SNP1236 <tibble [2 × 3]> <htest>      46    11  22.9 1.29e-11 2.57e- 8
 3 SNP1618 <tibble [2 × 3]> <htest>      45    12  22.7 3.00e-11 6.00e- 8
 4 SNP248  <tibble [2 × 3]> <htest>      44    10  18.8 7.91e-11 1.58e- 7
 5 SNP1730 <tibble [2 × 3]> <htest>      43    10  17.0 3.27e-10 6.54e- 7
 6 SNP311  <tibble [2 × 3]> <htest>      41    10  13.5 2.52e- 9 5.04e- 6
 7 SNP1385 <tibble [2 × 3]> <htest>      45    14  14.0 3.70e- 9 7.40e- 6
 8 SNP1647 <tibble [2 × 3]> <htest>      43    13  13.5 4.28e- 9 8.57e- 6
 9 SNP8    <tibble [2 × 3]> <htest>      46    16  13.5 9.41e- 9 1.88e- 5
10 SNP1993 <tibble [2 × 3]> <htest>      40    11  11.7 1.74e- 8 3.47e- 5
# ℹ 1,990 more rows

Summary

We could save this as a final object
Select our important columns and prepare a table

Summary

For this we needed to understand

When to use pivot_longer() and pivot_wider()
What is a list, vector and data.frame?
Difference between integer and double values
tidyselect helper functions + dplyr
How to use lapply() with inline functions
Extending lapply() using map_*() to produce vector output

Summary

Alternatives to map_*() are sapply() and vapply()
- sapply() is slightly unpredictable
- vapply() is a bit more clunky but powerful

Could’ve use unlist(lapply(...))

Why didn’t we?

Functions and Iteration

Functions

Functions

A Quick Example

A Quick Example

Inline Functions

Inline Functions

Understanding Functions

Understanding Functions

Understanding Functions

Writing Our Function

Writing Our Function

Writing Our Function

Writing Our Function

Writing Our Function

Extending Our Function

Extending Our Function

Extending Our Function

Iteration

Iteration

Iteration

Iteration

Iteration

The R Way to Iterate

The R Way to Iterate

The R Way to Iterate

The R Way to Iterate

Getting Real

SNP Data

SNP Data

SNP Data

SNP Data

Missing Genotypes

Missing Genotypes

Missing Genotypes

Missing Genotypes

Missing Genotypes

Performing an Analysis

Performing an Analysis

Performing an Analysis

Performing an Analysis

Performing an Analysis

Performing an Analysis

Nesting Columns

Using lapply() On Nested Columns

Using lapply() On Nested Columns

Using lapply() On Nested Columns

Using lapply() On Nested Columns

Using lapply() On Nested Columns

The Final Analysis

The Final Analysis

Summary

Summary

Summary

The `R` Way to Iterate

The `R` Way to Iterate

The `R` Way to Iterate

The `R` Way to Iterate

Using `lapply()` On Nested Columns

Using `lapply()` On Nested Columns

Using `lapply()` On Nested Columns

Using `lapply()` On Nested Columns

Using `lapply()` On Nested Columns