library(tidyverse)
library(palmerpenguins)
library(ggpmisc)
theme_set(theme_bw())
Iteration And Lists
RAdelaide 2025
Iteration
Introducing Iteration
- R is explicitly designed to work with vectors
  - This is where its elegance, power and speed come from
  - We can avoid stepping through each value, as many other languages require
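As a small illustration of the point above, the sketch below compares a vectorised call with the explicit loop that many other languages would need. The vector x is a made-up example, not data from the workshop.

```r
x <- c(1, 4, 9, 16)

# Vectorised: one call handles every element
roots_vec <- sqrt(x)

# The explicit loop: step through each position in turn
roots_loop <- numeric(length(x))
for (i in seq_along(x)) {
  roots_loop[i] <- sqrt(x[i])
}

# Both approaches give the same answer
identical(roots_vec, roots_loop)
```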
BUT
- If we have a vector of file paths, how would we load them all?
- If we have a list of linear models, how do we deal with this?
- If we have multiple cell types subjected to the same experimental treatment, how do we combine and compare results?
Stepping Through A Vector
- In the directory data/benchmarks are 6 very similar files
  - Produced by a parameter sweep run on phoenix (UofA HPC)
    - Simulated DNA sequences with AHR motifs
    - Fitting a Poisson model to test for enrichment relative to control
    - Changing the size of the control dataset (n = 10, 50, 100, 250, 500, 1000)
    - Reporting the resource usage \(\implies\) a simple 2-line file
f <- here::here("data", "benchmarks") |>
  list.files(pattern = "tsv$", full.names = TRUE)
f
[1] "/home/stevie/TKI/RAdelaide25/data/benchmarks/AHR.poisson.n10.benchmark.tsv"
[2] "/home/stevie/TKI/RAdelaide25/data/benchmarks/AHR.poisson.n100.benchmark.tsv"
[3] "/home/stevie/TKI/RAdelaide25/data/benchmarks/AHR.poisson.n1000.benchmark.tsv"
[4] "/home/stevie/TKI/RAdelaide25/data/benchmarks/AHR.poisson.n250.benchmark.tsv"
[5] "/home/stevie/TKI/RAdelaide25/data/benchmarks/AHR.poisson.n50.benchmark.tsv"
[6] "/home/stevie/TKI/RAdelaide25/data/benchmarks/AHR.poisson.n500.benchmark.tsv"
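One way to step through a vector of paths like this is an explicit for loop that fills a pre-allocated list. The sketch below is only an illustration under assumptions: it writes two placeholder files to a temporary directory rather than touching data/benchmarks, and it uses base R's read.delim() so that it runs without extra packages. read_tsv() could be swapped in the same way.

```r
# Write two tiny placeholder TSV files to a temporary directory
# (hypothetical stand-ins, not the real benchmark files)
tmp <- file.path(tempdir(), "loop_demo")
dir.create(tmp, showWarnings = FALSE)
for (n in c(10, 50)) {
  write.table(
    data.frame(n = n),
    file = file.path(tmp, paste0("demo.n", n, ".tsv")),
    sep = "\t", row.names = FALSE, quote = FALSE
  )
}

# Collect the paths in the same way as for the real files
paths <- list.files(tmp, pattern = "tsv$", full.names = TRUE)

# Pre-allocate a list, then fill one element per file
df_list <- vector("list", length(paths))
names(df_list) <- basename(paths)
for (i in seq_along(paths)) {
  df_list[[i]] <- read.delim(paths[[i]])
}

names(df_list)
```

Naming the list by basename() keeps track of which file each element came from, which becomes useful once the elements are combined.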
Working With Lists
- We can use bind_rows() to form a single tibble
  - Including the .id argument will add list names to a column
# df_list is assumed to be a named list of tibbles, one per file
df_list |> bind_rows(.id = "file")
# A tibble: 6 × 11
file s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load
<chr> <dbl> <time> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AHR.pois… 11.2 00'11" 4172. 17559. 829. 1194. 0 0 385.
2 AHR.pois… 65.2 01'05" 4950. 18337. 3986. 4086. 0 0 496.
3 AHR.pois… 594. 09'54" 15535. 28940. 9731. 10368. 0 0 700.
4 AHR.pois… 154. 02'34" 6462. 19867. 4845. 5017. 0 0 667.
5 AHR.pois… 35.1 00'35" 4435. 17841. 914. 1264. 0 0 258.
6 AHR.pois… 300. 04'59" 9814. 23201. 6805. 7132. 0 0 706.
# ℹ 1 more variable: cpu_time <dbl>
- From here, use mutate() to extract n = 10, 50, …
- Make a lovely plot
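The two steps above can be sketched on placeholder data. The list toy_list and its dummy column x are assumptions made for illustration, and only the file names are taken from the listing earlier. The regular expression is one possible way to pull out the digits that follow ".n".

```r
library(dplyr)
library(stringr)

# Placeholder tibbles named with three of the file names shown earlier;
# the column `x` is a dummy value, not real benchmark output
file_names <- c(
  "AHR.poisson.n10.benchmark.tsv",
  "AHR.poisson.n100.benchmark.tsv",
  "AHR.poisson.n1000.benchmark.tsv"
)
toy_list <- lapply(seq_along(file_names), \(i) tibble(x = i)) |>
  setNames(file_names)

toy_df <- toy_list |>
  # The list names end up in the column called `file`
  bind_rows(.id = "file") |>
  # Extract the digits that follow ".n" and convert them to integers
  mutate(n = as.integer(str_extract(file, "(?<=\\.n)[0-9]+")))

toy_df
```

Converting n to an integer matters for plotting: left as text, the values would be ordered alphabetically (10, 100, 1000, 250, ...) just like the file names.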
Using an R-Style Approach
- R offers an approach using lapply()
- Stands for list-apply
- We apply a function to each element of a vector
- Will always return a list
lapply(f, read_tsv)
[[1]]
# A tibble: 1 × 10
s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
<dbl> <time> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 11.2 00'11" 4172. 17559. 829. 1194. 0 0 385. 49.8
[[2]]
# A tibble: 1 × 10
s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
<dbl> <time> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 65.2 01'05" 4950. 18337. 3986. 4086. 0 0 496. 329.
[[3]]
# A tibble: 1 × 10
s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
<dbl> <time> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 594. 09'54" 15535. 28940. 9731. 10368. 0 0 700. 4382.
[[4]]
# A tibble: 1 × 10
s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
<dbl> <time> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 154. 02'34" 6462. 19867. 4845. 5017. 0 0 667. 1034.
[[5]]
# A tibble: 1 × 10
s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
<dbl> <time> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 35.1 00'35" 4435. 17841. 914. 1264. 0 0 258. 94.6
[[6]]
# A tibble: 1 × 10
s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
<dbl> <time> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 300. 04'59" 9814. 23201. 6805. 7132. 0 0 706. 2121.
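The same behaviour can be seen on a much smaller input. This is a minimal sketch using a made-up named vector x, showing that lapply() hands back a list even when every result is a single number.

```r
x <- c(a = 1, b = 2, c = 3)

# Apply an anonymous function to each element
squares <- lapply(x, function(i) i^2)

is.list(squares)  # TRUE, even though each element is a single number
names(squares)    # names of the input are carried over

# unlist() collapses the list back to a named vector
unlist(squares)
```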
Beyond lapply()
Using apply()
- A common scenario is to perform an operation on a matrix
- Can summarise by row or column
- For this we use apply(), including the MARGIN argument
  - For rows: MARGIN = 1
  - For columns: MARGIN = 2
- Can even be applied to 3D arrays using MARGIN = 3
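A short sketch of the MARGIN argument, using a small made-up matrix and array rather than any workshop data:

```r
# A 2 x 3 matrix, filled column by column
m <- matrix(1:6, nrow = 2)

row_sums <- apply(m, MARGIN = 1, FUN = sum)  # one value per row
col_sums <- apply(m, MARGIN = 2, FUN = sum)  # one value per column

# A 2 x 3 x 4 array: MARGIN = 3 summarises each 2 x 3 slice
arr <- array(1:24, dim = c(2, 3, 4))
slice_sums <- apply(arr, MARGIN = 3, FUN = sum)

row_sums
col_sums
slice_sums
```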
Alternatives to lapply()
- A common alternative to lapply() is sapply()
  - Tries to simplify the output by default
  - Can be a little unpredictable as a result, as seen here
  - Automatically uses the elements of x as names when x is a character vector
sapply(f, read_tsv)
sapply(f, read_tsv, simplify = FALSE)
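The unpredictability comes from the shape of the output depending on what the function returns. The toy calls below are illustrative inputs, not part of the workshop data, and show sapply() returning a vector, a matrix and a list in turn.

```r
# Every result has length 1: simplified to a vector
v <- sapply(1:3, function(i) i^2)

# Every result has length 2: simplified to a matrix
m <- sapply(1:3, function(i) c(i, i^2))

# Results differ in length: nothing to simplify, so a list comes back
l <- sapply(1:3, function(i) seq_len(i))

# A character input supplies the names of the output
nm <- sapply(c("a", "bb"), nchar)

class(v)
dim(m)
class(l)
names(nm)
```

Setting simplify = FALSE removes the guesswork, since the result is then always a list.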
The Package purrr
- The tidyverse package purrr reimplements these using map()
  - The idea is we map an input to an output
- map() mostly replicates lapply()
f |>
  setNames(basename(f)) |>
  map(read_tsv)
$AHR.poisson.n10.benchmark.tsv
# A tibble: 1 × 10
s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
<dbl> <time> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 11.2 00'11" 4172. 17559. 829. 1194. 0 0 385. 49.8
$AHR.poisson.n100.benchmark.tsv
# A tibble: 1 × 10
s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
<dbl> <time> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 65.2 01'05" 4950. 18337. 3986. 4086. 0 0 496. 329.
$AHR.poisson.n1000.benchmark.tsv
# A tibble: 1 × 10
s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
<dbl> <time> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 594. 09'54" 15535. 28940. 9731. 10368. 0 0 700. 4382.
$AHR.poisson.n250.benchmark.tsv
# A tibble: 1 × 10
s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
<dbl> <time> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 154. 02'34" 6462. 19867. 4845. 5017. 0 0 667. 1034.
$AHR.poisson.n50.benchmark.tsv
# A tibble: 1 × 10
s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
<dbl> <time> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 35.1 00'35" 4435. 17841. 914. 1264. 0 0 258. 94.6
$AHR.poisson.n500.benchmark.tsv
# A tibble: 1 × 10
s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
<dbl> <time> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 300. 04'59" 9814. 23201. 6805. 7132. 0 0 706. 2121.
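A minimal comparison on a made-up named vector shows how closely map() follows lapply(). It also shows map_dbl(), one of purrr's typed variants, which returns an atomic vector rather than a list.

```r
library(purrr)

x <- c(a = 1, b = 4, c = 9)

# map() returns a named list, just as lapply() does
via_map <- map(x, sqrt)
via_lapply <- lapply(x, sqrt)

# map_dbl() returns a named numeric vector, or fails if a result
# is not a single number
roots <- map_dbl(x, sqrt)

roots
```

The typed variants are a more predictable alternative to sapply(): the output type is chosen up front instead of being guessed from the results.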
List Columns Within Data Frames
Back To The Penguins
penguins |>
  summarise(
    lm = list(
      lm(bill_length_mm ~ body_mass_g)
    ),
    .by = c(species)
  )
# A tibble: 3 × 2
species lm
<fct> <list>
1 Adelie <lm>
2 Gentoo <lm>
3 Chinstrap <lm>
- We might now have some ideas about this
- There are multiple ways to get where we’re going
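One of those ways is to iterate over the list column directly inside mutate(). The sketch below is only one possible route: it assumes the slope for body_mass_g is the quantity of interest and uses map_dbl() from purrr to return one number per model.

```r
library(dplyr)
library(purrr)
library(palmerpenguins)

fits <- penguins |>
  summarise(
    lm = list(lm(bill_length_mm ~ body_mass_g)),
    .by = species
  ) |>
  # Step through the list column, returning one slope per model
  mutate(slope = map_dbl(lm, \(m) coef(m)[["body_mass_g"]]))

select(fits, species, slope)
```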
A Challenge
- From here we’d need bind_rows() then dplyr::filter()
  - See if you can figure it out
- Make a barplot with standard error bars
- Make an alternative figure, showing the confidence intervals
  \(\implies\) show the slope as a point and use geom_errorbarh()
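For anyone who gets stuck, here is a hedged sketch of one possible route to the second figure. It is not the only solution: the helper names coef_list and slopes are made up for illustration, and the coefficient tables are built by hand from coef() and confint() rather than with a dedicated tidying package.

```r
library(tidyverse)
library(palmerpenguins)

fits <- penguins |>
  summarise(
    lm = list(lm(bill_length_mm ~ body_mass_g)),
    .by = species
  )

# One coefficient table per model, as a list named by species
coef_list <- fits$lm |>
  setNames(as.character(fits$species)) |>
  map(\(m) {
    ci <- confint(m)  # 95% confidence intervals by default
    tibble(
      term = names(coef(m)),
      estimate = unname(coef(m)),
      lower = unname(ci[, 1]),
      upper = unname(ci[, 2])
    )
  })

# Combine the tables, then keep only the slope term
slopes <- coef_list |>
  bind_rows(.id = "species") |>
  dplyr::filter(term == "body_mass_g")

# Slope as a point, with the confidence interval as a horizontal bar
p <- ggplot(slopes, aes(x = estimate, y = species)) +
  geom_point() +
  geom_errorbarh(aes(xmin = lower, xmax = upper))
```

Printing p draws the figure. The barplot with standard error bars follows the same pattern, with the standard errors taken from coef(summary(m)) instead of confint().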