Iteration And Lists

RAdelaide 2025

Author
Affiliation

Dr Stevie Pederson

Black Ochre Data Labs
Telethon Kids Institute

Published

July 10, 2025

Iteration

library(tidyverse)
library(palmerpenguins)
library(ggpmisc)
theme_set(theme_bw())

Introducing Iteration

  • R is explicitly designed to work with vectors
  • Where it’s elegance, power and speed comes from
  • We can avoid stepping through each value like many other languages

BUT

  • If we have a vector of file paths, how would we load them all?
  • If we have a list of linear models, how do we deal with this?
  • If we have multiple cell types subjected to the same experimental treatment, how do we combine and compare results?

Introducing Iteration

  • What if we fitted a linear model within each penguin species?
    • Aiming to make a barplot comparing slopes including standard errors
  • We can do this kind of thing with summarise
penguins |>
  summarise(
    lm = list(
      lm(bill_length_mm ~ body_mass_g)
    ),
    .by = c(species)
  )
# A tibble: 3 × 2
  species   lm    
  <fct>     <list>
1 Adelie    <lm>  
2 Gentoo    <lm>  
3 Chinstrap <lm>  
  • By wrapping the function lm as a list \(\implies\) a list column
  • Each column of a data.frame is just a vector \(\implies\) can also be a list

But how do we get the slopes & standard errors now?

Introducing Iteration

  • The obvious answer is we do one at a time, then move to the next
    \(\implies\)This is iteration
  • R has special tricks for this which speed things up profoundly
  • Lists become the absolute workhorse for this type of approach
  • Let’s start the slow way and speed things up

Stepping Through A Vector

  • In the directory data/benchmarks are 6 very similar files
  • Running a parameter sweep on phoenix (UofA HPC)
    • Simulated DNA sequences with AHR motifs
    • Fitting a poisson model to test for enrichment relative to control
    • Changing the size of the control dataset (n = 10, 50, 100, 250, 500, 100)
    • Reporting the resource usage \(\implies\) a simple 2-line file
f <- here::here("data", "benchmarks") |> 
  list.files(pattern = "tsv$", full.names = TRUE)
f
[1] "/home/stevie/TKI/RAdelaide25/data/benchmarks/AHR.poisson.n10.benchmark.tsv"  
[2] "/home/stevie/TKI/RAdelaide25/data/benchmarks/AHR.poisson.n100.benchmark.tsv" 
[3] "/home/stevie/TKI/RAdelaide25/data/benchmarks/AHR.poisson.n1000.benchmark.tsv"
[4] "/home/stevie/TKI/RAdelaide25/data/benchmarks/AHR.poisson.n250.benchmark.tsv" 
[5] "/home/stevie/TKI/RAdelaide25/data/benchmarks/AHR.poisson.n50.benchmark.tsv"  
[6] "/home/stevie/TKI/RAdelaide25/data/benchmarks/AHR.poisson.n500.benchmark.tsv" 

Stepping Through A Vector

  • Looking at the first file
read_tsv(f[[1]])
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  11.2 00'11"    4172.  17559.    829.   1194.     0      0      385.     49.8
  • Columns represent time taken, various RAM measures, I/O etc
  • Is standard output from snakemake when you benchmark a rule

Stepping Through A Vector

  • The simplest way would be to use a for loop
    • seq_along() will create a sequence of integers along a vector
    • Using for (i in seq_along(f)) will step through the vector one at a time
seq_along(f)
[1] 1 2 3 4 5 6
for (i in seq_along(f)) {
    message("For i = ", i, ", open ", f[[i]])
}
For i = 1, open /home/stevie/TKI/RAdelaide25/data/benchmarks/AHR.poisson.n10.benchmark.tsv
For i = 2, open /home/stevie/TKI/RAdelaide25/data/benchmarks/AHR.poisson.n100.benchmark.tsv
For i = 3, open /home/stevie/TKI/RAdelaide25/data/benchmarks/AHR.poisson.n1000.benchmark.tsv
For i = 4, open /home/stevie/TKI/RAdelaide25/data/benchmarks/AHR.poisson.n250.benchmark.tsv
For i = 5, open /home/stevie/TKI/RAdelaide25/data/benchmarks/AHR.poisson.n50.benchmark.tsv
For i = 6, open /home/stevie/TKI/RAdelaide25/data/benchmarks/AHR.poisson.n500.benchmark.tsv

Stepping Through A Vector

  • Now we know how a for loop works
    \(\implies\) What kind of object should we place each file into
  • My suggestion would be a list
  • We can create an empty list using list()
    • Then we can just add elements
df_list <- list()
for (i in seq_along(f)) {
  ## Step through the vector, loading each file one at a time
  df_list[[i]] <- read_tsv(f[[i]])
}

Stepping Through A Vector

  • It might be useful to add names to this \(\implies\) identify which file is which
## Give each list element a name, dropping the bulk of the file path
names(df_list) <- basename(f)
df_list
$AHR.poisson.n10.benchmark.tsv
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  11.2 00'11"    4172.  17559.    829.   1194.     0      0      385.     49.8

$AHR.poisson.n100.benchmark.tsv
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  65.2 01'05"    4950.  18337.   3986.   4086.     0      0      496.     329.

$AHR.poisson.n1000.benchmark.tsv
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  594. 09'54"   15535.  28940.   9731.  10368.     0      0      700.    4382.

$AHR.poisson.n250.benchmark.tsv
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  154. 02'34"    6462.  19867.   4845.   5017.     0      0      667.    1034.

$AHR.poisson.n50.benchmark.tsv
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  35.1 00'35"    4435.  17841.    914.   1264.     0      0      258.     94.6

$AHR.poisson.n500.benchmark.tsv
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  300. 04'59"    9814.  23201.   6805.   7132.     0      0      706.    2121.

Working With Lists

  • We can use bind_rows() to form a single tibble
    • Including the .id argument will add list names to a column
df_list |> bind_rows(.id = "file")
# A tibble: 6 × 11
  file          s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load
  <chr>     <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>
1 AHR.pois…  11.2 00'11"    4172.  17559.    829.   1194.     0      0      385.
2 AHR.pois…  65.2 01'05"    4950.  18337.   3986.   4086.     0      0      496.
3 AHR.pois… 594.  09'54"   15535.  28940.   9731.  10368.     0      0      700.
4 AHR.pois… 154.  02'34"    6462.  19867.   4845.   5017.     0      0      667.
5 AHR.pois…  35.1 00'35"    4435.  17841.    914.   1264.     0      0      258.
6 AHR.pois… 300.  04'59"    9814.  23201.   6805.   7132.     0      0      706.
# ℹ 1 more variable: cpu_time <dbl>
  • From here, use mutate() to extract n = 10, 50, …
  • Make a lovely plot

Using an R-Style Approach

  • R offers an approach using lapply()
    • Stands for list-apply
    • We apply a function to each element of a vector
    • Will always return a list
lapply(f, read_tsv)
[[1]]
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  11.2 00'11"    4172.  17559.    829.   1194.     0      0      385.     49.8

[[2]]
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  65.2 01'05"    4950.  18337.   3986.   4086.     0      0      496.     329.

[[3]]
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  594. 09'54"   15535.  28940.   9731.  10368.     0      0      700.    4382.

[[4]]
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  154. 02'34"    6462.  19867.   4845.   5017.     0      0      667.    1034.

[[5]]
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  35.1 00'35"    4435.  17841.    914.   1264.     0      0      258.     94.6

[[6]]
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  300. 04'59"    9814.  23201.   6805.   7132.     0      0      706.    2121.

Using an R-Style Approach

  • If the input vector has names \(\implies\) output will have names
  • Let’s repeat including names
f |>
  setNames(basename(f)) |> # Add names here to ensure the list has names
  lapply(read_tsv) |>
  bind_rows(.id = "file")
# A tibble: 6 × 11
  file          s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load
  <chr>     <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>
1 AHR.pois…  11.2 00'11"    4172.  17559.    829.   1194.     0      0      385.
2 AHR.pois…  65.2 01'05"    4950.  18337.   3986.   4086.     0      0      496.
3 AHR.pois… 594.  09'54"   15535.  28940.   9731.  10368.     0      0      700.
4 AHR.pois… 154.  02'34"    6462.  19867.   4845.   5017.     0      0      667.
5 AHR.pois…  35.1 00'35"    4435.  17841.    914.   1264.     0      0      258.
6 AHR.pois… 300.  04'59"    9814.  23201.   6805.   7132.     0      0      706.
# ℹ 1 more variable: cpu_time <dbl>

Using an R-Style Approach

  • The complete process could look like this
f |>
  setNames(basename(f)) |> # Add names here to ensure the list has names
  lapply(read_tsv) |>
  bind_rows(.id = "file") |>
  mutate(bg_size = str_extract(file, "[0-9]+") |> as.integer())
# A tibble: 6 × 12
  file          s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load
  <chr>     <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>
1 AHR.pois…  11.2 00'11"    4172.  17559.    829.   1194.     0      0      385.
2 AHR.pois…  65.2 01'05"    4950.  18337.   3986.   4086.     0      0      496.
3 AHR.pois… 594.  09'54"   15535.  28940.   9731.  10368.     0      0      700.
4 AHR.pois… 154.  02'34"    6462.  19867.   4845.   5017.     0      0      667.
5 AHR.pois…  35.1 00'35"    4435.  17841.    914.   1264.     0      0      258.
6 AHR.pois… 300.  04'59"    9814.  23201.   6805.   7132.     0      0      706.
# ℹ 2 more variables: cpu_time <dbl>, bg_size <int>

Using an R-Style Approach

  • And now make the figure we need
f |>
  setNames(basename(f)) |>
  lapply(read_tsv) |>
  bind_rows(.id = "file") |>
  mutate(
    bg_size = file |>
      str_extract("[0-9]+") |>
      as.integer()
  ) |>
  ggplot(aes(bg_size, s)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm") +
  stat_poly_eq(use_label("eq")) +
  labs(
    x = "BG Size", 
    y = "Time Taken (sec)"
  ) 

Beyond lapply()

Using apply()

  • A common scenario is to perform an operation on a matrix
    • Can summarise by row or column
  • For this we use apply() including the MARGIN argument
    • For rows: MARGIN = 1
    • For columns: MARGIN = 2
  • Can even be applied to 3D arrays using MARGIN=3

Using apply()

  • AirPassengers is a time-series object
    • Just a vector dressed up to look like a matrix
    • We can coerce to a matrix as below
## Coerce the time-series object to be an actual matrix
AirPassengers |>
  matrix(ncol = 12, byrow = TRUE, dimnames = list(1949:1960, month.abb)) 
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432

Using apply()

  • We could summarise by year or by month
## Start by finding the mean in every month
AirPassengers |>
  matrix(ncol = 12, byrow = TRUE, dimnames = list(1949:1960, month.abb)) |>
  apply(MARGIN = 2, FUN = mean)
     Jan      Feb      Mar      Apr      May      Jun      Jul      Aug 
241.7500 235.0000 270.1667 267.0833 271.8333 311.6667 351.3333 351.0833 
     Sep      Oct      Nov      Dec 
302.4167 266.5833 232.8333 261.8333 
  • Now find the total number of passengers each year
    • use apply() with MARGIN = 1
  • This is a pretty common operation
    \(\implies\) rowSums(), colSums(), rowMeans() and colMeans() all exist

Alternatives to lapply()

  • Can be a little unpredictable as seen here
  • A common alternative to lapply() is sapply()
    • Tries to simplify by default
    • Automatically uses the elements of x as names
sapply(f, read_tsv)
sapply(f, read_tsv, simplify = FALSE)

Alternatives to lapply()

  • vapply() is for when you need a vector as output
    • Output type must be strictly defined
f |>
  lapply(read_tsv) |>
  setNames(basename(f)) |>
  # Apply the function `nrow` knowing we will return an integer with length(1)
  vapply(nrow, integer(1))
  AHR.poisson.n10.benchmark.tsv  AHR.poisson.n100.benchmark.tsv 
                              1                               1 
AHR.poisson.n1000.benchmark.tsv  AHR.poisson.n250.benchmark.tsv 
                              1                               1 
  AHR.poisson.n50.benchmark.tsv  AHR.poisson.n500.benchmark.tsv 
                              1                               1 

The Package purrr

  • The tidyverse package purrr reimplements these using map()
  • The idea is we map an input to an output
    • map() mostly replicates lapply()
f |> 
  setNames(basename(f)) |>
  map(read_tsv)
$AHR.poisson.n10.benchmark.tsv
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  11.2 00'11"    4172.  17559.    829.   1194.     0      0      385.     49.8

$AHR.poisson.n100.benchmark.tsv
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  65.2 01'05"    4950.  18337.   3986.   4086.     0      0      496.     329.

$AHR.poisson.n1000.benchmark.tsv
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  594. 09'54"   15535.  28940.   9731.  10368.     0      0      700.    4382.

$AHR.poisson.n250.benchmark.tsv
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  154. 02'34"    6462.  19867.   4845.   5017.     0      0      667.    1034.

$AHR.poisson.n50.benchmark.tsv
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  35.1 00'35"    4435.  17841.    914.   1264.     0      0      258.     94.6

$AHR.poisson.n500.benchmark.tsv
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  300. 04'59"    9814.  23201.   6805.   7132.     0      0      706.    2121.

The Package purrr

  • map_int() is like vapply() setting integer(1) as the output
  • Same for map_dbl(), map_chr(), map_lgl()
  • map_dfr() is a little different \(\implies\) will perform the bind_rows() operation
f |> 
  setNames(basename(f)) |>
  map_dfr(read_tsv, .id = "file")
# A tibble: 6 × 11
  file          s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load
  <chr>     <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>
1 AHR.pois…  11.2 00'11"    4172.  17559.    829.   1194.     0      0      385.
2 AHR.pois…  65.2 01'05"    4950.  18337.   3986.   4086.     0      0      496.
3 AHR.pois… 594.  09'54"   15535.  28940.   9731.  10368.     0      0      700.
4 AHR.pois… 154.  02'34"    6462.  19867.   4845.   5017.     0      0      667.
5 AHR.pois…  35.1 00'35"    4435.  17841.    914.   1264.     0      0      258.
6 AHR.pois… 300.  04'59"    9814.  23201.   6805.   7132.     0      0      706.
# ℹ 1 more variable: cpu_time <dbl>

List Columns Within Data Frames

Back To The Penguins

penguins |>
  summarise(
    lm = list(
      lm(bill_length_mm ~ body_mass_g)
    ),
    .by = c(species)
  ) 
# A tibble: 3 × 2
  species   lm    
  <fct>     <list>
1 Adelie    <lm>  
2 Gentoo    <lm>  
3 Chinstrap <lm>  
  • We might now have some ideas about this
  • There are multiple ways to get where we’re going

Back To The Penguins

penguins |>
  summarise(
    lm = list(lm(bill_length_mm ~ body_mass_g)),
    .by = c(species)
  ) |>
  mutate(
    ## Create a summary object for each linear model
    summary = lapply(lm, summary),
    ## Extract the body mass coefficient from each linear model
    slope = map_dbl(lm, \(x) coef(x)["body_mass_g"]),
    ## Extract the std.error for body mass
    se = map_dbl(
      summary, 
      \(x) coefficients(x)["body_mass_g", "Std. Error"]
    )
  )
# A tibble: 3 × 5
  species   lm     summary      slope       se
  <fct>     <list> <list>       <dbl>    <dbl>
1 Adelie    <lm>   <smmry.lm> 0.00319 0.000398
2 Gentoo    <lm>   <smmry.lm> 0.00409 0.000413
3 Chinstrap <lm>   <smmry.lm> 0.00446 0.000918

Back To The Penguins

  • An alternative might be to split the original data.frame
penguins |>
  ## Split by the factor levels in the species column
  split(f = penguins$species) |>
  ## Fit the linear model on each of the split data frames
  map(\(x) lm(bill_length_mm ~ body_mass_g, data = x)) |>
  ## Use broom::tidy as a convenience function
  map(broom::tidy, conf.int = TRUE)

A Challenge

  • From here we’d need bind_rows() then dplyr::filter()
    • See if you can figure it out
    • Make a barplot with standard error bars
    • Make an alterative figure, showing the confidence intervals
      \(\implies\) show the slope as a point and use geom_errorbar_h()