Iteration And Lists

RAdelaide 2025

Dr Stevie Pederson

Black Ochre Data Labs
Telethon Kids Institute

July 10, 2025

Iteration

library(tidyverse)
library(palmerpenguins)
library(ggpmisc)
theme_set(theme_bw())

Introducing Iteration

R is explicitly designed to work with vectors
Where it’s elegance, power and speed comes from
We can avoid stepping through each value like many other languages

BUT

If we have a vector of file paths, how would we load them all?
If we have a list of linear models, how do we deal with this?
If we have multiple cell types subjected to the same experimental treatment, how do we combine and compare results?

Introducing Iteration

What if we fitted a linear model within each penguin species?
- Aiming to make a barplot comparing slopes including standard errors
We can do this kind of thing with summarise

penguins |>
  summarise(
    lm = list(
      lm(bill_length_mm ~ body_mass_g)
    ),
    .by = c(species)
  )

# A tibble: 3 × 2
  species   lm    
  <fct>     <list>
1 Adelie    <lm>  
2 Gentoo    <lm>  
3 Chinstrap <lm>

By wrapping the function lm as a list \(\implies\) a list column
Each column of a data.frame is just a vector \(\implies\) can also be a list

But how do we get the slopes & standard errors now?

Introducing Iteration

The obvious answer is we do one at a time, then move to the next
\(\implies\)This is iteration
R has special tricks for this which speed things up profoundly
Lists become the absolute workhorse for this type of approach

Let’s start the slow way and speed things up

Stepping Through A Vector

In the directory data/benchmarks are 6 very similar files
Running a parameter sweep on phoenix (UofA HPC)
- Simulated DNA sequences with AHR motifs
- Fitting a poisson model to test for enrichment relative to control
- Changing the size of the control dataset (n = 10, 50, 100, 250, 500, 100)
- Reporting the resource usage \(\implies\) a simple 2-line file

f <- here::here("data", "benchmarks") |> 
  list.files(pattern = "tsv$", full.names = TRUE)
f

[1] "/home/stevie/TKI/RAdelaide25/data/benchmarks/AHR.poisson.n10.benchmark.tsv"  
[2] "/home/stevie/TKI/RAdelaide25/data/benchmarks/AHR.poisson.n100.benchmark.tsv" 
[3] "/home/stevie/TKI/RAdelaide25/data/benchmarks/AHR.poisson.n1000.benchmark.tsv"
[4] "/home/stevie/TKI/RAdelaide25/data/benchmarks/AHR.poisson.n250.benchmark.tsv" 
[5] "/home/stevie/TKI/RAdelaide25/data/benchmarks/AHR.poisson.n50.benchmark.tsv"  
[6] "/home/stevie/TKI/RAdelaide25/data/benchmarks/AHR.poisson.n500.benchmark.tsv"

Stepping Through A Vector

Looking at the first file

read_tsv(f[[1]])

# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  11.2 00'11"    4172.  17559.    829.   1194.     0      0      385.     49.8

Columns represent time taken, various RAM measures, I/O etc
Is standard output from snakemake when you benchmark a rule

Stepping Through A Vector

The simplest way would be to use a for loop
- seq_along() will create a sequence of integers along a vector
- Using for (i in seq_along(f)) will step through the vector one at a time

seq_along(f)

[1] 1 2 3 4 5 6

for (i in seq_along(f)) {
    message("For i = ", i, ", open ", f[[i]])
}

For i = 1, open /home/stevie/TKI/RAdelaide25/data/benchmarks/AHR.poisson.n10.benchmark.tsv

For i = 2, open /home/stevie/TKI/RAdelaide25/data/benchmarks/AHR.poisson.n100.benchmark.tsv

For i = 3, open /home/stevie/TKI/RAdelaide25/data/benchmarks/AHR.poisson.n1000.benchmark.tsv

For i = 4, open /home/stevie/TKI/RAdelaide25/data/benchmarks/AHR.poisson.n250.benchmark.tsv

For i = 5, open /home/stevie/TKI/RAdelaide25/data/benchmarks/AHR.poisson.n50.benchmark.tsv

For i = 6, open /home/stevie/TKI/RAdelaide25/data/benchmarks/AHR.poisson.n500.benchmark.tsv

Stepping Through A Vector

Now we know how a for loop works
\(\implies\) What kind of object should we place each file into

My suggestion would be a list
We can create an empty list using list()
- Then we can just add elements

df_list <- list()
for (i in seq_along(f)) {
  ## Step through the vector, loading each file one at a time
  df_list[[i]] <- read_tsv(f[[i]])
}

Stepping Through A Vector

It might be useful to add names to this \(\implies\) identify which file is which

## Give each list element a name, dropping the bulk of the file path
names(df_list) <- basename(f)
df_list

$AHR.poisson.n10.benchmark.tsv
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  11.2 00'11"    4172.  17559.    829.   1194.     0      0      385.     49.8

$AHR.poisson.n100.benchmark.tsv
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  65.2 01'05"    4950.  18337.   3986.   4086.     0      0      496.     329.

$AHR.poisson.n1000.benchmark.tsv
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  594. 09'54"   15535.  28940.   9731.  10368.     0      0      700.    4382.

$AHR.poisson.n250.benchmark.tsv
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  154. 02'34"    6462.  19867.   4845.   5017.     0      0      667.    1034.

$AHR.poisson.n50.benchmark.tsv
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  35.1 00'35"    4435.  17841.    914.   1264.     0      0      258.     94.6

$AHR.poisson.n500.benchmark.tsv
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  300. 04'59"    9814.  23201.   6805.   7132.     0      0      706.    2121.

Working With Lists

We can use bind_rows() to form a single tibble
- Including the .id argument will add list names to a column

df_list |> bind_rows(.id = "file")

# A tibble: 6 × 11
  file          s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load
  <chr>     <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>
1 AHR.pois…  11.2 00'11"    4172.  17559.    829.   1194.     0      0      385.
2 AHR.pois…  65.2 01'05"    4950.  18337.   3986.   4086.     0      0      496.
3 AHR.pois… 594.  09'54"   15535.  28940.   9731.  10368.     0      0      700.
4 AHR.pois… 154.  02'34"    6462.  19867.   4845.   5017.     0      0      667.
5 AHR.pois…  35.1 00'35"    4435.  17841.    914.   1264.     0      0      258.
6 AHR.pois… 300.  04'59"    9814.  23201.   6805.   7132.     0      0      706.
# ℹ 1 more variable: cpu_time <dbl>

From here, use mutate() to extract n = 10, 50, …
Make a lovely plot

Using an `R`-Style Approach

R offers an approach using lapply()
- Stands for list-apply
- We apply a function to each element of a vector
- Will always return a list

lapply(f, read_tsv)

[[1]]
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  11.2 00'11"    4172.  17559.    829.   1194.     0      0      385.     49.8

[[2]]
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  65.2 01'05"    4950.  18337.   3986.   4086.     0      0      496.     329.

[[3]]
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  594. 09'54"   15535.  28940.   9731.  10368.     0      0      700.    4382.

[[4]]
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  154. 02'34"    6462.  19867.   4845.   5017.     0      0      667.    1034.

[[5]]
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  35.1 00'35"    4435.  17841.    914.   1264.     0      0      258.     94.6

[[6]]
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  300. 04'59"    9814.  23201.   6805.   7132.     0      0      706.    2121.

Using an `R`-Style Approach

If the input vector has names \(\implies\) output will have names
Let’s repeat including names

f |>
  setNames(basename(f)) |> # Add names here to ensure the list has names
  lapply(read_tsv) |>
  bind_rows(.id = "file")

# A tibble: 6 × 11
  file          s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load
  <chr>     <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>
1 AHR.pois…  11.2 00'11"    4172.  17559.    829.   1194.     0      0      385.
2 AHR.pois…  65.2 01'05"    4950.  18337.   3986.   4086.     0      0      496.
3 AHR.pois… 594.  09'54"   15535.  28940.   9731.  10368.     0      0      700.
4 AHR.pois… 154.  02'34"    6462.  19867.   4845.   5017.     0      0      667.
5 AHR.pois…  35.1 00'35"    4435.  17841.    914.   1264.     0      0      258.
6 AHR.pois… 300.  04'59"    9814.  23201.   6805.   7132.     0      0      706.
# ℹ 1 more variable: cpu_time <dbl>

Using an `R`-Style Approach

The complete process could look like this

f |>
  setNames(basename(f)) |> # Add names here to ensure the list has names
  lapply(read_tsv) |>
  bind_rows(.id = "file") |>
  mutate(bg_size = str_extract(file, "[0-9]+") |> as.integer())

# A tibble: 6 × 12
  file          s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load
  <chr>     <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>
1 AHR.pois…  11.2 00'11"    4172.  17559.    829.   1194.     0      0      385.
2 AHR.pois…  65.2 01'05"    4950.  18337.   3986.   4086.     0      0      496.
3 AHR.pois… 594.  09'54"   15535.  28940.   9731.  10368.     0      0      700.
4 AHR.pois… 154.  02'34"    6462.  19867.   4845.   5017.     0      0      667.
5 AHR.pois…  35.1 00'35"    4435.  17841.    914.   1264.     0      0      258.
6 AHR.pois… 300.  04'59"    9814.  23201.   6805.   7132.     0      0      706.
# ℹ 2 more variables: cpu_time <dbl>, bg_size <int>

Using an `R`-Style Approach

And now make the figure we need

f |>
  setNames(basename(f)) |>
  lapply(read_tsv) |>
  bind_rows(.id = "file") |>
  mutate(
    bg_size = file |>
      str_extract("[0-9]+") |>
      as.integer()
  ) |>
  ggplot(aes(bg_size, s)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm") +
  stat_poly_eq(use_label("eq")) +
  labs(
    x = "BG Size", 
    y = "Time Taken (sec)"
  )

Beyond `lapply()`

Using `apply()`

A common scenario is to perform an operation on a matrix
- Can summarise by row or column
For this we use apply() including the MARGIN argument
- For rows: MARGIN = 1
- For columns: MARGIN = 2
Can even be applied to 3D arrays using MARGIN=3

Using `apply()`

AirPassengers is a time-series object
- Just a vector dressed up to look like a matrix
- We can coerce to a matrix as below

## Coerce the time-series object to be an actual matrix
AirPassengers |>
  matrix(ncol = 12, byrow = TRUE, dimnames = list(1949:1960, month.abb))

     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432

Using `apply()`

We could summarise by year or by month

## Start by finding the mean in every month
AirPassengers |>
  matrix(ncol = 12, byrow = TRUE, dimnames = list(1949:1960, month.abb)) |>
  apply(MARGIN = 2, FUN = mean)

     Jan      Feb      Mar      Apr      May      Jun      Jul      Aug 
241.7500 235.0000 270.1667 267.0833 271.8333 311.6667 351.3333 351.0833 
     Sep      Oct      Nov      Dec 
302.4167 266.5833 232.8333 261.8333

Now find the total number of passengers each year
- use apply() with MARGIN = 1

This is a pretty common operation
\(\implies\) rowSums(), colSums(), rowMeans() and colMeans() all exist

Alternatives to `lapply()`

A common alternative to lapply() is sapply()
- Tries to simplify by default
- Automatically uses the elements of x as names

sapply(f, read_tsv)
sapply(f, read_tsv, simplify = FALSE)

Alternatives to `lapply()`

vapply() is for when you need a vector as output
- Output type must be strictly defined

f |>
  lapply(read_tsv) |>
  setNames(basename(f)) |>
  # Apply the function `nrow` knowing we will return an integer with length(1)
  vapply(nrow, integer(1))

  AHR.poisson.n10.benchmark.tsv  AHR.poisson.n100.benchmark.tsv 
                              1                               1 
AHR.poisson.n1000.benchmark.tsv  AHR.poisson.n250.benchmark.tsv 
                              1                               1 
  AHR.poisson.n50.benchmark.tsv  AHR.poisson.n500.benchmark.tsv 
                              1                               1

The Package `purrr`

The tidyverse package purrr reimplements these using map()
The idea is we map an input to an output
- map() mostly replicates lapply()

f |> 
  setNames(basename(f)) |>
  map(read_tsv)

$AHR.poisson.n10.benchmark.tsv
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  11.2 00'11"    4172.  17559.    829.   1194.     0      0      385.     49.8

$AHR.poisson.n100.benchmark.tsv
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  65.2 01'05"    4950.  18337.   3986.   4086.     0      0      496.     329.

$AHR.poisson.n1000.benchmark.tsv
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  594. 09'54"   15535.  28940.   9731.  10368.     0      0      700.    4382.

$AHR.poisson.n250.benchmark.tsv
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  154. 02'34"    6462.  19867.   4845.   5017.     0      0      667.    1034.

$AHR.poisson.n50.benchmark.tsv
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  35.1 00'35"    4435.  17841.    914.   1264.     0      0      258.     94.6

$AHR.poisson.n500.benchmark.tsv
# A tibble: 1 × 10
      s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
  <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>    <dbl>
1  300. 04'59"    9814.  23201.   6805.   7132.     0      0      706.    2121.

The Package `purrr`

map_int() is like vapply() setting integer(1) as the output
Same for map_dbl(), map_chr(), map_lgl()
map_dfr() is a little different \(\implies\) will perform the bind_rows() operation

f |> 
  setNames(basename(f)) |>
  map_dfr(read_tsv, .id = "file")

# A tibble: 6 × 11
  file          s `h:m:s` max_rss max_vms max_uss max_pss io_in io_out mean_load
  <chr>     <dbl> <time>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>
1 AHR.pois…  11.2 00'11"    4172.  17559.    829.   1194.     0      0      385.
2 AHR.pois…  65.2 01'05"    4950.  18337.   3986.   4086.     0      0      496.
3 AHR.pois… 594.  09'54"   15535.  28940.   9731.  10368.     0      0      700.
4 AHR.pois… 154.  02'34"    6462.  19867.   4845.   5017.     0      0      667.
5 AHR.pois…  35.1 00'35"    4435.  17841.    914.   1264.     0      0      258.
6 AHR.pois… 300.  04'59"    9814.  23201.   6805.   7132.     0      0      706.
# ℹ 1 more variable: cpu_time <dbl>

List Columns Within Data Frames

Back To The Penguins

penguins |>
  summarise(
    lm = list(
      lm(bill_length_mm ~ body_mass_g)
    ),
    .by = c(species)
  )

# A tibble: 3 × 2
  species   lm    
  <fct>     <list>
1 Adelie    <lm>  
2 Gentoo    <lm>  
3 Chinstrap <lm>

We might now have some ideas about this
There are multiple ways to get where we’re going

Back To The Penguins

penguins |>
  summarise(
    lm = list(lm(bill_length_mm ~ body_mass_g)),
    .by = c(species)
  ) |>
  mutate(
    ## Create a summary object for each linear model
    summary = lapply(lm, summary),
    ## Extract the body mass coefficient from each linear model
    slope = map_dbl(lm, \(x) coef(x)["body_mass_g"]),
    ## Extract the std.error for body mass
    se = map_dbl(
      summary, 
      \(x) coefficients(x)["body_mass_g", "Std. Error"]
    )
  )

# A tibble: 3 × 5
  species   lm     summary      slope       se
  <fct>     <list> <list>       <dbl>    <dbl>
1 Adelie    <lm>   <smmry.lm> 0.00319 0.000398
2 Gentoo    <lm>   <smmry.lm> 0.00409 0.000413
3 Chinstrap <lm>   <smmry.lm> 0.00446 0.000918

Back To The Penguins

An alternative might be to split the original data.frame

penguins |>
  ## Split by the factor levels in the species column
  split(f = penguins$species) |>
  ## Fit the linear model on each of the split data frames
  map(\(x) lm(bill_length_mm ~ body_mass_g, data = x)) |>
  ## Use broom::tidy as a convenience function
  map(broom::tidy, conf.int = TRUE)

A Challenge

From here we’d need bind_rows() then dplyr::filter()
- See if you can figure it out
- Make a barplot with standard error bars
- Make an alterative figure, showing the confidence intervals
  \(\implies\) show the slope as a point and use geom_errorbar_h()

Iteration And Lists

Iteration

Introducing Iteration

Introducing Iteration

Introducing Iteration

Stepping Through A Vector

Stepping Through A Vector

Stepping Through A Vector

Stepping Through A Vector

Stepping Through A Vector

Working With Lists

Using an R-Style Approach

Using an R-Style Approach

Using an R-Style Approach

Using an R-Style Approach

Beyond lapply()

Using apply()

Using apply()

Using apply()

Alternatives to lapply()

Alternatives to lapply()

The Package purrr

The Package purrr

List Columns Within Data Frames

Back To The Penguins

Back To The Penguins

Back To The Penguins

A Challenge

Using an `R`-Style Approach

Using an `R`-Style Approach

Using an `R`-Style Approach

Using an `R`-Style Approach

Beyond `lapply()`

Using `apply()`

Using `apply()`

Using `apply()`

Alternatives to `lapply()`

Alternatives to `lapply()`

The Package `purrr`

The Package `purrr`