Functions and Iteration

RAdelaide 2025

Dr Stevie Pederson

Black Ochre Data Labs
Telethon Kids Institute

July 10, 2025

Functions

Functions

  • Now familiar with using functions
  • Writing our own functions is an everyday skill in R
  • Sometimes complex, but often very simple
  • Mostly “inline” functions for simple data manipulation
    • Very common for axis labels in ggplot()
    • Required for across() in dplyr
library(tidyverse)
library(palmerpenguins)

Using rename_with()

  • dplyr allows you to rename columns of a data.frame using rename_with()
  • Requires a function
penguins |> 
  rename_with(str_to_title)
# A tibble: 344 × 8
   Species Island    Bill_length_mm Bill_depth_mm Flipper_length_mm Body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: Sex <fct>, Year <int>
  • How could we replace the underscores with a space and return everything in Title Case?

Using across()

  • Sometimes we wish to perform an identical operation across multiple columns
    • Find the max, min, mean, sd etc
    • Format in a similar way
  • The function across() is very powerful for this type of operation
  • Demonstrate using RA Fisher’s “iris” data
    • Four variables measured for three species of iris
## Preview the data.frame called 'iris'
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Using across()

  • We can easily find the mean of each numeric column
    • Noting that the names all finish with ‘th’ \(\implies\) use ends_with()
iris |>
  as_tibble() |>
  summarise(
    ## We specify the columns using tidy syntax, then pass a function
    across(.cols = ends_with("th"), .fns = mean),
    .by = Species
  )
# A tibble: 3 × 5
  Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>             <dbl>       <dbl>        <dbl>       <dbl>
1 setosa             5.01        3.43         1.46       0.246
2 versicolor         5.94        2.77         4.26       1.33 
3 virginica          6.59        2.97         5.55       2.03 

Using across()

  • We can actually apply multiple functions by passing a named list
    • Functions are just R objects
iris |>
  as_tibble() |>
  summarise(
    ## Specify the columns using tidy syntax, then pass a named list of functions
    across(.cols = ends_with("th"), .fns = list(mn = mean, sd = sd)),
    .by = Species
  )
# A tibble: 3 × 9
  Species    Sepal.Length_mn Sepal.Length_sd Sepal.Width_mn Sepal.Width_sd
  <fct>                <dbl>           <dbl>          <dbl>          <dbl>
1 setosa                5.01           0.352           3.43          0.379
2 versicolor            5.94           0.516           2.77          0.314
3 virginica             6.59           0.636           2.97          0.322
# ℹ 4 more variables: Petal.Length_mn <dbl>, Petal.Length_sd <dbl>,
#   Petal.Width_mn <dbl>, Petal.Width_sd <dbl>
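
The idea that functions are just R objects can also be seen outside of across(). The following is a minimal base R sketch; the list name fns and the use of sapply() are illustrative choices only.

```r
## Functions can be stored in a named list, just like any other R object
fns <- list(mn = mean, sd = sd)
class(fns$mn)

## Apply each stored function to the same column of 'iris'
## (the list name 'fns' is just an illustrative choice)
res <- sapply(fns, \(f) f(iris$Sepal.Length))
res
```

The names of the list (mn & sd) are carried through to the result, which is the same mechanism across() uses to build the column suffixes.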

Using across()

  • We could easily wrangle this using some pivot_*() functions
iris |>
  as_tibble() |>
  summarise(
    across(.cols = ends_with("th"), .fns = list(mn = mean, sd = sd)),
    .by = Species
  ) |>
  pivot_longer(cols = contains("_")) |>
  separate(name, into = c("feature", "stat"), sep = "_") |>
  pivot_wider(names_from = stat, values_from = value) 
# A tibble: 12 × 4
   Species    feature         mn    sd
   <fct>      <chr>        <dbl> <dbl>
 1 setosa     Sepal.Length 5.01  0.352
 2 setosa     Sepal.Width  3.43  0.379
 3 setosa     Petal.Length 1.46  0.174
 4 setosa     Petal.Width  0.246 0.105
 5 versicolor Sepal.Length 5.94  0.516
 6 versicolor Sepal.Width  2.77  0.314
 7 versicolor Petal.Length 4.26  0.470
 8 versicolor Petal.Width  1.33  0.198
 9 virginica  Sepal.Length 6.59  0.636
10 virginica  Sepal.Width  2.97  0.322
11 virginica  Petal.Length 5.55  0.552
12 virginica  Petal.Width  2.03  0.275

Using across()

  • Applying this to the penguins dataset is not so easy
  • Missing values will cause mean() (& sd()) to produce NA
    • NA values may occur in different rows for different columns
    • Removing rows may not be suitable
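
Before deciding how to handle missing values, it can help to count them in each column. This is only a sketch: it assumes the dplyr and palmerpenguins packages are installed, and uses a short inline function written with \(x) inside across().

```r
library(dplyr)
library(palmerpenguins)

## Count the missing values in every column
## Calling sum() on a logical vector counts the TRUE values
na_counts <- penguins |>
  summarise(
    across(.cols = everything(), .fns = \(x) sum(is.na(x)))
  )
na_counts
```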

Checking Missing Values

  • if_any() and if_all() are similar to across(), but apply logical tests
  • Can also take a list of functions
## Find all the missing values in the dataset
penguins |>
  as_tibble() |>
  dplyr::filter(
    ## if_any() is like a version of across, but performing logical tests
    if_any(.cols = everything(), .fns = is.na)
  )
# A tibble: 11 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           NA            NA                  NA          NA
 2 Adelie  Torgersen           34.1          18.1               193        3475
 3 Adelie  Torgersen           42            20.2               190        4250
 4 Adelie  Torgersen           37.8          17.1               186        3300
 5 Adelie  Torgersen           37.8          17.3               180        3700
 6 Adelie  Dream               37.5          18.9               179        2975
 7 Gentoo  Biscoe              44.5          14.3               216        4100
 8 Gentoo  Biscoe              46.2          14.4               214        4650
 9 Gentoo  Biscoe              47.3          13.8               216        4725
10 Gentoo  Biscoe              44.5          15.7               217        4875
11 Gentoo  Biscoe              NA            NA                  NA          NA
# ℹ 2 more variables: sex <fct>, year <int>
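
As a contrast, if_all() keeps a row only when the test is TRUE for every selected column. The sketch below assumes we only care about the four measurement columns, selected here by their suffixes; that choice of columns is made purely for illustration.

```r
library(dplyr)
library(palmerpenguins)

## Keep rows where *all* of the measurement columns are missing
all_missing <- penguins |>
  dplyr::filter(
    if_all(.cols = ends_with(c("_mm", "_g")), .fns = is.na)
  )
nrow(all_missing)
```

This should return far fewer rows than if_any(), given that most rows above are only missing a value for sex.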

Trying To Use across() With penguins

  • There are no missing values for Chinstrap \(\implies\) mean is returned
    • NA values for the other species
penguins |>
  as_tibble() |>
  summarise(
    ## Select all numeric columns using `where()`
    ## This applies a logical test to each column & selects it if TRUE
    across(where(is.numeric), mean), .by = species
  )
# A tibble: 3 × 6
  species   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
  <fct>              <dbl>         <dbl>             <dbl>       <dbl> <dbl>
1 Adelie              NA            NA                 NA          NA  2008.
2 Gentoo              NA            NA                 NA          NA  2008.
3 Chinstrap           48.8          18.4              196.       3733. 2008.

Trying To Use across() With penguins

  • We know that mean() can take the argument na.rm = TRUE
  • How can we pass that to mean here?
  • We write an inline function using \(x)
penguins |>
  as_tibble() |>
  summarise(
    across(where(is.numeric), \(x) mean(x, na.rm = TRUE)), 
    .by = species
  )
# A tibble: 3 × 6
  species   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
  <fct>              <dbl>         <dbl>             <dbl>       <dbl> <dbl>
1 Adelie              38.8          18.3              190.       3701. 2008.
2 Gentoo              47.5          15.0              217.       5076. 2008.
3 Chinstrap           48.8          18.4              196.       3733. 2008.
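
Inline functions can also be combined with the named-list approach used earlier for iris. As a sketch under that assumption (the names mn & sd simply follow the earlier slide):

```r
library(dplyr)
library(palmerpenguins)

penguin_stats <- penguins |>
  summarise(
    ## Each element of the list is an inline function which ignores missing values
    across(
      .cols = where(is.numeric),
      .fns = list(
        mn = \(x) mean(x, na.rm = TRUE),
        sd = \(x) sd(x, na.rm = TRUE)
      )
    ),
    .by = species
  )
penguin_stats
```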

Inline Functions

  • This is an everyday process in R
    • Similar to above
    • Modifying labels in plots
    • Modifying factor levels
  • We need to first learn about functions a bit more
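
The two uses listed above might look something like the following sketch. It assumes the forcats, stringr and ggplot2 packages are installed, and uses the built-in iris data purely for convenience.

```r
library(ggplot2)
library(forcats)
library(stringr)

## Modifying factor levels: capitalise the species names
species <- fct_relabel(iris$Species, \(x) str_to_title(x))
levels(species)

## Modifying labels in plots: the same style of inline function
## is passed to the `labels` argument of a scale
p <- ggplot(iris, aes(Species, Sepal.Length)) +
  geom_boxplot() +
  scale_x_discrete(labels = \(x) str_to_title(x))
p
```

In both cases the inline function receives a character vector (the levels, or the axis breaks) and returns a modified character vector of the same length.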

How Functions Are Defined

Functions have three key components

  1. The arguments, also known as the formals()
  2. The code that is executed, known as the body()
  3. Their own environment
    • When we pass data to a function it is renamed internally
    • Everything is executed in a separate environment to the Global Environment
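
All three components can be inspected directly in base R. The function add_to() below is hypothetical and written only for illustration.

```r
## A small function written only for illustration
add_to <- function(x, by = 1) {
  x + by
}

## 1. The arguments
names(formals(add_to))
## 2. The code that is executed
body(add_to)
## 3. The environment in which the function was defined
environment(add_to)
```

Note that environment() reports where the function was defined. The temporary environment used while the function is running is created fresh each time it is called.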

Function Arguments

  • The function sd() is a beautifully simple one
  • Check the help page: ?sd
  • The arguments are:
    • x: a numeric vector
    • na.rm: a logical value
formals(sd)
$x


$na.rm
[1] FALSE

Function Arguments

  • Notice that the default value for na.rm is visible, but x is empty
    • We need to provide x
  • Any data we pass to sd is passed to the internal environment as x
    • Doesn’t change in the Global Environment
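
A quick way to convince yourself of this is sketched below. The function centre() and the example values are made up for illustration.

```r
## A vector in the Global Environment
x <- c(2, 4, 6, 8)

## Inside this (hypothetical) function, 'x' is overwritten
centre <- function(x) {
  x <- x - mean(x)
  x
}

## The function returns the modified values
centred <- centre(x)
centred
## The original vector has not been modified
x
```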

The Function Body

  • We can look at the code executed by a function by calling body()
body(sd)
sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), 
    na.rm = na.rm))

The Function Body

  • To reformat that & make it look nicer:
## If x is not a factor or vector, try coercing x to being a double
## If not possible, the function will error by default here
if (!(is.vector(x) || is.factor(x))) {
  x <- as.double(x)
}
## Now we have x in a suitable type of vector, find the square root of the variance
sqrt(var(x, na.rm = na.rm))
  • Any vector we pass as x can be manipulated as x inside the function’s environment
  • No changes made to the original vector in the Global Environment

Writing Our Own Functions

Writing Our Own Function

  • Before we write a brief inline function \(\rightarrow\) let’s write a more formal one
  • We’ll take a vector and transform everything to a \(Z\)-score
  • First we decide on the function name: z_score
    • Just like a standard R object
  • The contents of the R object are some R code
z_score <- function(x, na.rm = FALSE) {
  ## The key elements we need for a Z-score are the mean & SD of a vector
  mn <- mean(x, na.rm = na.rm)
  sd <- sd(x, na.rm = na.rm)
  ## To calculate the z-score we subtract the mean, then divide by the SD
  ## The last line executed is what the function returns
  (x - mn) / sd # No need to assign this internally to an object
}

Writing Our Own Function

  • To run this function, we simply pass a vector to it
  • Because it’s a function, the vector will go inside the brackets
## Create a vector
some_num <- c(1:10, 20)
## Now calculate the z-scores
z_score(some_num)
 [1] -1.11224480 -0.92107772 -0.72991065 -0.53874357 -0.34757650 -0.15640942
 [7]  0.03475765  0.22592472  0.41709180  0.60825887  2.51992962
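
Because na.rm defaults to FALSE, a single missing value will propagate through the whole result. The sketch below repeats the function definition so it can be run on its own; the vector with_na is made up for illustration.

```r
z_score <- function(x, na.rm = FALSE) {
  mn <- mean(x, na.rm = na.rm)
  sd <- sd(x, na.rm = na.rm)
  (x - mn) / sd
}

## The same values as before, with one missing value added
with_na <- c(1:10, 20, NA)

## With the default, the mean & SD are both NA, so every Z-score is NA
z_score(with_na)
## Setting na.rm = TRUE ignores the NA when finding the mean & SD
## The missing value itself still returns NA
z_score(with_na, na.rm = TRUE)
```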

Writing Our Own Function

  • Let’s look inside the function using browser()
  • This will pause execution of the function, allowing us to see inside the function’s environment
  • You might feel like RStudio has gone a bit weird
z_score <- function(x, na.rm = FALSE) {
  browser() # Pause execution as soon as we call the function
  ## The key elements we need for a Z-score are the mean & SD of a vector
  mn <- mean(x, na.rm = na.rm)
  sd <- sd(x, na.rm = na.rm)
  ## To calculate the z-score we subtract the mean, then divide by the SD
  ## The last line executed is what the function returns
  (x - mn) / sd # No need to assign this internally to an object
}
z_score(some_num)

Writing Our Own Function

  • Notice in the Environment Tab, we’re now inside z_score()
  • The only values are na.rm and x
  • The Console should display Browse[1]>
  • We can check the contents of the function environment by typing ls()
    • The only objects are na.rm and x
  • Type x and see what you get
    • It should be the same as some_num

Writing Our Own Function

  1. Copy & paste the first line of the function into your Console
  2. Type ls()
    • There should now be an object mn
    • This exists only within the function’s environment
  3. Repeat for the next line
    • There should now be an object sd
  4. Execute the last line
    • This is what the function returns
  • Type Q to exit the browser & return to the Global Environment
  • mn and sd no longer exist, and some_num is unchanged
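
If browser() is awkward to use (e.g. in a non-interactive session), a rough alternative is to record the output of ls() from inside a function. The helper peek_inside() below is hypothetical and exists only to show how the function's environment fills up as each line is executed.

```r
peek_inside <- function(x, na.rm = FALSE) {
  ## At this point only the arguments exist
  at_start <- ls()
  mn <- mean(x, na.rm = na.rm)
  ## Now 'mn' (and 'at_start') have been created inside the function
  after_mean <- ls()
  list(at_start = at_start, after_mean = after_mean)
}

res <- peek_inside(c(1:10, 20))
res
```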

Writing Our Own Function

  • Since R v4.1 the shorthand \(x) is the same as function(x)
    • Much faster for lazy people
    • Also cleaner for inline functions
  • Also note how RStudio managed your indentation
  • Everything inside the function was given 2 (or 4) spaces
  • This makes it clear where the code is being executed when you read it
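
A small sketch comparing the two ways of writing a function, assuming a version of R which supports the shorthand. The function names are made up for illustration.

```r
## The long form and the shorthand define equivalent functions
square_long <- function(x) x^2
square_short <- \(x) x^2
identical(square_long(1:5), square_short(1:5))

## The shorthand is most useful when a function is only needed once
## and never needs to be given a name
sapply(1:3, \(i) i * 10)
```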

The Ellipsis (...)

  • R has a distinctive feature using the syntax ...
  • You may have seen this in multiple help pages
  • Allows arguments to be passed internally to functions without being defined
    • Makes it very powerful but a little dangerous
  • Check the help page for mean()

The Ellipsis (...)

  • Let’s add this to our function
z_score <- function(x, na.rm = FALSE, ...) {
  ## The key elements we need for a Z-score are the mean & SD of a vector
  ## Include the ellipsis here for any additional arguments
  mn <- mean(x, na.rm = na.rm, ...)
  sd <- sd(x, na.rm = na.rm)
  ## To calculate the z-score we subtract the mean, then divide by the SD
  ## The last line executed is what the function returns
  (x - mn) / sd # No need to assign this internally to an object
}
z_score(some_num)
 [1] -1.11224480 -0.92107772 -0.72991065 -0.53874357 -0.34757650 -0.15640942
 [7]  0.03475765  0.22592472  0.41709180  0.60825887  2.51992962

The Ellipsis (...)

  • We know that mean can take an argument trim
  • Let’s see what happens
z_score(some_num, trim = 0.1)
 [1] -0.9558354 -0.7646683 -0.5735012 -0.3823341 -0.1911671  0.0000000
 [7]  0.1911671  0.3823341  0.5735012  0.7646683  2.6763390
  • This argument was passed to mean() and the outermost 10% of observations at each end were excluded (here, the lowest & highest values)
  • What might’ve happened if we’d passed this to sd() internally?
    • An error!!! sd() can’t take an argument called trim
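
To confirm this without halting a script, the call can be wrapped in tryCatch(). The function z_score_both() is a hypothetical variant which passes ... to both mean() and sd().

```r
## A hypothetical variant which passes ... to both mean() and sd()
z_score_both <- function(x, na.rm = FALSE, ...) {
  mn <- mean(x, na.rm = na.rm, ...)
  sd <- sd(x, na.rm = na.rm, ...)
  (x - mn) / sd
}

## Catch the error & return the message instead of stopping
res <- tryCatch(
  z_score_both(c(1:10, 20), trim = 0.1),
  error = \(e) conditionMessage(e)
)
res
```

Without any extra arguments, z_score_both() behaves exactly like z_score(), so the problem only appears when something unexpected is passed through the ellipsis.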

Closing Comments

S3 Method Dispatch

  • The most common class system in R is S3
  • Can make looking inside functions frustrating
  • Look inside the function mean using body(mean)
    • UseMethod("mean")
  • This relies on the idea that multiple versions of mean exist
  • Have been defined for objects of different classes

S3 Method Dispatch

  • To see all of the versions of mean that exist
methods(mean)
[1] mean.Date        mean.default     mean.difftime    mean.POSIXct    
[5] mean.POSIXlt     mean.quosure*    mean.vctrs_vctr*
see '?methods' for accessing help and source code
  • The class is listed after mean.
    • An asterisk means the method is not exported from its package, so it is hidden from direct use 🤷
  • When mean is called on an object:
    • The class of the object is checked
    • A matching method is found if possible
    • If no method is found: use mean.default()
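
The steps above can be sketched using a generic function of our own. The generic describe() and its methods are hypothetical, and written only to show how dispatch works.

```r
## A hypothetical generic function: UseMethod() performs the dispatch
describe <- function(x, ...) UseMethod("describe")

## Methods are named <generic>.<class>
describe.numeric <- function(x, ...) "a numeric vector"
describe.factor <- function(x, ...) "a factor"
## The fallback when no other method matches
describe.default <- function(x, ...) "something else"

describe(c(1.5, 2))      # Dispatches to describe.numeric()
describe(iris$Species)   # Dispatches to describe.factor()
describe("a")            # No character method, so describe.default() is used
```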

S3 Method Dispatch

  • For numeric vectors, mean.default() will be called
## Look inside the default function
body(mean.default)
  • See if you can follow what’s happening
  • A bunch of checks are performed
  • The length is found
  • Trimming is performed if requested
  • .Internal(mean(x)) is called
    • .Internal means the function is built right into the core R code
    • Not for hacks like us \(\implies\) only for R Core
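
Two of these steps can be checked using only base R: trimming as performed by mean.default(), and a different method being chosen for a different class. This is a sketch only, and the dates are made up for illustration.

```r
some_num <- c(1:10, 20)

## Calling the generic on a numeric vector ends up at mean.default()
mean(some_num) == mean.default(some_num)

## With trim = 0.1, the lowest & highest values are dropped first
## (11 * 0.1 is rounded down to 1 value from each end)
mean(some_num, trim = 0.1) == mean(2:10)

## A Date vector has its own method, so the result is still a Date
dates <- as.Date(c("2025-07-08", "2025-07-10"))
class(mean(dates))
```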

S3 Method Dispatch

  • Many functions operate like this
methods(print)
methods(summary)

Challenge

  • Try creating an inline function to rename penguins in Title Case
  • You’ll need to
    1. remove underscores & replace with spaces
    2. convert to title case
    3. decide what to do with mm (or Mm)
## As a starting hint
penguins |> 
  rename_with(
    \(x) {
      str_to_title(x)
    }
  )
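
One possible approach is sketched below, assuming we choose to show the units in brackets as (mm) and (g). That choice, and the helper name tidy_names, are illustrative only. Testing the function on a plain character vector first makes it easy to check before passing it to rename_with().

```r
library(stringr)

tidy_names <- \(x) {
  x |>
    str_replace_all("_", " ") |>      # 1. underscores become spaces
    str_to_title() |>                 # 2. convert to Title Case
    str_replace(" Mm$", " (mm)") |>   # 3. put the units back in lower case
    str_replace(" G$", " (g)")
}

## Check the function using a few of the column names
tidy_names(c("species", "bill_length_mm", "body_mass_g"))

## Then use it in place of the starting hint, e.g.
## penguins |> rename_with(tidy_names)
```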