Functions and Iteration

RAdelaide 2025

Dr Stevie Pederson

Black Ochre Data Labs
Telethon Kids Institute

July 10, 2025

Functions

Now familiar with using functions
Writing our own functions is an everyday skill in R
Functions are sometimes complex, but often very simple
Mostly “inline” functions for simple data manipulation
- Very common for axis labels in ggplot()
- Required for across() in dplyr

library(tidyverse)
library(palmerpenguins)

Using `rename_with()`

dplyr allows you to rename columns of a data.frame using rename_with()
Requires a function

penguins |> 
  rename_with(str_to_title)

# A tibble: 344 × 8
   Species Island    Bill_length_mm Bill_depth_mm Flipper_length_mm Body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: Sex <fct>, Year <int>

How could we replace the underscores with a space and return everything in Title Case?

Using `across()`

Sometimes we wish to perform an identical operation across multiple columns
- Find the max, min, mean, sd etc
- Format in a similar way
The function across() is very powerful for this type of operation
Demonstrate using RA Fisher’s “iris” data
- Four variables measured for 3 species of iris

## Preview the data.frame called 'iris'
head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Using `across()`

We can easily find the mean of each numeric column
- Noting that the names all finish with ‘th’ \(\implies\) use ends_with()

iris |>
  as_tibble() |>
  summarise(
    ## We specify the columns using tidy syntax, then pass a function
    across(.cols = ends_with("th"), .fns = mean),
    .by = Species
  )

# A tibble: 3 × 5
  Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>             <dbl>       <dbl>        <dbl>       <dbl>
1 setosa             5.01        3.43         1.46       0.246
2 versicolor         5.94        2.77         4.26       1.33 
3 virginica          6.59        2.97         5.55       2.03

Using `across()`

We can actually apply multiple functions by passing a named list
- Functions are just R objects

iris |>
  as_tibble() |>
  summarise(
    ## Specify the columns using tidy syntax, then pass a named list of functions
    across(.cols = ends_with("th"), .fns = list(mn = mean, sd = sd)),
    .by = Species
  )

# A tibble: 3 × 9
  Species    Sepal.Length_mn Sepal.Length_sd Sepal.Width_mn Sepal.Width_sd
  <fct>                <dbl>           <dbl>          <dbl>          <dbl>
1 setosa                5.01           0.352           3.43          0.379
2 versicolor            5.94           0.516           2.77          0.314
3 virginica             6.59           0.636           2.97          0.322
# ℹ 4 more variables: Petal.Length_mn <dbl>, Petal.Length_sd <dbl>,
#   Petal.Width_mn <dbl>, Petal.Width_sd <dbl>
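
The idea that functions are just R objects can be sketched in base R as well. This is a minimal illustration, not part of the original slides: the name `fns` is made up here, and the list is applied by hand to a single column instead of using across().

```r
## Functions can be stored in a named list like any other R object
fns <- list(mn = mean, sd = sd)
## Every element of the list is a function
sapply(fns, is.function)
## Apply each stored function to one column of iris
## The names of the list become the names of the results
res <- sapply(fns, \(f) f(iris$Sepal.Length))
res
```

The names given in the list (mn and sd) are what across() uses as the suffixes for the new column names.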

Using `across()`

We could easily wrangle this using some pivot_*() functions

iris |>
  as_tibble() |>
  summarise(
    across(.cols = ends_with("th"), .fns = list(mn = mean, sd = sd)),
    .by = Species
  ) |>
  pivot_longer(cols = contains("_")) |>
  separate(name, into = c("feature", "stat"), sep = "_") |>
  pivot_wider(names_from = stat, values_from = value)

# A tibble: 12 × 4
   Species    feature         mn    sd
   <fct>      <chr>        <dbl> <dbl>
 1 setosa     Sepal.Length 5.01  0.352
 2 setosa     Sepal.Width  3.43  0.379
 3 setosa     Petal.Length 1.46  0.174
 4 setosa     Petal.Width  0.246 0.105
 5 versicolor Sepal.Length 5.94  0.516
 6 versicolor Sepal.Width  2.77  0.314
 7 versicolor Petal.Length 4.26  0.470
 8 versicolor Petal.Width  1.33  0.198
 9 virginica  Sepal.Length 6.59  0.636
10 virginica  Sepal.Width  2.97  0.322
11 virginica  Petal.Length 5.55  0.552
12 virginica  Petal.Width  2.03  0.275

Using `across()`

Applying this to the penguins dataset is not so easy
Missing values will cause mean() (& sd()) to produce NA
- NA values may appear differently in different columns
- Removing rows may not be suitable
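
A small sketch of the problem, using a made-up vector `x` in place of a penguins column:

```r
## A hypothetical vector with one missing value
x <- c(2, 4, NA, 6)
mean(x)                # NA, because the missing value propagates
mean(x, na.rm = TRUE)  # 4
sd(x, na.rm = TRUE)    # 2
```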

Checking Missing Values

if_any() and if_all() are similar to across(), but apply logical tests
Can also take a list of functions

## Find all the missing values in the dataset
penguins |>
  as_tibble() |>
  dplyr::filter(
    ## if_any() is like a version of across, but performing logical tests
    if_any(.cols = everything(), .fns = is.na)
  )

# A tibble: 11 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           NA            NA                  NA          NA
 2 Adelie  Torgersen           34.1          18.1               193        3475
 3 Adelie  Torgersen           42            20.2               190        4250
 4 Adelie  Torgersen           37.8          17.1               186        3300
 5 Adelie  Torgersen           37.8          17.3               180        3700
 6 Adelie  Dream               37.5          18.9               179        2975
 7 Gentoo  Biscoe              44.5          14.3               216        4100
 8 Gentoo  Biscoe              46.2          14.4               214        4650
 9 Gentoo  Biscoe              47.3          13.8               216        4725
10 Gentoo  Biscoe              44.5          15.7               217        4875
11 Gentoo  Biscoe              NA            NA                  NA          NA
# ℹ 2 more variables: sex <fct>, year <int>
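
To see how if_any() and if_all() differ, here is a minimal sketch on a tiny made-up tibble. The object `toy` and its columns are assumptions for illustration only, and dplyr is assumed to be installed.

```r
library(dplyr)

## A small hypothetical tibble, not part of the workshop data
toy <- tibble(
  a = c(1, NA, 3, NA),
  b = c("x", NA, NA, "w")
)
## Rows where at least one column is missing
any_missing <- toy |> filter(if_any(.cols = everything(), .fns = is.na))
## Rows where every column is missing
all_missing <- toy |> filter(if_all(.cols = everything(), .fns = is.na))
nrow(any_missing)  # 3
nrow(all_missing)  # 1
```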

Trying To Use `across()` With `penguins`

There are no missing values for Chinstrap \(\implies\) mean is returned
- NA values for the other species

penguins |>
  as_tibble() |>
  summarise(
    ## Select all numeric columns using `where()`
    ## This applies a logical test to each column & selects it if TRUE
    across(where(is.numeric), mean), .by = species
  )

# A tibble: 3 × 6
  species   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
  <fct>              <dbl>         <dbl>             <dbl>       <dbl> <dbl>
1 Adelie              NA            NA                 NA          NA  2008.
2 Gentoo              NA            NA                 NA          NA  2008.
3 Chinstrap           48.8          18.4              196.       3733. 2008.

Trying To Use `across()` With `penguins`

We know that mean() can take the argument na.rm = TRUE
How can we pass that to mean here?

We write an inline function using \(x)

penguins |>
  as_tibble() |>
  summarise(
    across(where(is.numeric), \(x) mean(x, na.rm = TRUE)), 
    .by = species
  )

# A tibble: 3 × 6
  species   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
  <fct>              <dbl>         <dbl>             <dbl>       <dbl> <dbl>
1 Adelie              38.8          18.3              190.       3701. 2008.
2 Gentoo              47.5          15.0              217.       5076. 2008.
3 Chinstrap           48.8          18.4              196.       3733. 2008.

Inline Functions

This is an everyday process in R
- Similar to above
- Modifying labels in plots
- Modifying factor levels
We need to first learn about functions a bit more

How Functions Are Defined

Functions have three key components

The arguments also known as the formals()
The code that is executed known as the body()
Their own environment
- When we pass data to a function it is renamed internally
- Everything is executed in a separate environment to the Global Environment
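
These components can be inspected on any function. The sketch below uses a hypothetical function `add_step()`, written only for this illustration, and checks that an object created inside the function never appears outside it.

```r
## A hypothetical function, written only for this illustration
add_step <- function(x, step = 1) {
  inner_only <- x + step  # created in the function's own environment
  inner_only
}
names(formals(add_step))  # the argument names: "x" "step"
body(add_step)            # the code that is executed
add_step(1:3)             # 2 3 4
exists("inner_only")      # FALSE: the object never reached the calling environment
```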

Function Arguments

The function sd() is a beautifully simple one
Check the help page: ?sd
The arguments are:
- x: a numeric vector
- na.rm: a logical value

formals(sd)

$x


$na.rm
[1] FALSE

Function Arguments

Notice that the default value for na.rm is visible, but x is empty
- We need to provide x
Any data we pass to sd is passed to the internal environment as x
- Doesn’t change in the Global Environment
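
The same pattern of one required argument and one default can be sketched with a hypothetical function `spread()`. This function is an assumption for illustration and is not part of base R.

```r
## A hypothetical function with one required argument and one default
spread <- function(x, na.rm = FALSE) {
  diff(range(x, na.rm = na.rm))
}
formals(spread)$na.rm              # FALSE, the visible default
spread(c(3, 8, 5))                 # 5
spread(c(3, NA, 8), na.rm = TRUE)  # 5
## Leaving out x causes an error, which we capture as a message
msg <- tryCatch(spread(), error = \(e) conditionMessage(e))
msg
```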

The Function Body

We can look at the code executed by a function by calling body()

body(sd)

sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), 
    na.rm = na.rm))

The Function Body

To reformat that & make it look nicer:

## If x is not a vector or a factor, coerce x to a double
## If coercion is not possible, the function will error here
if (!(is.vector(x) || is.factor(x))) {
  x <- as.double(x)
}
## Now that x is a suitable type of vector, find the square root of the variance
sqrt(var(x, na.rm = na.rm))

Any vector we pass as x can be manipulated as x inside the function’s environment
No changes made to the original vector in the Global Environment
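
A minimal sketch of this behaviour, using a hypothetical function `double_it()` and a made-up vector `my_vec`:

```r
## A hypothetical function that overwrites x internally
double_it <- function(x) {
  x <- x * 2  # only the copy inside the function is modified
  x
}
my_vec <- c(1, 5, 10)
doubled <- double_it(my_vec)
doubled  # 2 10 20
my_vec   # 1 5 10, unchanged in the calling environment
```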

Writing Our Own Functions

Writing Our Own Function

Before we write a brief inline function, let’s write a more formal one
We’ll take a vector and transform everything to a \(Z\)-score
First we decide on the function name: z_score
- Just like a standard R object
The contents of the R object are some R code

z_score <- function(x, na.rm = FALSE) {
  ## The key elements we need for a Z-score are the mean & SD of a vector
  mn <- mean(x, na.rm = na.rm)
  sd <- sd(x, na.rm = na.rm)
  ## To calculate the z-score we subtract the mean, then divide by the SD
  ## The last line executed is what the function returns
  (x - mn) / sd # No need to assign this internally to an object
}

Writing Our Own Function

To run this function, we simply pass a vector to it
Because it’s a function, the vector will go inside the brackets

## Create a vector
some_num <- c(1:10, 20)
## Now calculate the z-scores
z_score(some_num)

 [1] -1.11224480 -0.92107772 -0.72991065 -0.53874357 -0.34757650 -0.15640942
 [7]  0.03475765  0.22592472  0.41709180  0.60825887  2.51992962
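
One way to sanity-check a function like this is to compare it against something already in R. The sketch below repeats the definition so it runs on its own, then compares the result with the base R function scale(), which centres and scales by the mean and SD by default. The comparison is an addition here, not part of the original slides.

```r
z_score <- function(x, na.rm = FALSE) {
  mn <- mean(x, na.rm = na.rm)
  sd <- sd(x, na.rm = na.rm)
  (x - mn) / sd
}
some_num <- c(1:10, 20)
z <- z_score(some_num)
## Z-scores should have a mean of (almost exactly) zero and an SD of one
all.equal(mean(z), 0)
all.equal(sd(z), 1)
## scale() returns a matrix, so convert it to a plain vector before comparing
all.equal(z, as.numeric(scale(some_num)))
```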

Writing Our Own Function

Let’s look inside the function using browser()
This will pause execution of the function, allowing us to see inside the function’s environment
You might feel like RStudio has gone a bit weird

z_score <- function(x, na.rm = FALSE) {
  browser() # Pause execution as soon as we call the function
  ## The key elements we need for a Z-score are the mean & SD of a vector
  mn <- mean(x, na.rm = na.rm)
  sd <- sd(x, na.rm = na.rm)
  ## To calculate the z-score we subtract the mean, then divide by the SD
  ## The last line executed is what the function returns
  (x - mn) / sd # No need to assign this internally to an object
}
z_score(some_num)

Writing Our Own Function

Notice in the Environment Tab, we’re now inside z_score()
The only values are na.rm and x

The Console should display Browse[1]>
We can check the contents of the function environment by typing ls()
- The only objects are na.rm and x

Type x and see what you get
- It should be the same as some_num

Writing Our Own Function

Copy & paste the first line of the function into your Console
Type ls()
- There should now be an object mn
- This exists only within the function’s environment
Repeat for the next line
- There should now be an object sd
Execute the last line
- This is what the function returns

Type Q to exit the browser & return to the Global Environment
The objects mn and sd no longer exist, and some_num is unchanged

Writing Our Own Function

Since R v4.1 the shorthand \(x) is the same as function(x)
- Much faster for lazy people
- Also cleaner for inline functions

Also note how RStudio managed your indentation
Everything inside the function was given 2 (or 4) spaces
This makes it clear where the code is being executed when you read it
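
A quick sketch showing the two forms side by side, assuming R 4.1.0 or later. The names `square_long` and `square_short` are made up for this example.

```r
## The same function written both ways
square_long  <- function(x) x^2
square_short <- \(x) x^2
sapply(1:3, square_long)   # 1 4 9
sapply(1:3, square_short)  # 1 4 9
## The shorthand is most useful when written inline
sapply(1:3, \(x) x^2)      # 1 4 9
```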

The Ellipsis (`...`)

R has an unusual feature using the syntax ...
You may have seen this in multiple help pages
Allows arguments to be passed internally to functions without being defined
- Makes it very powerful but a little dangerous

Check the help page for mean()

The Ellipsis (`...`)

Let’s add this to our function

z_score <- function(x, na.rm = FALSE, ...) {
  ## The key elements we need for a Z-score are the mean & SD of a vector
  ## Include the ellipsis here for any additional arguments
  mn <- mean(x, na.rm = na.rm, ...)
  sd <- sd(x, na.rm = na.rm)
  ## To calculate the z-score we subtract the mean, then divide by the SD
  ## The last line executed is what the function returns
  (x - mn) / sd # No need to assign this internally to an object
}
z_score(some_num)

 [1] -1.11224480 -0.92107772 -0.72991065 -0.53874357 -0.34757650 -0.15640942
 [7]  0.03475765  0.22592472  0.41709180  0.60825887  2.51992962

The Ellipsis (`...`)

We know that mean can take an argument trim
Let’s see what happens

z_score(some_num, trim = 0.1)

 [1] -0.9558354 -0.7646683 -0.5735012 -0.3823341 -0.1911671  0.0000000
 [7]  0.1911671  0.3823341  0.5735012  0.7646683  2.6763390

This argument was passed to mean(), which excluded the outermost 10% of observations at each end before calculating the mean
What might’ve happened if we’d passed this to sd() internally?
- An error!!! sd() can’t take an argument called trim
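
Both points can be checked directly. This sketch calls mean() and sd() on the same vector outside of z_score(), capturing the error from sd() as a message so the code keeps running.

```r
some_num <- c(1:10, 20)
## With 11 values, trim = 0.1 drops one value from each end
mean(some_num)              # 6.818182
mean(some_num, trim = 0.1)  # 6
## sd() has no trim argument and no ellipsis, so this call fails
msg <- tryCatch(
  sd(some_num, trim = 0.1),
  error = \(e) conditionMessage(e)
)
msg  # an error message about an unused argument
```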

Closing Comments

S3 Method Dispatch

S3 is the most common class system in R
Can make looking inside functions frustrating

Look inside the function mean using body(mean)
- UseMethod("mean")

This relies on the idea that multiple versions of mean exist
Have been defined for objects of different classes

S3 Method Dispatch

To see all of the versions of mean that exist

methods(mean)

[1] mean.Date        mean.default     mean.difftime    mean.POSIXct    
[5] mean.POSIXlt     mean.quosure*    mean.vctrs_vctr*
see '?methods' for accessing help and source code

The class is listed after mean.
- An asterisk means the method is not exported from its package namespace, so it is hidden from our eyes 🤷
When mean is called on an object:
- The class of the object is checked
- A matching method is found if possible
- If no method is found: use mean.default()
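
The dispatch steps above can be sketched with a made-up generic. The generic `describe()` and both of its methods are hypothetical and exist only for this illustration.

```r
## A hypothetical generic: all it does is dispatch on the class of x
describe <- function(x, ...) UseMethod("describe")
## The method used for objects of class data.frame
describe.data.frame <- function(x, ...) {
  paste("a data.frame with", nrow(x), "rows")
}
## The fallback when no matching method is found
describe.default <- function(x, ...) {
  "no specific method: using the default"
}
describe(iris)  # "a data.frame with 150 rows"
describe(1:10)  # "no specific method: using the default"
```

Each method is named as the generic, then a dot, then the class, which is the same pattern as the output from methods(mean).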

S3 Method Dispatch

For numeric vectors, mean.default() will be called

## Look inside the default function
body(mean.default)

See if you can follow what’s happening
A bunch of checks are performed
The length is found
Trimming is performed if requested

.Internal(mean(x)) is called
- .Internal means the function is built right into the core R code
- Not for hacks like us \(\implies\) only for R Core

S3 Method Dispatch

Many functions operate like this

methods(print)
methods(summary)

Challenge

Try creating an inline function to rename penguins in Title Case
You’ll need to
1. remove underscores & replace with spaces
2. convert to title case
3. decide what to do with mm (or Mm)

## As a starting hint
penguins |> 
  rename_with(
    \(x) {
      str_to_title(x)
    }
  )
