Using Functions in Series

RAdelaide 2025

Dr Stevie Pederson

Black Ochre Data Labs
The Kids Research Institute Australia

July 8, 2025

The Pipe Operator

Motivation

We’ve seen a bunch of ways to explore our data
Only using one function at a time

Perhaps we might like to subset (i.e. filter) our data, then sort it
How do we do that?

The Ugly (Old-School) Way

Let’s say we only want information on the Adelie penguins
Then we want to sort this by body_mass_g

## First we can subset the dataset
filter(penguins, species == "Adelie")

How do we then pass this to the arrange() function?

We could save this as a new object and then call arrange() on that object
We could place the output of filter() inside arrange()

The Ugly (Old-School) Way

# Let's create an object first: 'adelie_penguins'
adelie_penguins <- filter(penguins, species == "Adelie")

# Now we can pass this to `arrange()`
arrange(adelie_penguins, body_mass_g)

Is this any good?

A complete analysis would lead to a workspace with multiple, similar objects, e.g. adelie_penguins, penguins_2007, torgersen_penguins, etc.

This can become very messy and confusing

Another Ugly (Old-School) Way

# This time, we can wrap the output of one function inside the other
arrange(
  filter(penguins, species == "Adelie"), 
  body_mass_g
)

We first filter our dataset \(\implies\) the filtered result becomes the tibble to be sorted

Is this any good?

Functions are executed in order from the innermost function to the outermost. First the filtering is done, and then the result is passed to arrange()

Can become very messy if calling 10 functions in a row

The Pipe Operator

R v.4.1 introduced the (base) pipe operator: |>
Exactly like sticking a pipe or a hose on the output of one function then placing the pipe as the input of the next

This allows us to pass the output of one function to the next
- Can chain together multiple functions
- No need to create intermediate objects
- No need to wrap the output of one function inside another

The output of the first function is passed to the next
- By default, it will be the first argument

The Pipe Operator

# This is conventionally how we've used filter
filter(penguins, species == "Adelie")

# A tibble: 152 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 142 more rows
# ℹ 2 more variables: sex <fct>, year <int>

The Pipe Operator

# Here, we're passing the object to filter using |>
penguins |> filter(species == "Adelie")

# A tibble: 152 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 142 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Calling an object by name simply returns the complete object, which the pipe then passes to filter()

The Pipe Operator

# Filter the object, then pass the filtered object to arrange using the pipe
penguins |> filter(species == "Adelie") |> arrange(body_mass_g)

# A tibble: 152 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Biscoe              36.5          16.6               181        2850
 2 Adelie  Biscoe              36.4          17.1               184        2850
 3 Adelie  Biscoe              34.5          18.1               187        2900
 4 Adelie  Dream               33.1          16.1               178        2900
 5 Adelie  Torgersen           38.6          17                 188        2900
 6 Adelie  Biscoe              37.9          18.6               193        2925
 7 Adelie  Dream               37.5          18.9               179        2975
 8 Adelie  Dream               37            16.9               185        3000
 9 Adelie  Dream               37.3          16.8               192        3000
10 Adelie  Torgersen           35.9          16.6               190        3050
# ℹ 142 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Forming a Chain Across Multiple Lines

A common practice is to spread these chains across multiple lines
We can comment anywhere we please

penguins |> 
  filter(species == "Adelie") |> # First filter the species
  arrange(body_mass_g) # Now sort by body mass

Makes long chains much easier to read
Can easily comment out a line when building/testing code

A Longer Example

Let’s find the heaviest 5 Adelie penguins

penguins |> 
  filter(species == "Adelie") |> # First filter the species
  arrange(desc(body_mass_g)) |> # Now sort by body mass, descending
  slice_head(n = 5) # Take the first 5 rows

Revisiting `dplyr`

Additional Capabilities of `dplyr`

The function distinct() will remove duplicate rows across the provided columns

## Return the unique values in the species column
penguins |> distinct(species)

## Return the unique combinations of the species and island columns
penguins |> distinct(species, island)

Bringing Everything Together

The problem we want to address is a missing column from the penguins data
Check the object penguins_raw
- This is the master object which penguins was derived from

We’d like to add the studyName column to penguins for formal reporting

penguins_raw |>
  mutate(year = year(`Date Egg`)) |> # `year()` is from the package lubridate
  distinct(studyName, year)

Using `case_when()`

Now that we know our basic structure
\(\implies\) use case_when()

This is like an ifelse statement with multiple conditions

if (condition) {
  # do something
} else {
  # do something else
}

Using `case_when()`

We can use this inside a mutate function
Logical tests are performed sequentially, one row at a time
- If the LHS of ~ is TRUE, the RHS value is assigned
- The first test that is TRUE wins

penguins |>
  mutate(
    studyName = case_when(
      year == 2007 ~ "PAL0708",
      year == 2008 ~ "PAL0809",
      year == 2009 ~ "PAL0910",
    )
  )
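A small sketch of how case_when() behaves on values that match none, or more than one, of the tests. The vectors toy_year and toy_mass and the labels "light", "medium" and "heavy" are made up for illustration and are not part of the penguins data.

```r
library(dplyr)

# Hypothetical years: 2010 matches none of the tests below
toy_year <- c(2007, 2008, 2009, 2010)

study <- case_when(
  toy_year == 2007 ~ "PAL0708",
  toy_year == 2008 ~ "PAL0809",
  toy_year == 2009 ~ "PAL0910"
)
study # The unmatched value (2010) becomes NA

# Hypothetical body masses: 2900 passes both of the first two tests,
# but only the first matching test is used
toy_mass <- c(2900, 3800, 5200)

size <- case_when(
  toy_mass < 3500 ~ "light",
  toy_mass < 4500 ~ "medium",
  TRUE ~ "heavy" # Catch-all for anything not matched above
)
size
```

Placing a catch-all test last is one way to avoid unexpected NA values.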

An Alternative Using Joins

Many coding problems have more than one viable solution
We could also use left_join() or right_join()

## Create a summary of the studyNames by year
studies <- penguins_raw |>
  mutate(year = year(`Date Egg`)) |> 
  distinct(studyName, year)

An Alternative Using Joins

We could now use left_join() to add this to penguins
- penguins will be taken as a scaffold to add values to
- Is on the LHS of the pipe \(\therefore\) left_join()

## Use `left_join()` to join the two datasets on the `year` column
penguins |> left_join(studies, by = "year")

Notice that the simple dataset studies was expanded
- Every time the year values matched \(\implies\) studyName was added

The downside to this method is we created a new object (studies)
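The scaffold idea can be sketched on toy data. The objects toy_birds and toy_studies are hypothetical stand-ins for penguins and studies, with one year (2010) deliberately left without a match.

```r
library(dplyr)

# Hypothetical scaffold: one row per bird
toy_birds <- tibble(
  id = 1:4,
  year = c(2007, 2008, 2008, 2010)
)

# Hypothetical lookup table: one row per year
toy_studies <- tibble(
  year = c(2007, 2008, 2009),
  studyName = c("PAL0708", "PAL0809", "PAL0910")
)

# Every row of the scaffold is kept, in its original order
joined <- toy_birds |> left_join(toy_studies, by = "year")
joined
```

Each bird keeps its row, the 2008 study name is repeated for both 2008 birds, and the bird from 2010 gets NA because no study matches.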

An Alternative Using Joins

We could “flip” this using right_join()
- The dataset on the RHS is taken as the scaffold

# Create the summaries on the fly, then use `penguins` as the RHS scaffold
penguins_raw |>
  mutate(year = year(`Date Egg`)) |> 
  distinct(studyName, year) |>
  right_join(penguins)

Only difference is the column order

Additional Joins

Multiple columns can be passed to the by argument

inner_join() returns only the rows which have a match in both datasets
full_join() produces the union of the two datasets
- All rows from both datasets are returned
- Missing values are filled with NA
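A sketch comparing the two joins, matching on two columns at once. The tibbles toy_x and toy_y and their count columns are invented for illustration.

```r
library(dplyr)

# Hypothetical datasets sharing the columns species and island
toy_x <- tibble(
  species = c("Adelie", "Adelie", "Gentoo"),
  island = c("Dream", "Biscoe", "Biscoe"),
  n_x = c(1, 2, 3)
)
toy_y <- tibble(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  island = c("Dream", "Biscoe", "Dream"),
  n_y = c(10, 20, 30)
)

# Only the combinations found in both datasets
inner <- toy_x |> inner_join(toy_y, by = c("species", "island"))

# All combinations from either dataset, with NA where there is no match
full <- toy_x |> full_join(toy_y, by = c("species", "island"))

nrow(inner)
nrow(full)
```

Here inner has 2 rows (Adelie/Dream and Gentoo/Biscoe), whilst full has 4 rows with one NA in each of n_x and n_y.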

Using Distinct Wisely

Setting the argument .keep_all = TRUE will return all columns
- Only the first row for each unique combination will be retained

## Sort by body mass, descending then use distinct to return the
## first unique combination of species and island
penguins |>
  arrange(desc(body_mass_g)) |> 
  distinct(species, island, .keep_all = TRUE)
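Why sorting first matters can be seen on a toy example. The tibble toy_penguins and its mass column are hypothetical values, not taken from the real dataset.

```r
library(dplyr)

# Hypothetical data: two birds for each species/island combination
toy_penguins <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  island = c("Dream", "Dream", "Biscoe", "Biscoe"),
  mass = c(3000, 3900, 5000, 5600)
)

# Without .keep_all, only the columns passed to distinct() are returned
pairs_only <- toy_penguins |> distinct(species, island)

# Sorting first means the heaviest bird is the first row seen
# for each combination, so it is the row which is kept
heaviest <- toy_penguins |>
  arrange(desc(mass)) |>
  distinct(species, island, .keep_all = TRUE)
heaviest
```

Without the arrange() step, the rows kept would simply be whichever appeared first in the data.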

Challenges

Find the lightest 5 Gentoo penguins
- Return the weights in kg instead of g
Find the mean bill length for male penguins
- Sort your answer in descending order
Find how many penguins were measured per year on each island
- Sort your answer by island, then by year
Use slice_max() to return the same penguins as the final example, i.e. the heaviest from each species and island
Recreate penguins from penguins_raw, but retaining studyName and the Individual ID as additional columns

Closing Comments

The shell has had a pipe since 1973
- Originally > then changed to |
The R base pipe (|>) was introduced in v.4.1 (2021)

An earlier pipe (%>%) was introduced in the package magrittr (~2014)
- Often referred to as “The Magrittr”
- Much internet code will use this pipe
- Very similar in behaviour
  BUT differences are non-trivial
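Two of the differences can be sketched as follows. This is not a complete list, and it assumes the magrittr package is installed and R is version 4.2 or later (needed for the `_` placeholder).

```r
library(magrittr)

# 1. magrittr accepts a bare function name,
#    whereas the base pipe needs a function call with ()
sqrt_magrittr <- c(1, 4, 9) %>% sqrt
sqrt_base <- c(1, 4, 9) |> sqrt()

# 2. The placeholders differ when the input is not the first argument
seq_magrittr <- 4 %>% seq(1, .)        # magrittr uses `.`
seq_base <- 4 |> seq(from = 1, to = _) # base uses `_` with a named argument

identical(sqrt_magrittr, sqrt_base)
identical(seq_magrittr, seq_base)
```

Writing c(1, 4, 9) |> sqrt without the brackets is an error with the base pipe.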

René Magritte, *The Treachery of Images*, 1929
