Using Functions in Series

RAdelaide 2025

Dr Stevie Pederson

Black Ochre Data Labs
The Kids Research Institute Australia

July 8, 2025

The Pipe Operator

Motivation

  • We’ve seen a bunch of ways to explore our data
  • Only using one function at a time
  • Perhaps we might like to subset (i.e. filter) our data, then sort it
  • How do we do that?

The Ugly (Old-School) Way

  • Let’s say we only want information on the Adelie penguins
  • Then we want to sort this by body_mass_g
## First we can subset the dataset
filter(penguins, species == "Adelie")

How do we then pass this to the arrange() function?

  1. We could save this as a new object and then call arrange() on that object
  2. We could place the output of filter() inside arrange()

The Ugly (Old-School) Way

# Let's create an object first: 'adelie_penguins'
adelie_penguins <- filter(penguins, species == "Adelie")

# Now we can pass this to `arrange()`
arrange(adelie_penguins, body_mass_g)

Is this any good?

A complete analysis would lead to a workspace with multiple, similar objects, e.g. adelie_penguins, penguins_2007, torgersen_penguins, etc.

This can become very messy and confusing

Another Ugly (Old-School) Way

# This time, we can wrap the output of one function inside the other
arrange(
  filter(penguins, species == "Adelie"), 
  body_mass_g
)
  • The inner filter() runs first \(\implies\) its output becomes the tibble to be sorted

Is this any good?

Functions are executed in order from the innermost to the outermost: first the filtering is done, then the result is passed to arrange()

This can become very messy when calling 10 functions in a row

The Pipe Operator

  • R v.4.1 introduced the (base) pipe operator: |>
  • Exactly like sticking a pipe or a hose on the output of one function then placing the pipe as the input of the next
  • This allows us to pass the output of one function to the next
    • Can chain together multiple functions
    • No need to create intermediate objects
    • No need to wrap the output of one function inside another
  • The output of the function on the LHS of the pipe is passed to the function on the RHS
    • By default, it becomes the first argument
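
To make the "first argument" rule concrete, here is a minimal base-R sketch. The vector `x` is made up for illustration; the point is only that a piped chain and its nested equivalent give the same result.

```r
# A made-up numeric vector, used only for illustration
x <- c(1, 4, 9)

# Nested form: read from the inside out
nested <- sum(sqrt(x))

# Piped form: read left to right; each output becomes the first
# argument of the next call (requires R >= 4.1)
piped <- x |> sqrt() |> sum()

# Both forms perform exactly the same computation
identical(nested, piped)
```

The piped version lists the steps in the order they happen, which is the main reason to prefer it for longer chains.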

The Pipe Operator

# This is conventionally how we've used filter
filter(penguins, species == "Adelie") 
# A tibble: 152 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 142 more rows
# ℹ 2 more variables: sex <fct>, year <int>

The Pipe Operator

# Here, we're passing the object to filter using |>
penguins |> filter(species == "Adelie")
# A tibble: 152 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 142 more rows
# ℹ 2 more variables: sex <fct>, year <int>
  • Calling an object by name simply returns the complete object, which the pipe then passes to filter()

The Pipe Operator

# Filter the object, then pass the filtered object to arrange using the pipe
penguins |> filter(species == "Adelie") |> arrange(body_mass_g)
# A tibble: 152 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Biscoe              36.5          16.6               181        2850
 2 Adelie  Biscoe              36.4          17.1               184        2850
 3 Adelie  Biscoe              34.5          18.1               187        2900
 4 Adelie  Dream               33.1          16.1               178        2900
 5 Adelie  Torgersen           38.6          17                 188        2900
 6 Adelie  Biscoe              37.9          18.6               193        2925
 7 Adelie  Dream               37.5          18.9               179        2975
 8 Adelie  Dream               37            16.9               185        3000
 9 Adelie  Dream               37.3          16.8               192        3000
10 Adelie  Torgersen           35.9          16.6               190        3050
# ℹ 142 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Forming a Chain Across Multiple Lines

  • A common practice is to spread these chains across multiple lines
  • We can comment anywhere we please
penguins |> 
  filter(species == "Adelie") |> # First filter the species
  arrange(body_mass_g) # Now sort by body mass
  • Makes long chains much easier to read
  • Can easily comment out a line when building/testing code

A Longer Example

  • Let’s find the heaviest 5 Adelie penguins
penguins |> 
  filter(species == "Adelie") |> # First filter the species
  arrange(desc(body_mass_g)) |> # Now sort by body mass, descending
  slice_head(n = 5) # Take the first 5 rows

Revisiting dplyr

Additional Capabilities of dplyr

  • The function distinct() will remove duplicate rows across the provided columns
## Return the unique values in the species column
penguins |> distinct(species)

## Return the unique combinations of the species and island columns
penguins |> distinct(species, island)

Bringing Everything Together

  • The problem we want to address is a missing column from the penguins data
  • Check the object penguins_raw
    • This is the master object which penguins was derived from
  • We’d like to add the studyName column to penguins for formal reporting
penguins_raw |>
  mutate(year = year(`Date Egg`)) |> # `year()` is from the package lubridate
  distinct(studyName, year)

Using case_when()

  • Now that we know our basic structure
    \(\implies\) use case_when()
  • This is like an ifelse statement with multiple conditions
if (condition is true) 
  do something
else 
  do something else
endif

Using case_when()

  • We can use this inside a mutate function
  • Logical tests are performed sequentially
    • If the LHS of ~ is TRUE, the RHS value is assigned
    • The first matching test wins, and rows matching no test get NA
penguins |>
  mutate(
    studyName = case_when(
      year == 2007 ~ "PAL0708",
      year == 2008 ~ "PAL0809",
      year == 2009 ~ "PAL0910",
    )
  )
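
The sequential behaviour can be seen on a small example. This is only a sketch: the tibble `toy`, the column `period` and its labels are hypothetical and not part of the penguins data.

```r
library(dplyr)

# Hypothetical toy data: 2010 is deliberately not covered by the first set of tests
toy <- tibble(year = c(2007, 2008, 2009, 2010))

toy_named <- toy |>
  mutate(
    # No test matches 2010, so that row receives NA
    studyName = case_when(
      year == 2007 ~ "PAL0708",
      year == 2008 ~ "PAL0809",
      year == 2009 ~ "PAL0910"
    ),
    # Tests run top to bottom and the first TRUE wins;
    # a final `TRUE ~` acts as a catch-all for everything else
    period = case_when(
      year <= 2007 ~ "first season",
      year <= 2009 ~ "later season",
      TRUE ~ "outside study"
    )
  )

toy_named
```

Because 2007 already matched the first test in `period`, it never reaches the second test, even though `year <= 2009` is also TRUE for it.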

An Alternative Using Joins

  • Many coding problems have more than one viable solution
  • We could also use left_join() or right_join()
## Create a summary of the studyNames by year
studies <- penguins_raw |>
  mutate(year = year(`Date Egg`)) |> 
  distinct(studyName, year)

An Alternative Using Joins

  • We could now use left_join() to add this to penguins
    • penguins will be taken as a scaffold to add values to
    • Is on the LHS of the pipe \(\therefore\) left_join()
## Use `left_join()` to join the two datasets on the `year` column
penguins |> left_join(studies, by = "year") 
  • Notice that the simple dataset studies was expanded
    • Every time the year values matched \(\implies\) studyName was added
  • The downside to this method is we created a new object (studies)
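
The "expansion" is easier to see on a few rows. The sketch below uses two hypothetical stand-ins, `toy_penguins` and `toy_studies`, rather than the real objects.

```r
library(dplyr)

# Hypothetical stand-ins for `penguins` and `studies`
toy_penguins <- tibble(
  species = c("Adelie", "Gentoo", "Adelie", "Chinstrap"),
  year    = c(2007, 2007, 2008, 2009)
)
toy_studies <- tibble(
  studyName = c("PAL0708", "PAL0809", "PAL0910"),
  year      = c(2007, 2008, 2009)
)

# Every row of the LHS is kept; the matching studyName is repeated as needed
toy_joined <- toy_penguins |> left_join(toy_studies, by = "year")

toy_joined
```

The three-row lookup table ends up contributing a value to all four rows, with "PAL0708" appearing twice because two rows share the year 2007.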

An Alternative Using Joins

  • We could “flip” this using right_join()
    • The dataset on the RHS is taken as the scaffold
# Create the summaries on the fly, then use `penguins` as the RHS scaffold
penguins_raw |>
  mutate(year = year(`Date Egg`)) |> 
  distinct(studyName, year) |>
  right_join(penguins, by = "year") # Name the join column explicitly
  • Main difference is the column order (the row order may also differ)

Additional Joins

  • Multiple columns can be passed to the by argument
  • inner_join() returns only the rows with a match in both datasets
  • full_join() produces the union of the two datasets
    • All rows from both datasets are returned
    • Missing values are filled with NA
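
A small sketch of the difference between the two, using hypothetical tables `left_tbl` and `right_tbl` that only partly overlap in their `id` column:

```r
library(dplyr)

# Hypothetical toy tables with only partial overlap in the key column
left_tbl  <- tibble(id = c(1, 2, 3), x = c("a", "b", "c"))
right_tbl <- tibble(id = c(2, 3, 4), y = c("B", "C", "D"))

# Only ids present in both tables (2 and 3)
both <- inner_join(left_tbl, right_tbl, by = "id")

# All ids from either table (1 to 4); gaps are filled with NA
either <- full_join(left_tbl, right_tbl, by = "id")

both
either
```

To join on more than one column, a character vector can be supplied instead, e.g. `by = c("species", "year")`.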

Using Distinct Wisely

  • Setting the argument .keep_all = TRUE will return all columns
    • Only the first row for each unique combination is retained
## Sort by body mass, descending then use distinct to return the
## first unique combination of species and island
penguins |>
  arrange(desc(body_mass_g)) |> 
  distinct(species, island, .keep_all = TRUE)
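
Why the sort matters can be checked on a toy table. The object `toy` and its `mass` column below are hypothetical.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  mass    = c(3700, 4100, 5000, 5600)
)

# Without .keep_all, only the columns named in distinct() are returned
species_only <- toy |> distinct(species)

# Sorting first controls which row counts as "first" for each species
heaviest <- toy |>
  arrange(desc(mass)) |>
  distinct(species, .keep_all = TRUE)

heaviest
```

Without the `arrange()` step, the retained row for each species would simply be whichever appeared first in the original data.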

Challenges

  1. Find the lightest 5 Gentoo penguins
    • Return the weights in kg instead of g
  2. Find the mean bill length for male penguins
    • Sort your answer in descending order
  3. Find how many penguins were measured per year on each island
    • Sort your answer by island, then by year
  4. Use slice_max() to return the same penguins as the final example, i.e. the heaviest from each species and island
  5. Recreate penguins from penguins_raw, but retaining studyName and the Individual ID as additional columns

Closing Comments

  • The shell has had a pipe since 1973
    • Originally > then changed to |
  • The R base pipe (|>) was introduced in v.4.1 (2021)
  • An earlier pipe (%>%) was introduced in the package magrittr (~2014)
    • Often referred to as “The Magrittr”
    • Much internet code will use this pipe
    • Very similar in behaviour
      BUT differences are non-trivial
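
Two of the differences most likely to be met in practice are sketched below. The built-in `mtcars` dataset and the `lm()` call are chosen only for illustration, and the `_` placeholder assumes R >= 4.2.

```r
library(magrittr)  # provides %>%; installed as a dependency of dplyr

# 1. Passing the LHS to an argument other than the first:
#    the base pipe uses `_` and the argument must be named,
#    while magrittr uses `.`
fit_base     <- mtcars |> lm(mpg ~ wt, data = _)
fit_magrittr <- mtcars %>% lm(mpg ~ wt, data = .)

# Both pipes fit the same model
all.equal(coef(fit_base), coef(fit_magrittr))

# 2. magrittr accepts a bare function name on the RHS;
#    the base pipe always needs the parentheses
root_magrittr <- c(1, 4, 9) %>% sqrt
root_base     <- c(1, 4, 9) |> sqrt()
```

Code copied from the internet that relies on `.` or on bare function names will therefore need small changes before it works with `|>`.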

Rene Magritte, The Treachery of Images, 1929