## First we can subset the dataset
filter(penguins, species == "Adelie")
Using Functions in Series
RAdelaide 2025
The Pipe Operator
Motivation
- We’ve seen a bunch of ways to explore our data
- Only using one function at a time
- Perhaps we might like to subset (i.e. filter) our data, then sort it
- How do we do that?
- Should we stick with DataExploration.R?
The Ugly (Old-School) Way
- Let’s say we only want information on the Adelie penguins
- Then we want to sort this by `body_mass_g`
How do we then pass this to the `arrange()` function?
- We could save this as a new object and then call `arrange()` on that object
- We could place the output of `filter()` inside `arrange()`
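The first option, saving the filtered data as an intermediate object, might look like the sketch below. It assumes `dplyr` and `palmerpenguins` are installed and that `penguins` comes from `palmerpenguins`; the object names `adelie` and `adelie_sorted` are only illustrative.

```r
library(dplyr)
library(palmerpenguins)

# Step 1: keep only the Adelie penguins and store the result
adelie <- filter(penguins, species == "Adelie")

# Step 2: sort the stored object by body mass (ascending by default)
adelie_sorted <- arrange(adelie, body_mass_g)
adelie_sorted
```

This works, but every extra step leaves another object behind in the workspace.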
Another Ugly (Old-School) Way
# This time, we can wrap the output of one function inside the other
arrange(
filter(penguins, species == "Adelie"),
body_mass_g )
- We have first filtered our dataset \(\implies\) this becomes the `tibble` to be sorted
Is this any good?
Functions are executed in order from the innermost function to the outermost. First the filtering is done, and then the result is passed to `arrange()`
Can become very messy if calling 10 functions in a row
The Pipe Operator
- R v.4.1 introduced the (base) pipe operator: `|>`
- Exactly like sticking a pipe or a hose on the output of one function then placing the pipe as the input of the next
- This allows us to pass the output of one function to the next
- Can chain together multiple functions
- No need to create intermediate objects
- No need to wrap the output of one function inside another
- The output of the first function is passed to the next
- By default, it will be the first argument
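As a small sketch of the "first argument" rule, the two calls below should give the same result. The `toy` data is made up purely to keep the example self-contained.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(id = 1:4, mass = c(3.9, 3.2, 4.5, 3.6))

# A piped call...
piped <- toy |> filter(mass > 3.5)

# ...is equivalent to passing `toy` as the first argument
nested <- filter(toy, mass > 3.5)

identical(piped, nested)
```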
Forming a Chain Across Multiple Lines
- A common practice is to spread these chains across multiple lines
- We can comment anywhere we please
penguins |>
  filter(species == "Adelie") |> # First filter the species
  arrange(body_mass_g) # Now sort by body mass
- Makes long chains much easier to read
- Can easily comment out a line when building/testing code
A Longer Example
- Let’s find the heaviest 5 Adelie penguins
penguins |>
  filter(species == "Adelie") |> # First filter the species
  arrange(desc(body_mass_g)) |> # Now sort by body mass, descending
  slice_head(n = 5) # Take the first 5 rows
Revisiting dplyr
Additional Capabilities of dplyr
- The function `distinct()` will remove duplicate rows across the provided columns
## Return the unique values in the species column
penguins |> distinct(species)
## Return the unique combinations of the species and island columns
penguins |> distinct(species, island)
Bringing Everything Together
- The problem we want to address is a missing column from the `penguins` data
- Check the object `penguins_raw`
  - This is the master object which `penguins` was derived from
- We’d like to add the `studyName` column to `penguins` for formal reporting
penguins_raw |>
  mutate(year = year(`Date Egg`)) |> # `year()` is from the package lubridate
  distinct(studyName, year)
How would we do this? I can think of two ways
Using case_when()
- Now that we know our basic structure \(\implies\) use `case_when()`
- This is like an `ifelse` statement with multiple conditions
if (condition is true)
do something
else
do something else
endif
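In R, this idea could be sketched with `case_when()`, where each condition is paired with a value using `~`. The data and the labels `study_A`, `study_B` and `study_C` are placeholders invented for illustration; the real labels would come from the `distinct(studyName, year)` output.

```r
library(dplyr)

# Hypothetical years and placeholder labels, not the real studyName values
toy <- tibble(year = c(2007, 2008, 2009, 2008))

labelled <- toy |>
  mutate(
    study = case_when(
      year == 2007 ~ "study_A", # if (year is 2007)
      year == 2008 ~ "study_B", # else if (year is 2008)
      TRUE ~ "study_C"          # else: everything remaining
    )
  )
labelled
```

Conditions are checked in order, so the first one that is true decides the value for each row.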
An Alternative Using Joins
- Many coding problems have more than one viable solution
- We could also use a `left/right_join()`
## Create a summary of the studyNames by year
studies <- penguins_raw |>
  mutate(year = year(`Date Egg`)) |>
  distinct(studyName, year)
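With `studies` created, one way to finish the job is a `left_join()` on the shared `year` column. This is only a sketch: it assumes `penguins` and `penguins_raw` come from `palmerpenguins`, that `penguins` has a `year` column, and that each year matches a single `studyName`. The name `penguins_studies` is illustrative.

```r
library(dplyr)
library(lubridate)
library(palmerpenguins)

# Rebuild the lookup table of study names by year
studies <- penguins_raw |>
  mutate(year = year(`Date Egg`)) |>
  distinct(studyName, year)

# Keep every row of penguins and attach the matching studyName
penguins_studies <- penguins |>
  left_join(studies, by = "year")

glimpse(penguins_studies)
```

If a year matched more than one `studyName`, the join would duplicate rows, so it is worth comparing `nrow()` before and after.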
Additional Joins
- Multiple columns can be passed to the `by` argument
- `inner_join()` produces the subset where all complete matches are present
- `full_join()` produces the union of the two datasets
  - All rows from both datasets are returned
  - Missing values are filled with `NA`
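The difference between these joins can be sketched with two small tables joined on two columns. Both tables and all their values are hypothetical and exist only for this example.

```r
library(dplyr)

# Hypothetical toy tables sharing the columns `site` and `year`
left_tbl <- tibble(
  site = c("A", "A", "B"),
  year = c(2007, 2008, 2007),
  x = 1:3
)
right_tbl <- tibble(
  site = c("A", "B", "B"),
  year = c(2008, 2007, 2008),
  y = c(10, 20, 30)
)

# Only the site/year combinations found in both tables
inner <- inner_join(left_tbl, right_tbl, by = c("site", "year"))

# Every site/year combination from either table, with NA where unmatched
full <- full_join(left_tbl, right_tbl, by = c("site", "year"))

inner
full
```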
Using Distinct Wisely
- Setting the argument `.keep_all = TRUE` will return all columns
  - Only the first unique appearance will be retained
## Sort by body mass, descending then use distinct to return the
## first unique combination of species and island
penguins |>
  arrange(desc(body_mass_g)) |>
  distinct(species, island, .keep_all = TRUE)
Challenges
- Find the lightest 5 Gentoo penguins
  - Return the weights in `kg` instead of `g`
- Find the mean bill length for male penguins
  - Sort your answer in descending order
- Find how many penguins were measured per year on each island
  - Sort your answer by island, then by year
- Use `slice_max()` to return the same penguins as the final example, i.e. the heaviest from each species and island
- Recreate `penguins` from `penguins_raw`, but retaining `studyName` and the `Individual ID` as additional columns
Closing Comments
- The shell has had a pipe since 1973 [1]
  - Originally `>` then changed to `|`
- The R base pipe (`|>`) was introduced in v.4.1 (2021)
- An earlier pipe (`%>%`) was introduced in the package `magrittr` (~2014)
  - Often referred to as “The Magrittr”
- Much internet code will use this pipe
- Very similar in behaviour, BUT differences are non-trivial
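Two of these differences are easy to show in a short sketch. It assumes `magrittr` is installed and R is at least v.4.2 (needed for the `_` placeholder), and it is not an exhaustive list. The `lm()` model on the built-in `mtcars` data is only there to give a function whose data argument is not first.

```r
library(magrittr) # provides %>%

x <- c(1, 4, 9)

# Both pipes pass `x` as the first argument of sqrt()
base_roots <- x |> sqrt()
mag_roots <- x %>% sqrt()

# magrittr also accepts a bare function name
bare_roots <- x %>% sqrt
# x |> sqrt   # the base pipe rejects this: it needs a call, i.e. sqrt()

# Placeholders differ when the data is not the first argument:
# `.` for magrittr, `_` for the base pipe (R >= 4.2, named arguments only)
mag_fit <- mtcars %>% lm(mpg ~ wt, data = .)
base_fit <- mtcars |> lm(mpg ~ wt, data = _)
```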
Interestingly, the shell was also developed at Bell Labs in the 1970s
Footnotes
[1] https://en.wikipedia.org/wiki/Pipeline_(Unix)