Exploring Data In R

RAdelaide 2025

Author: Dr Stevie Pederson
Affiliation: Black Ochre Data Labs, The Kids Research Institute Australia
Published: July 8, 2025

Summarising The Dataset

Obtaining Summaries

  • dplyr also provides functions to summarise data
    • count() and summarise() are the most common
    • We can tell these functions which columns to group the results by
# Count the number of penguins by species
count(penguins, species)

# Count the number of penguins by species and island
count(penguins, species, island)

# If we change the order of the columns, we get a different order in our results
count(penguins, island, species)

# The argument `sort = TRUE` will sort the results
count(penguins, species, island, sort = TRUE)
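
To make the shape of the output from count() concrete, here is a minimal sketch on a tiny made-up data frame. The object `birds` and its values are hypothetical and not part of the penguins data; the sketch assumes only that dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data: one row per bird
birds <- tibble(
  species = c("A", "A", "B", "A", "B", "C"),
  island  = c("x", "y", "x", "x", "x", "y")
)

# One row per species, with the counts in a column called n
by_species <- count(birds, species)

# One row per observed species/island combination, largest counts first
by_both <- count(birds, species, island, sort = TRUE)

by_species
by_both
```

Only combinations that actually occur in the data appear in the result, so `by_both` has four rows rather than one for every possible species/island pair.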

Obtaining Summaries

  • More nuanced summaries can be obtained using summarise()
    • We now pass the grouping variable to the argument .by
    • The summary column should also be given a name
# Find the mean weight of each species
summarise(penguins, mean_weight = mean(body_mass_g), .by = species)
  • This is the first time those missing values have caused us grief
    • The argument na.rm = TRUE will tell mean() to ignore NA values
# Ignore the missing values when calculating the mean
summarise(penguins, mean_weight = mean(body_mass_g, na.rm = TRUE), .by = species)
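
The effect of na.rm can be seen without any data frame at all. The sketch below uses a short hypothetical vector of weights (the values are invented for illustration) and base R only.

```r
# Hypothetical body masses in grams, with one missing value
weights <- c(3750, 3800, NA, 3450)

# A single NA makes the whole mean missing
mean(weights)

# Dropping the NA gives the mean of the three observed values
mean(weights, na.rm = TRUE)

# Counting the missing values first shows how much data is being ignored
sum(is.na(weights))
```

Checking how many values are missing before setting na.rm = TRUE is a useful habit, since the argument silently discards them.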

Obtaining Summaries

  • We can also summarise using multiple columns
    • Multiple grouping columns are combined using the function c()
# Summarise by both species and year
summarise(
  penguins, mean_weight = mean(body_mass_g, na.rm = TRUE), .by = c(species, year)
)

Obtaining Summaries

  • The previous code was split across multiple lines just to fit on the slide
  • Placing each argument on its own line makes the call easier to read
# Summarise by both species and year
summarise(
  penguins, 
  mean_weight = mean(body_mass_g, na.rm = TRUE), 
  .by = c(species, year)
)

Obtaining Summaries

  • This strategy can help when creating multiple summary columns
  • Instead of using count() we can call n() as part of summarise()
# Summarise by both species and year, counting the number of penguins
summarise(
  penguins, 
  mean_weight = mean(body_mass_g, na.rm = TRUE), 
  total = n(),
  .by = c(species, year)
)
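
One point worth checking when combining n() with na.rm = TRUE is that n() counts every row in the group, including rows where the measurement is missing. The sketch below illustrates this on a hypothetical toy data frame (`birds` and `mass_g` are invented names); it assumes dplyr 1.1.0 or later, where .by is available.

```r
library(dplyr)

# Hypothetical toy data with one missing weight in species A
birds <- tibble(
  species = c("A", "A", "A", "B", "B"),
  mass_g  = c(3700, NA, 3900, 5000, 5200)
)

summary_tbl <- summarise(
  birds,
  mean_weight = mean(mass_g, na.rm = TRUE),
  total = n(),                      # all rows in the group, including rows with NA
  n_weighed = sum(!is.na(mass_g)),  # rows that actually contribute to the mean
  .by = species
)

summary_tbl

# The total column matches what count() reports for the same grouping
count(birds, species)
```

For species A, `total` is 3 while `n_weighed` is 2, so the mean is based on fewer birds than the row count suggests.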

Grouping Arguments

  • The recent addition of .by has beefed up some earlier functions
# Grab the first 5 rows of each species
slice(penguins, 1:5, .by = species)
  • Some newer extensions of slice() are also useful for summarising data
    • slice_min(), slice_max()
# Return the heaviest penguin from each species
slice_max(penguins, body_mass_g, n = 1, by = species)
  • Confusingly, the argument .by has become by here 🤷🏻
    • The argument n = 1 says return only the top row per species (rows tied for the maximum are all returned by default)
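
How slice_max() treats ties is easiest to see on a small example. The sketch below uses a hypothetical toy data frame (`birds` and `mass_g` are invented names) in which two birds of species A share the maximum weight; it assumes dplyr 1.1.0 or later for the by argument.

```r
library(dplyr)

# Hypothetical toy data: two birds in species A are tied for heaviest
birds <- tibble(
  species = c("A", "A", "A", "B", "B"),
  mass_g  = c(4000, 4000, 3500, 5000, 5200)
)

# Default behaviour: tied rows are all kept, so species A returns two rows
with_ties <- slice_max(birds, mass_g, n = 1, by = species)

# with_ties = FALSE keeps exactly one row per group
no_ties <- slice_max(birds, mass_g, n = 1, by = species, with_ties = FALSE)

nrow(with_ties)
nrow(no_ties)
```

Which form is appropriate depends on the question: keeping ties answers "which birds are the heaviest", while with_ties = FALSE guarantees one row per group for a summary table.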

Conclusion

  • All functions so far have enabled exploration
    • arrange(), filter()
    • select() + helper functions
    • slice(), slice_max(), slice_min()
    • count(), summarise()
  • Many others we didn’t cover
    • slice_head(), slice_tail(), slice_sample()
    • Multiple join methods

Conclusion

  • Have never over-written our original dataset
  • Have never created a new object
  • Real world applications:
    • Preparing for plotting or regression
    • Summarising data for tables
  • Already a huge amount to remember!
  • We’ll be doing more exercises soon

A Word Of Caution

  • dplyr was written to parallel some SQL functions
    ⟹ Uses function names from SQL
  • Some other packages had used the same function names much earlier
    • e.g. multiple packages contain a filter() or select() function, such as stats::filter() and MASS::select()
  • If either function gives unexpected output
    ⟹ call it directly from the package (aka namespace)
  • We can use dplyr::select() instead of select()
    • Ensures we use the dplyr version
    • Same for dplyr::filter()
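
As an illustration of why the namespace matters, the sketch below calls the dplyr functions explicitly and then calls stats::filter(), which ships with base R and does something entirely different. The `scores` data frame is hypothetical and invented for this example.

```r
library(dplyr)

# Hypothetical toy data
scores <- tibble(id = 1:4, value = c(10, 20, 30, 40))

# Explicit namespace: always the dplyr versions, whatever else is loaded
big <- dplyr::filter(scores, value > 15)
ids <- dplyr::select(big, id)

# stats::filter() is a different function altogether: it applies a linear
# filter (here a 2-point moving average) to a numeric series
smoothed <- stats::filter(scores$value, rep(1 / 2, 2))

ids
class(smoothed)
```

If a script that worked yesterday suddenly fails inside filter() or select(), a package loaded after dplyr may have masked the function; the `package::function()` form avoids the ambiguity.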