Exploring Data In R

RAdelaide 2025

Dr Stevie Pederson

Black Ochre Data Labs
The Kids Research Institute Australia

July 8, 2025

The package palmerpenguins

Introducing The Penguins

  • We’ll be looking at the “Palmer Penguins” dataset
    • Taken from https://allisonhorst.github.io/palmerpenguins/index.html
    • 3 species of penguins from the Palmer Archipelago, Antarctica
  • Various physiological measurements

Exploring The Penguins

  • We won’t be creating any objects in this section
  • Learning how to explore a dataset using dplyr
    • For organising data
    • For creating summary tables
    • To prepare for creating plots & figures
    • Is a core tidyverse package
  • We’ll cover a huge amount of ground
    • Hopefully the exercises & challenges help

Starting An R Script

  • Best practice is to ALWAYS record your code
  • Today we’ll use an R script
    • Is a plain text file
    • Is a combination of code and comments
    • The filename should end with .R
  • Nothing we enter in the script is executed
    \(\implies\) until we intentionally execute the code

Starting An R Script

  • Create a new file DataExploration.R and type the following
## Load the palmerpenguins package
library(palmerpenguins)
  • The # symbol indicates a comment
    \(\implies\) ignored by R and nothing is executed
    • Used to explain code to humans
  • We write code for two primary reasons
    1. To be executed by R, and
    2. To be read and understood by humans (usually us in a few months)

Executing Code

  • So far, no code has been executed from this script
  • Check your Environment Tab to see if there are any objects
    • If there is an object (most likely x) \(\implies\) click the broom icon
    • This will clear any existing objects from the environment

  1. Place the cursor on the line of code library(palmerpenguins), then either
  2. Use the keyboard shortcut Ctrl + Enter (Windows/Linux) or Cmd + Enter (Mac), or
  3. Click the Run button in the top right of the script editor

What have we done so far?

  • We have simply loaded the palmerpenguins package
    • We called the function library()
    • This loads all the functions and the data in a requested package
    • The package name appears inside the parentheses ()
    • Very similar to calling sqrt(5) as we saw earlier
  • For Python users, the rough equivalent would be import palmerpenguins

The penguins dataset

  • The palmerpenguins package provides the penguins dataset, which is available as soon as the package is loaded
    • Add the comment and code below, then execute
# Let's look at the penguins dataset
penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

The penguins dataset

  • The penguins dataset is a tibble
  • The number of rows and columns is shown at the top (344 x 8)
    • Printed as a comment
  • Next is the column names:
    • species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex and year
    • The last two columns may just be listed at the bottom
    • How many columns are printed depends on screen width, font size etc

The penguins dataset

  • Underneath each column name is the data type
    • We’ll explore data types in more detail later
    • Each column has the same type of data
    • <fct> means ‘factor’ \(\implies\) a categorical variable
    • <dbl> means ‘double’ \(\implies\) a numeric variable
    • <int> means ‘integer’ \(\implies\) a whole number
  • Final lines show how many more rows & columns there are
  • Notice there are no rownames, just row numbers
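The type labels above can also be checked directly. This is a minimal sketch, assuming palmerpenguins is installed; the `$` operator, which extracts a single column, and the base R functions class(), typeof() and levels() are not covered in these slides.

```r
library(palmerpenguins)

# class() reports how R treats a column; typeof() reports how it is stored
class(penguins$species)             # "factor"  (printed as <fct>)
typeof(penguins$bill_length_mm)     # "double"  (printed as <dbl>)
typeof(penguins$flipper_length_mm)  # "integer" (printed as <int>)

# The categories of a factor are called levels
levels(penguins$species)
```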

Exploring Penguins

  • A common initial data exploration task is to get a summary of the data
# The column names are:
colnames(penguins)
[1] "species"           "island"            "bill_length_mm"   
[4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
[7] "sex"               "year"             

Copy the column names after the code, then comment

  1. Highlight the output in the console
  2. Copy & paste into the script
  3. Make a comment by adding # at the start of each line

Commenting Multiple Lines

  • Comments can be toggled on/off across multiple lines using:
    • Ctrl + Shift + C (Win/Linux)
    • Cmd + Shift + C (Mac)

Calling Functions

  • Notice that we placed the object inside the parentheses () after the function
  • Let’s continue checking the object
# Find out how big the dataset is
nrow(penguins)
ncol(penguins)
dim(penguins)
# Look at the first & last few rows
head(penguins)
tail(penguins)
  • Can you figure out what each of these functions does?

The package dplyr

Exploring Penguins with dplyr

  • The package dplyr \(\rightarrow\) functions for data exploration and manipulation
  • Let’s load the package as part of the tidyverse
    • I personally load all packages at the start of a script
    • Add this underneath the library(palmerpenguins) line
    • All functions in this section are from dplyr
library(tidyverse) # Load the complete tidyverse
  • We’ll use these functions to explore the penguins dataset
  • Then we can modify the dataset

Exploring Penguins with dplyr

## The `glimpse()` function is provided by dplyr
## Can be very helpful when a dataset has many columns
glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
  • So far we haven’t actually saved any objects (using <-)
  • Functions head(), tail(), and glimpse() are all printing to the console

Sorting Penguins

  • dplyr provides some Excel-like functions:
    • arrange() will sort the data
    • filter() will filter the data
# Sort the penguins by body mass in increasing order
arrange(penguins, body_mass_g)

# Sort the penguins by body mass in decreasing order
arrange(penguins, desc(body_mass_g))


# Sort by multiple columns, in the order passed to the function
arrange(penguins, species, body_mass_g)
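One detail worth checking when sorting is where the missing values end up. A small sketch, assuming palmerpenguins and dplyr are installed: arrange() places NA values last, whichever direction we sort in.

```r
library(palmerpenguins)
library(dplyr)

# Increasing order: the two rows with a missing body mass are at the very end
tail(arrange(penguins, body_mass_g), 3)

# Decreasing order: the missing values are still at the very end
tail(arrange(penguins, desc(body_mass_g)), 3)
```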

Filtering Penguins

Filtering relies on logical tests

Symbol     Description
==         Is exactly Equal To
> / <      Is Greater/Less Than
>= / <=    Is Greater/Less Than or Equal To
!=         Is Not Equal To
is.na()    Is Missing Value
%in%       Is in a set of possible values
  • In R, as in most programming languages, ! is interpreted as NOT
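These tests can be tried on small vectors before using them with filter(). The values below are made up purely for illustration, and nothing is saved as an object.

```r
# Each test returns TRUE or FALSE for every value in the vector
c(3750, 4200, 3900) > 4000                                    # FALSE  TRUE FALSE
c("Dream", "Biscoe", "Dream") == "Dream"                      #  TRUE FALSE  TRUE
c("Dream", "Biscoe", "Dream") != "Dream"                      # FALSE  TRUE FALSE
c("Dream", "Biscoe", "Torgersen") %in% c("Dream", "Biscoe")   #  TRUE  TRUE FALSE

# Comparing with a missing value gives NA, so use is.na() to find them
c(3750, NA) == NA     # NA NA
is.na(c(3750, NA))    # FALSE  TRUE

# ! reverses a logical test
!is.na(c(3750, NA))   #  TRUE FALSE
```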

Filtering Penguins

## Subset the data to those from the Island of Dream
filter(penguins, island == "Dream")

## Subset the data to those NOT from the Island of Dream
filter(penguins, island != "Dream")

## Subset the penguins to those lighter than 4000g
filter(penguins, body_mass_g < 4000)

## Find the penguins from Dream that are heavier than 4000g
filter(penguins, island == "Dream", body_mass_g > 4000)
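A few variations on the filters above, as a sketch assuming palmerpenguins and dplyr are installed. The operators & (AND) and | (OR) are base R and are not introduced in the table of logical tests.

```r
library(palmerpenguins)
library(dplyr)

# Conditions separated by a comma must all be TRUE; & does the same thing
filter(penguins, island == "Dream" & body_mass_g > 4000)

# Use | when either condition is enough
filter(penguins, island == "Dream" | body_mass_g > 4000)

# %in% tests against several values at once
filter(penguins, island %in% c("Dream", "Torgersen"))

# Rows where the test gives NA are dropped,
# so these two counts add up to fewer than the 344 rows in penguins
nrow(filter(penguins, body_mass_g < 4000))
nrow(filter(penguins, body_mass_g >= 4000))
```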

Slicing Penguins

  • filter() returns the rows that match the given criteria
  • slice() can be used to return rows by position
## Slice out the first 10 rows of the penguins dataset
slice(penguins, 1:10)

## Now slice out the 101st to 110th rows
slice(penguins, 101:110)

A Brief Diversion

  • In the two previous examples we used a sequence of consecutive values
1:10
 [1]  1  2  3  4  5  6  7  8  9 10
101:110
 [1] 101 102 103 104 105 106 107 108 109 110
  • We refer to one or more values in R as a vector
    • These are integer vectors
    • Integers are often used to denote rows/columns etc

A Brief Diversion

  • In R we can form a vector by combining values with the function c()
c(1, 3, 5, 7, 9)
[1] 1 3 5 7 9
# Return the first few odd-numbered rows using a vector of positions
slice(penguins, c(1, 3, 5, 7, 9))


# Rows can be returned in any order
slice(penguins, c(3, 7, 5, 1))
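A short sketch of some related ways to build position vectors, assuming palmerpenguins and dplyr are installed. The functions typeof() and seq() are base R and are not covered above; slice() accepts whole numbers whether they are stored as integers or doubles.

```r
library(palmerpenguins)
library(dplyr)

# The : operator creates integers, while c() with plain numbers creates doubles
typeof(1:10)               # "integer"
typeof(c(1, 3, 5, 7, 9))   # "double"

# seq() builds regular sequences, here every second position from 1 to 9
slice(penguins, seq(1, 9, by = 2))

# Negative positions drop rows instead of keeping them
# Dropping the first 340 rows leaves the last 4
slice(penguins, -(1:340))
```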

Selecting Penguins

  • filter() and slice() can be used to return rows
  • select() can be used to return columns
# Columns can be 'selected' by passing the required column names
select(penguins, species, island, body_mass_g)


# Columns can also be selected by position using a vector
select(penguins, c(1, 2, 6))

Using Names Or Position

  1. Do the above lines give the same result?
  2. Would either one be preferable?

Helper Functions

  • dplyr provides some helper functions to make selecting columns easier
  • starts_with(), ends_with() and contains(), are very useful!
  • everything() is also surprisingly useful
  • any_of() and all_of() are a bit more advanced
# Select all columns that start with "bill", after the species and island columns
select(penguins, species, island, starts_with("bill"))

# Select all length-related columns, after the species and island columns
select(penguins, species, island, contains("length"))
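For the more advanced helpers, a minimal sketch assuming palmerpenguins and dplyr are installed. Both helpers take column names as a character vector; "weight" is a made-up name used only to show how any_of() treats a column that does not exist.

```r
library(palmerpenguins)
library(dplyr)

# all_of() selects the named columns; every name must exist or we get an error
select(penguins, all_of(c("species", "sex")))

# any_of() silently ignores names that are not columns in the data
select(penguins, any_of(c("species", "weight")))
```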

Helper Functions

# Place all metadata first, followed by measured values
# This line shows how useful 'everything()' can be
select(penguins, species, island, sex, year, everything())
  • To remove a column, we preface the selection with -
# Remove all columns that end with 'mm'
select(penguins, -ends_with("mm"))

Relocating Penguins

# Relocate is a newer addition to dplyr and can also be used to reorder columns
# The arguments .before and .after can be used to specify where to place columns
# Here we're moving columns with an underscore to after the year column
relocate(penguins, contains("_"), .after = year)

# This time, we're moving the sex and year columns to 'before' the bill columns
relocate(penguins, sex, year, .before = starts_with("bill"))

Renaming Penguins

  • When we call select(), we can rename columns on the fly
## Rename the island column as 'location', leaving the order unchanged
select(penguins, species, location = island, everything())
  • Alternatively, we can just use rename()
## Or just rename the individual column
rename(penguins, location = island)

Modifying Columns With mutate()

  • So far, we’ve only subset our data using various methods
  • mutate() is used to modify existing columns or create new ones
# Create a column called `body_mass_kg` that is the body mass in kg
mutate(penguins, body_mass_kg = body_mass_g / 1000)
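A sketch of two further uses of mutate(), assuming palmerpenguins and dplyr are installed. The column body_mass_lb and the approximate conversion factor of 2.20462 pounds per kilogram are added here only for illustration.

```r
library(palmerpenguins)
library(dplyr)

# Several columns can be created in one call, and later ones can use earlier ones
mutate(
  penguins,
  body_mass_kg = body_mass_g / 1000,
  body_mass_lb = body_mass_kg * 2.20462  # approximate conversion
)

# Using an existing column name modifies that column instead of adding a new one
mutate(penguins, year = year - 2007)

# Nothing was saved, so penguins itself still has its original 8 columns
ncol(penguins)
```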

Exercises

  1. Use filter() to find all female penguins
    • Then find all female penguins with a flipper length greater than 215mm
  2. Use filter() to find all penguins where sex is missing (NA)
  3. Sort the dataset by bill_length_mm in descending order
  4. Use select() to return the species, island, and year columns
    • Repeat trying an alternative approach to your first answer
  5. Place the year column after island and remove the sex column
  6. Create the column bill_ratio by dividing bill length by depth

Summarising The Dataset

Obtaining Summaries

  • dplyr also provides functions to summarise data
    • count() and summarise() are the most common
    • We can tell these functions which columns to group by
# Count the number of penguins by species
count(penguins, species)

# Count the number of penguins by species and island
count(penguins, species, island)

# If we change the order of the columns, we get a different order in our results
count(penguins, island, species)

# The argument `sort = TRUE` will sort the results by count, largest first
count(penguins, species, island, sort = TRUE)

Obtaining Summaries

  • More nuanced summaries can be obtained using summarise()
    • We now pass the grouping variable to the argument .by
    • The summary column should also be given a name
# Find the mean weight of each species
summarise(penguins, mean_weight = mean(body_mass_g), .by = species)
  • This is the first time those missing values have caused us grief
    • The argument na.rm = TRUE will tell mean() to ignore NA values
# Ignore the missing values when calculating the mean
summarise(penguins, mean_weight = mean(body_mass_g, na.rm = TRUE), .by = species)
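The effect of na.rm can be seen on a small vector. The three values below are made up for illustration.

```r
# A single missing value makes the mean missing as well
mean(c(3750, 3800, NA))                 # NA

# With na.rm = TRUE the NA is dropped before averaging
mean(c(3750, 3800, NA), na.rm = TRUE)   # 3775
```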

Obtaining Summaries

  • We can also summarise using multiple columns
    • We combine the grouping columns using the function c()
# Summarise by both species and year
summarise(
  penguins, mean_weight = mean(body_mass_g, na.rm = TRUE), .by = c(species, year)
)

Obtaining Summaries

  • The previous code was split across multiple lines just to fit on the slide
  • We can also place each argument on its own line for greater readability
# Summarise by both species and year
summarise(
  penguins, 
  mean_weight = mean(body_mass_g, na.rm = TRUE), 
  .by = c(species, year)
)

Obtaining Summaries

  • This strategy can help when creating multiple summary columns
  • Instead of using count() we can call n() as part of summarise()
# Summarise by both species and year, counting the number of penguins
summarise(
  penguins, 
  mean_weight = mean(body_mass_g, na.rm = TRUE), 
  total = n(),
  .by = c(species, year)
)
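One thing to keep in mind is that n() counts every row in a group, including rows where the body mass is missing. A sketch assuming palmerpenguins and dplyr (version 1.1.0 or later, for .by) are installed; the column names sd_weight and n_weighed are chosen here for illustration.

```r
library(palmerpenguins)
library(dplyr)

summarise(
  penguins,
  mean_weight = mean(body_mass_g, na.rm = TRUE),
  sd_weight = sd(body_mass_g, na.rm = TRUE),
  total = n(),                           # every row in the group
  n_weighed = sum(!is.na(body_mass_g)),  # only rows with a recorded body mass
  .by = species
)
```

Summing a logical vector counts the TRUE values, which is why sum(!is.na(...)) gives the number of non-missing measurements.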

Grouping Arguments

  • The recent addition of .by has beefed up some earlier functions
# Grab the first 5 rows of each species
slice(penguins, 1:5, .by = species)
  • Some newer extensions of slice() are also useful for summarising data
    • slice_min(), slice_max()
# Return the heaviest penguin from each species
slice_max(penguins, body_mass_g, n = 1, by = species)
  • Confusingly, the argument .by has become by here 🤷🏻
    • The argument n = 1 says return only one penguin per species

Conclusion

  • All functions so far have enabled exploration
    • arrange(), filter()
    • select() + helper functions
    • slice(), slice_max(), slice_min()
    • count(), summarise()
  • Many others we didn’t cover
    • slice_head(), slice_tail(), slice_sample()
    • Multiple join methods

Conclusion

  • Have never over-written our original dataset
  • Have never created a new object
  • Real world applications:
    • Preparing for plotting or regression
    • Summarising data for tables
  • Already a huge amount to remember!
  • We’ll be doing more exercises soon

A Word Of Caution

  • dplyr was written to parallel some SQL functions
    \(\implies\) Uses function names from SQL
  • Some other packages had the same idea much earlier
    • e.g. multiple packages contain a filter or select function
  • If either function gives unexpected output
    \(\implies\) call directly from the package (aka namespace)
  • We can use dplyr::select() instead of select()
    • Ensures we use the dplyr version
    • Same for dplyr::filter()
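A sketch of calling functions directly from a namespace, assuming palmerpenguins and dplyr are installed. The stats package ships with R and is shown here as one example of another package with its own filter().

```r
library(palmerpenguins)

# The package prefix works even if dplyr has not been loaded with library()
dplyr::filter(penguins, island == "Dream")
dplyr::select(penguins, species, island)

# stats::filter() is a completely different function, designed for time series
# Here it calculates a moving average across three neighbouring values
stats::filter(1:5, rep(1/3, 3))
```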