Exploring Data In R

RAdelaide 2025

Dr Stevie Pederson

Black Ochre Data Labs
The Kids Research Institute Australia

July 8, 2025

The package `palmerpenguins`

Introducing The Penguins

We’ll be looking at the “Palmer Penguins” dataset
- Taken from https://allisonhorst.github.io/palmerpenguins/index.html
- 3 species of penguins from the Palmer Archipelago, Antarctica
- Various physiological measurements

Exploring The Penguins

We won’t be creating any objects in this section
Learning how to explore a dataset using dplyr
- For organising data
- For creating summary tables
- To prepare for creating plots & figures
- Is a core tidyverse package

We’ll cover a huge amount of ground
- Hopefully the exercises & challenges help

Starting An R Script

Best practice is to ALWAYS record your code
Today we’ll use an R script
- Is a plain text file
- Is a combination of code and comments
- The filename should end with .R

Nothing we enter in the script is executed until we intentionally execute the code

Starting An R Script

Create a new file called DataExploration.R and type the following

## Load the palmerpenguins package
library(palmerpenguins)

The # symbol indicates a comment
\(\implies\) ignored by R and nothing is executed
- Used to explain code to humans

We write code for two primary reasons
1. To be executed by R, and
2. To be read and understood by humans (usually us in a few months)

Executing Code

So far, no code has been executed from this script
Check your Environment Tab to see if there are any objects
- If there is an object (most likely x) \(\implies\) click the broom icon
- This will clear any existing objects from the environment

Place the cursor on the line of code library(palmerpenguins), then either
- Use the keyboard shortcut Ctrl + Enter (Windows/Linux) or Cmd + Enter (Mac), or
- Click the Run button in the top right of the script editor

What have we done so far?

We have simply loaded the palmerpenguins package
- We called the function library()
- This loads all the functions and the data in a requested package
- The package name appears inside the parentheses ()
- Very similar to calling sqrt(5) as we saw earlier

For Python users, this is roughly equivalent to import palmerpenguins

The `penguins` dataset

The palmerpenguins package provides the penguins dataset, which is available as soon as the package is loaded
- Add the comment and code below, then execute

# Let's look at the penguins dataset
penguins

# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

The `penguins` dataset

The penguins dataset is a tibble
The number of rows and columns is shown at the top (344 x 8)
- Printed as a comment

Next are the column names:
- species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex and year
- The last two columns may just be listed at the bottom
- This depends on screen width, font size etc

The `penguins` dataset

Underneath each column name is the data type
- We'll explore data types in more detail later
- Each column has the same type of data
- <fct> means ‘factor’ \(\implies\) a categorical variable
- <dbl> means ‘double’ \(\implies\) a numeric variable
- <int> means ‘integer’ \(\implies\) a whole number

Final lines show how many more rows & columns there are
Notice there are no rownames, just row numbers
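
As an optional aside, the column types can also be checked directly. This is a minimal sketch using the base R functions class() and sapply(), neither of which has been introduced in this session. Note that class() reports the <dbl> columns as "numeric".

```r
library(palmerpenguins)

# class() describes the type of an object; penguins is a tibble ("tbl_df")
class(penguins)

# sapply() applies class() to every column and returns one type per column
# Columns shown as <dbl> in the printed tibble are reported as "numeric"
sapply(penguins, class)
```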

Exploring Penguins

A common initial data exploration task is to get a summary of the data

# The column names are:
colnames(penguins)

[1] "species"           "island"            "bill_length_mm"   
[4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
[7] "sex"               "year"

Copy the column names after the code, then comment

Highlight the output in the console
Copy & paste into the script
Make a comment by adding # at the start of each line

Commenting Multiple Lines

Comments can be toggled on/off across multiple lines using:
- Ctrl + Shift + C (Win/Linux)
- Cmd + Shift + C (Mac)

Calling Functions

Notice that we placed the object inside the parentheses () after the function

Let’s continue checking the object

# Find out how big the dataset is
nrow(penguins)
ncol(penguins)
dim(penguins)
tail(penguins)

Can you figure out what each of these functions does?

The package `dplyr`

Exploring Penguins with `dplyr`

The package dplyr provides functions for data exploration and manipulation
Let’s load the package as part of the tidyverse
- I personally load all packages at the start of a script
- Add this underneath the library(palmerpenguins) line
- All functions in this section are from dplyr

library(tidyverse) # Load the complete tidyverse

We’ll use these functions to explore the penguins dataset
Then we can modify the dataset

Exploring Penguins with `dplyr`

## The `glimpse()` function is provided by dplyr
## Can be very helpful with large column numbers
glimpse(penguins)

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

So far we haven’t actually saved any objects (using <-)
Functions head(), tail(), and glimpse() are all printing to the console

Sorting Penguins

dplyr provides some Excel-like functions:
- arrange() will sort the data
- filter() will filter the data

# Sort the penguins by body mass in increasing order
arrange(penguins, body_mass_g)

# Sort the penguins by body mass in decreasing order
arrange(penguins, desc(body_mass_g))

# Sort by multiple columns, in the order passed to the function
arrange(penguins, species, body_mass_g)
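
The two ideas above can be combined, with desc() applied to just one of the sorting columns. The sketch below also shows how arrange() treats missing values, which it places at the end regardless of the sorting direction.

```r
library(dplyr)
library(palmerpenguins)

# Sort by species (in the order of the factor levels),
# then from heaviest to lightest within each species
arrange(penguins, species, desc(body_mass_g))

# Missing values are placed at the end, even when sorting in decreasing order
tail(arrange(penguins, desc(body_mass_g)))
```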

Filtering Penguins

Filtering relies on logical tests

Symbol	Description
`==`	Is exactly Equal To
`>` / `<`	Is Greater/Less Than
`>=` / `<=`	Is Greater/Less Than or Equal To
`!=`	Is Not Equal To
`is.na()`	Is Missing Value
`%in%`	Is in a set of possible values

In most languages, ! is interpreted as NOT
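
To see what these tests return, here is a small sketch applying them to a made-up vector of four values (one of them missing). Each test is applied to every value in turn, and the results are shown as comments.

```r
# Each test is applied to every value, returning TRUE, FALSE or NA
c(2, 5, NA, 8) == 5          # FALSE  TRUE    NA FALSE
c(2, 5, NA, 8) > 4           # FALSE  TRUE    NA  TRUE
is.na(c(2, 5, NA, 8))        # FALSE FALSE  TRUE FALSE
# Placing ! in front of a test reverses it
!is.na(c(2, 5, NA, 8))       #  TRUE  TRUE FALSE  TRUE
c(2, 5, NA, 8) %in% c(2, 8)  #  TRUE FALSE FALSE  TRUE
```

Notice that comparing a missing value using == or > gives NA, whereas is.na() and %in% always return TRUE or FALSE.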

Filtering Penguins

## Subset the data to those from the Island of Dream
filter(penguins, island == "Dream")

## Subset the data to those NOT from the Island of Dream
filter(penguins, island != "Dream")

## Subset the penguins to those lighter than 4000g
filter(penguins, body_mass_g < 4000)

## Find the penguins from Dream that are heavier than 4000g
filter(penguins, island == "Dream", body_mass_g > 4000)
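
A few further variations are sketched below. The operator | (meaning OR) is standard R but isn't listed in the table of logical tests, so treat it as an extra. The final two lines show that filter() drops any row where the test returns NA.

```r
library(dplyr)
library(palmerpenguins)

# %in% keeps rows where island matches any value in the set
filter(penguins, island %in% c("Dream", "Torgersen"))

# The symbol | means OR: keep rows which pass either test
filter(penguins, island == "Torgersen" | body_mass_g > 6000)

# Rows where a test returns NA are dropped, so the two penguins with a
# missing body mass appear in neither of these results
filter(penguins, body_mass_g < 4000)
filter(penguins, body_mass_g >= 4000)
```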

Slicing Penguins

filter() returns the rows that match the given criteria
slice() can be used to return rows by position

## Slice out the first 10 rows of the penguins dataset
slice(penguins, 1:10)

## Now slice out the 101st to 110th rows
slice(penguins, 101:110)

A Brief Diversion

In the two previous examples we used a sequence of consecutive values

1:10

 [1]  1  2  3  4  5  6  7  8  9 10

101:110

 [1] 101 102 103 104 105 106 107 108 109 110

We refer to one or more values in R as a vector
- These are integer vectors
- Integers are often used to denote rows/columns etc

A Brief Diversion

In R we can form a vector by combining values with the function c()

c(1, 3, 5, 7, 9)

[1] 1 3 5 7 9

# Return the first few odd numbered rows using a vector of positions
slice(penguins, c(1, 3, 5, 7, 9))

# Rows can be returned in any order
slice(penguins, c(3, 7, 5, 1))
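
Typing out every position quickly becomes tedious. As a sketch of two alternatives, the base R function seq() can generate regular sequences, and negative positions tell slice() which rows to drop. Neither is covered elsewhere in this section.

```r
library(dplyr)
library(palmerpenguins)

# seq() builds a regular sequence: from 1 to 9 in steps of 2
seq(1, 9, by = 2)

# This returns the same five rows as slice(penguins, c(1, 3, 5, 7, 9))
slice(penguins, seq(1, 9, by = 2))

# Negative positions drop rows instead of keeping them
slice(penguins, -(1:10))
```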

Selecting Penguins

filter() and slice() can be used to return rows
select() can be used to return columns

# Columns can be 'selected' by passing the required column names
select(penguins, species, island, body_mass_g)

# Columns can also be selected by position using a vector
select(penguins, c(1, 2, 6))

Using Names Or Position

Do the above lines give the same result?
Would either one be preferable?

Helper Functions

dplyr provides some helper functions to make selecting columns easier
starts_with(), ends_with() and contains() are very useful!
everything() is also surprisingly useful
any_of() and all_of() are a bit more advanced

# Select all columns that start with "bill", after the species and island columns
select(penguins, species, island, starts_with("bill"))

# Select all length-related columns, after the species and island columns
select(penguins, species, island, contains("length"))
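
For the more advanced helpers, here is a minimal sketch of all_of() and any_of(). Both take column names as a character vector (i.e. quoted names). The name "weight" is made up for illustration and is not a column in penguins.

```r
library(dplyr)
library(palmerpenguins)

# all_of() requires every name to exist, and gives an error otherwise
select(penguins, all_of(c("species", "island", "year")))

# any_of() silently ignores names that don't exist
# ("weight" is a made-up name, not a column in penguins)
select(penguins, any_of(c("species", "weight")))
```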

Helper Functions

# Place all metadata first, followed by measured values
# This line shows how useful 'everything()' can be
select(penguins, species, island, sex, year, everything())

To remove a column, we preface the selection with -

# Remove all columns that end with 'mm'
select(penguins, -ends_with("mm"))

Relocating Penguins

# Relocate is a newer addition to dplyr and can also be used to reorder columns
# The arguments .before and .after can be used to specify where to place columns
# Here we're moving columns with an underscore to after the year column
relocate(penguins, contains("_"), .after = year)

# This time, we're moving the sex and year columns to 'before' the bill columns
relocate(penguins, sex, year, .before = starts_with("bill"))
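
If neither .before nor .after is supplied, relocate() moves the selected columns to the front. A short sketch:

```r
library(dplyr)
library(palmerpenguins)

# With no position given, the selected columns become the first columns
relocate(penguins, year, sex)
```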

Renaming Penguins

When we call select(), we can rename columns on the fly

## Rename the island column as 'location', leaving the order unchanged
select(penguins, species, location = island, everything())

Alternatively, we can just use rename()

## Or just rename the individual column
rename(penguins, location = island)

Modifying Columns With `mutate()`

So far, we’ve only subset our data using various methods
mutate() is used to modify existing columns or create new ones

# Create a column called `body_mass_kg` that is the body mass in kg
mutate(penguins, body_mass_kg = body_mass_g / 1000)
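
The sketch below extends this in two directions: creating more than one column in a single call, and modifying an existing column by reusing its name. The column name flipper_length_cm is just chosen for illustration. The last line checks that the original dataset is untouched, given we never saved anything using <-.

```r
library(dplyr)
library(palmerpenguins)

# Multiple columns can be created in a single call
mutate(
  penguins,
  body_mass_kg = body_mass_g / 1000,
  flipper_length_cm = flipper_length_mm / 10
)

# Using an existing column name modifies that column in the output
# Here species changes from a factor (<fct>) to a character column (<chr>)
mutate(penguins, species = as.character(species))

# Nothing was saved, so penguins still has its original 8 columns
ncol(penguins)
```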

Exercises

Use filter() to find all female penguins
- Then find all female penguins with a flipper length greater than 215mm
Use filter() to find all penguins where sex is missing (NA)
Sort the dataset by bill_length_mm in descending order
Use select() to return the species, island, and year columns
- Repeat trying an alternative approach to your first answer
Place the year column after island and remove the sex column
Create the column bill_ratio by dividing bill length by depth

Summarising The Dataset

Obtaining Summaries

dplyr also provides functions to summarise data
- count() and summarise() are the most common
- We can tell these functions which columns to summarise by

# Count the number of penguins by species
count(penguins, species)

# Count the number of penguins by species and island
count(penguins, species, island)

# If we change the order of the columns, we get a different order in our results
count(penguins, island, species)

# The argument `sort = TRUE` will sort the results, with the largest counts first
count(penguins, species, island, sort = TRUE)

Obtaining Summaries

More nuanced summaries can be obtained using summarise()
- We now pass the grouping variable to the argument .by
- The summary column should also be given a name

# Find the mean weight of each species
summarise(penguins, mean_weight = mean(body_mass_g), .by = species)

This is the first time those missing values have caused us grief
- The argument na.rm = TRUE will tell mean() to ignore NA values

# Ignore the missing values when calculating the mean
summarise(penguins, mean_weight = mean(body_mass_g, na.rm = TRUE), .by = species)
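
The behaviour of mean() with missing values is easiest to see on a tiny made-up vector:

```r
# Any calculation involving a missing value returns a missing value
mean(c(1, 2, NA))                # NA

# With na.rm = TRUE, the NA is removed before calculating the mean
mean(c(1, 2, NA), na.rm = TRUE)  # 1.5
```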

Obtaining Summaries

We can also summarise using multiple columns
- We combine the grouping columns using the function c()

# Summarise by both species and year
summarise(
  penguins, mean_weight = mean(body_mass_g, na.rm = TRUE), .by = c(species, year)
)

Obtaining Summaries

The previous code was split across multiple lines just to fit on the slide
We could go further and place each argument on its own line for greater readability

# Summarise by both species and year
summarise(
  penguins, 
  mean_weight = mean(body_mass_g, na.rm = TRUE), 
  .by = c(species, year)
)

Obtaining Summaries

This strategy can help when creating multiple summary columns
Instead of using count() we can call n() as part of summarise()

# Summarise by both species and year, counting the number of penguins
summarise(
  penguins, 
  mean_weight = mean(body_mass_g, na.rm = TRUE), 
  total = n(),
  .by = c(species, year)
)
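
Any function which returns a single value can be used to create a summary column. As a sketch, the example below adds the standard deviation using sd() and counts the missing values by summing the TRUE values from is.na(). The column names sd_weight and n_missing are just chosen for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Multiple summaries for each species
summarise(
  penguins,
  mean_weight = mean(body_mass_g, na.rm = TRUE),
  sd_weight = sd(body_mass_g, na.rm = TRUE),
  n_missing = sum(is.na(body_mass_g)), # TRUE counts as 1, FALSE as 0
  total = n(),
  .by = species
)
```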

Grouping Arguments

The recent addition of .by has beefed up some earlier functions

# Grab the first 5 rows of each species
slice(penguins, 1:5, .by = species)
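
The same .by argument is also accepted by filter() and mutate() in recent versions of dplyr (1.1.0 or later). In this sketch the mean is calculated separately within each species; the column name species_mean is just chosen for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Keep the penguins which are heavier than the mean for their own species
filter(penguins, body_mass_g > mean(body_mass_g, na.rm = TRUE), .by = species)

# Add the species mean as a new column alongside each penguin
mutate(penguins, species_mean = mean(body_mass_g, na.rm = TRUE), .by = species)
```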

Some newer extensions of slice() are also useful for summarising data
- slice_min(), slice_max()

# Return the heaviest penguin from each species
slice_max(penguins, body_mass_g, n = 1, by = species)

Confusingly, the argument .by has become by here 🤷🏻
- The argument n = 1 says return only one penguin per species
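
A short sketch of two related calls: slice_min() works the same way for the smallest values, and the argument with_ties controls what happens when penguins are tied. By default all tied rows are returned, so we may get more than n rows per species.

```r
library(dplyr)
library(palmerpenguins)

# Return the lightest penguin from each species (including any ties)
slice_min(penguins, body_mass_g, n = 1, by = species)

# Setting with_ties = FALSE returns exactly n rows per species
slice_max(penguins, body_mass_g, n = 1, by = species, with_ties = FALSE)
```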

Conclusion

All functions so far have enabled exploration
- arrange(), filter()
- select() + helper functions
- slice(), slice_max(), slice_min()
- count(), summarise()
Many others we didn’t cover
- slice_head(), slice_tail(), slice_sample()
- Multiple join methods

Conclusion

Have never over-written our original dataset
Have never created a new object
Real world applications:
- Preparing for plotting or regression
- Summarising data for tables

Already a huge amount to remember!
We’ll be doing more exercises soon

A Word Of Caution

dplyr was written to parallel some SQL functions
\(\implies\) Uses function names from SQL
Some other packages had the same idea much earlier
- e.g. multiple packages contain a filter or select function

If either function gives unexpected output
\(\implies\) call directly from the package (aka namespace)
We can use dplyr::select() instead of select()
- Ensures we use the dplyr version
- Same for dplyr::filter()
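
As a sketch of why this matters, the stats package (which is always loaded with R) has its own filter() function which does something completely different. Calling each one by its namespace makes the difference clear.

```r
library(dplyr)
library(palmerpenguins)

# Explicitly request the dplyr version of filter()
dplyr::filter(penguins, island == "Dream")

# stats::filter() is an unrelated function which applies a linear filter
# (here a moving average across 3 values) to a series of numbers
stats::filter(1:5, rep(1 / 3, 3))
```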
