## Load the palmerpenguins package
library(palmerpenguins)
Exploring Data In R
RAdelaide 2025
The package palmerpenguins
Introducing The Penguins
- We’ll be looking at the “Palmer Penguins” dataset
- Taken from https://allisonhorst.github.io/palmerpenguins/index.html
- 3 species of penguins from the Palmer Archipelago, Antarctica
- Various physiological measurements
Exploring The Penguins
- We won’t be creating any objects in this section
- Learning how to explore a dataset using dplyr
  - For organising data
  - For creating summary tables
  - To prepare for creating plots & figures
  - Is a core tidyverse package
- We’ll cover a huge amount of ground
- Hopefully the exercises & challenges help
Starting An R Script
- Best practice is to ALWAYS record your code
- Today we’ll use an R script
- Is a plain text file
- Is a combination of code and comments
- The filename should end with .R
- Nothing we enter in the script is executed until we intentionally execute the code
Executing Code
- So far, no code has been executed from this script
- Check your Environment Tab to see if there are any objects
- If there is an object (most likely x) \(\implies\) click the broom icon
  - This will clear any existing objects from the environment
- Place the cursor on the line of code library(palmerpenguins), then either
  - Use the keyboard shortcut Ctrl + Enter (Windows/Linux) or Cmd + Enter (Mac), or
  - Click the Run button in the top right of the script editor
What have we done so far?
- We have simply loaded the palmerpenguins package
  - We called the function library()
  - This loads all the functions and the data in a requested package
  - The package name appears inside the parentheses ()
  - Very similar to calling sqrt(5) as we saw earlier
- For Python users the equivalent would be import palmerpenguins
The penguins dataset
- The palmerpenguins package contains the penguins dataset, already loaded
- Add the comment and code below, then execute
# Let's look at the penguins dataset
penguins
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
- Maybe get people to click on the Global Environment Tab
Exploring Penguins
- A common initial data exploration task is to get a summary of the data
# The column names are:
colnames(penguins)
[1] "species" "island" "bill_length_mm"
[4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
[7] "sex" "year"
Copy the column names after the code, then comment
- Highlight the output in the console
- Copy & paste into the script
- Make a comment by adding # at the start of each line
Commenting Multiple Lines
- Comments can be toggled on/off across multiple lines using Ctrl + Shift + C (Win/Linux) or Cmd + Shift + C (Mac)
Calling Functions
- Notice that we placed the object inside the parentheses () after the function
- Let’s continue checking the object
# Find out how big the dataset is
nrow(penguins)
ncol(penguins)
dim(penguins)
tail(penguins)
- Can you figure out what each of these functions does?
The package dplyr
Exploring Penguins with dplyr
- The package dplyr \(\rightarrow\) functions for data exploration and manipulation
- Let’s load the package as part of the tidyverse
  - I personally load all packages at the start of a script
  - Add this underneath the library(palmerpenguins) line
- All functions in this section are from dplyr
library(tidyverse) # Load the complete tidyverse
- We’ll use these functions to explore the penguins dataset
  - Then we can modify the dataset
- dplyr was written initially by Hadley Wickham
  - He’s originally from NZ so he knows how to spell things correctly
Sorting Penguins
- dplyr provides some Excel-like functions:
  - arrange() will sort the data
  - filter() will filter the data
# Sort the penguins by body mass in increasing order
arrange(penguins, body_mass_g)
# Sort the penguins by body mass in decreasing order
arrange(penguins, desc(body_mass_g))
# Sort multiple columns in the order passed to the function
arrange(penguins, species, body_mass_g)
Filtering Penguins
Filtering relies on logical tests
Symbol | Description |
---|---|
== | Is exactly Equal To |
> / < | Is Greater/Less Than |
>= / <= | Is Greater/Less Than or Equal To |
!= | Is Not Equal To |
is.na() | Is Missing Value |
%in% | Is in a set of possible values |
- In most languages, ! is interpreted as NOT
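Each of the logical tests in the table can be passed to filter() together with a column name. The lines below are a minimal sketch of how that might look; the columns and threshold values are chosen purely for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Keep only the Adelie penguins
filter(penguins, species == "Adelie")
# Keep penguins with a bill longer than 50mm
filter(penguins, bill_length_mm > 50)
# Keep penguins from either of two islands
filter(penguins, island %in% c("Biscoe", "Dream"))
# Keep penguins where the bill length is missing
filter(penguins, is.na(bill_length_mm))
# Placing ! in front reverses the test: bill length is NOT missing
filter(penguins, !is.na(bill_length_mm))
# Multiple tests separated by commas must all be TRUE for a row to be kept
filter(penguins, species == "Gentoo", body_mass_g > 5500)
```

- Rows where a test returns NA are dropped by filter(), which is why is.na() is needed to find missing values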
Slicing Penguins
- filter() returns the rows that match given criteria
- slice() can be used to return rows by position
## Slice out the first 10 rows of the penguins dataset
slice(penguins, 1:10)
## Now slice out the 101st to 110th rows
slice(penguins, 101:110)
A Brief Diversion
- In the two previous examples we used a sequence of consecutive values
1:10
[1] 1 2 3 4 5 6 7 8 9 10
101:110
[1] 101 102 103 104 105 106 107 108 109 110
- We refer to one or more values in R as a vector
  - These are integer vectors
- Integers are often used to denote rows/columns etc
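Vectors can be inspected with a couple of base R functions, and the positions passed to slice() don't need to be consecutive. A short sketch, with the positions 1, 5 and 10 picked only as an example:

```r
library(dplyr)
library(palmerpenguins)

# The colon operator creates an integer vector
class(1:10)
# A vector knows how many values it contains
length(101:110)
# c() combines individual values into a vector, which need not be consecutive
c(1, 5, 10)
# Any vector of positions can be passed to slice()
slice(penguins, c(1, 5, 10))
```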
Selecting Penguins
- filter() and slice() can be used to return rows
- select() can be used to return columns
# Columns can be 'selected' by passing the required column names
select(penguins, species, island, body_mass_g)
# Columns can also be selected by position using a vector
select(penguins, c(1, 2, 6))
Using Names Or Position
- Do the above lines give the same result?
- Would either one be preferable?
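One possible way to check this is with identical(), which compares two objects. The second call below is an illustrative assumption about what might go wrong: it reorders the columns first, so the same positions pick out different columns.

```r
library(dplyr)
library(palmerpenguins)

# identical() returns TRUE when two objects are exactly the same
identical(
  select(penguins, species, island, body_mass_g),
  select(penguins, c(1, 2, 6))
)
# After moving year to the front, position 6 no longer refers to body_mass_g
select(relocate(penguins, year), c(1, 2, 6))
```

- Selecting by name keeps working if the column order changes, and is easier to read back later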
Helper Functions
- dplyr provides some helper functions to make selecting columns easier
  - starts_with(), ends_with() and contains() are very useful!
  - everything() is also surprisingly useful
  - any_of() and all_of() are a bit more advanced
# Select all columns that start with "bill", after the species and island columns
select(penguins, species, island, starts_with("bill"))
# Select all length-related columns, after the species and island columns
select(penguins, species, island, contains("length"))
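The remaining helpers work in a similar way. This is a rough sketch only; the name "not_a_column" is made up to show how any_of() treats a column that doesn't exist.

```r
library(dplyr)
library(palmerpenguins)

# Select all columns with names ending in "_mm"
select(penguins, ends_with("_mm"))
# Move year to the front, then keep everything else in the existing order
select(penguins, year, everything())
# any_of() takes a character vector and silently skips names that don't exist
# ("not_a_column" is a made-up name for illustration)
select(penguins, any_of(c("species", "sex", "not_a_column")))
# all_of() is stricter: every name must exist, otherwise we get an error
select(penguins, all_of(c("species", "sex")))
```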
Relocating Penguins
# Relocate is a newer addition to dplyr and can also be used to reorder columns
# The arguments .before and .after can be used to specify where to place columns
# Here we're moving columns with an underscore to after the year column
relocate(penguins, contains("_"), .after = year)
# This time, we're moving the sex and year columns to 'before' the bill columns
relocate(penguins, sex, year, .before = starts_with("bill"))
Renaming Penguins
- When we call select, we can rename columns on the fly
## Rename the island column as 'location', leaving the order unchanged
select(penguins, species, location = island, everything())
- Alternatively, we can just use rename()
## Or just rename the individual column
rename(penguins, location = island)
Modifying Columns With mutate()
- So far, we’ve only subset our data using various methods
- mutate() is used to modify existing columns or create new ones
# Create a column called `body_mass_kg` that is the body mass in kg
mutate(penguins, body_mass_kg = body_mass_g / 1000)
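The example above creates a new column. As a sketch of the other case, using the name of an existing column overwrites it, and several columns can be handled in a single call. The column name flipper_length_cm is invented here for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Using an existing column name overwrites that column
# Here species is converted from a factor to plain text
mutate(penguins, species = as.character(species))
# Multiple columns can be created in one call, separated by commas
mutate(
  penguins,
  flipper_length_cm = flipper_length_mm / 10,
  body_mass_kg = body_mass_g / 1000
)
```

- Nothing is saved unless we assign the output to an object, so penguins itself is unchanged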
Exercises
- Use filter() to find all female penguins
  - Then find all female penguins with a flipper length greater than 215mm
- Use filter() to find all penguins where sex is missing (NA)
- Sort the dataset by bill length in descending order
- Use select() to return the species, island, and year columns
  - Repeat trying an alternative approach to your first answer
- Place the year column after island and remove the sex column
- Create the column bill_ratio by dividing bill length by depth
Summarising The Dataset
Obtaining Summaries
- dplyr also provides functions to summarise data
  - count() and summarise() are the most common
  - We can tell these functions which columns to summarise by
# Count the number of penguins by species
count(penguins, species)
# Count the number of penguins by species and island
count(penguins, species, island)
# If we change the order of the columns, we get a different order in our results
count(penguins, island, species)
# The argument `sort = TRUE` will sort the results
count(penguins, species, island, sort = TRUE)
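summarise() is mentioned above but only count() is shown, so here is a minimal sketch of how it might be called. The output column names n and mean_body_mass_g are chosen for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Summarise the whole dataset into a single row
# n() counts the rows; na.rm = TRUE ignores missing values when averaging
summarise(
  penguins,
  n = n(),
  mean_body_mass_g = mean(body_mass_g, na.rm = TRUE)
)
# Without na.rm = TRUE, a single missing value makes the mean NA
summarise(penguins, mean_body_mass_g = mean(body_mass_g))
```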
Grouping Arguments
- The recent addition of .by has beefed up some earlier functions
# Grab the first 5 rows of each species
slice(penguins, 1:5, .by = species)
- Some newer extensions of slice() are also useful for summarising data
  - slice_min(), slice_max()
# Return the heaviest penguin from each species
slice_max(penguins, body_mass_g, n = 1, by = species)
- Confusingly, the argument .by has become by here 🤷🏻
- The argument n = 1 says return only one penguin per species
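The same grouping arguments can be sketched for summarise() and slice_min(). This assumes a version of dplyr recent enough to support .by (1.1.0 or later), and the output column name is again just an example.

```r
library(dplyr)
library(palmerpenguins)

# Average body mass for each species, using .by
summarise(
  penguins,
  mean_body_mass_g = mean(body_mass_g, na.rm = TRUE),
  .by = species
)
# Return the lightest penguin from each species, using by
slice_min(penguins, body_mass_g, n = 1, by = species)
```

- Ties are kept by default, so slice_min() and slice_max() may return more than one row per group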
Conclusion
- All functions so far have enabled exploration
  - arrange(), filter()
  - select() + helper functions
  - slice(), slice_max(), slice_min()
  - count(), summarise()
- Many others we didn’t cover
  - slice_head(), slice_tail(), slice_sample()
  - Multiple join methods
A Word Of Caution
- dplyr was written to parallel some SQL functions \(\implies\) Uses function names from SQL
- Some other packages had the same idea much earlier
  - e.g. multiple packages contain a filter or select function
- If either function gives unexpected output \(\implies\) call directly from the package (aka namespace)
- We can use dplyr::select() instead of select()
  - Ensures we use the dplyr version
- Same for dplyr::filter()
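As a sketch of calling directly from the namespace: the stats package that ships with R is one example of another package with its own filter(), and environmentName() is one way to check where a function comes from.

```r
library(dplyr)
library(palmerpenguins)

# Calling with the package name guarantees the dplyr versions are used
dplyr::filter(penguins, species == "Chinstrap")
dplyr::select(penguins, species, island)
# Base R ships its own, unrelated filter() in the stats package
# environmentName() reports which package a function belongs to
environmentName(environment(stats::filter))
environmentName(environment(dplyr::filter))
```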