## Load the palmerpenguins package
library(palmerpenguins)
RAdelaide 2025
July 8, 2025
palmerpenguins
dplyr
tidyverse
package.R
DataExploration.R
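Type the following. The `#` symbol indicates a comment.
To clear objects such as `x` from the environment, click the broom icon.
Run the line `library(palmerpenguins)` with Ctrl + Enter (Windows/Linux) or Cmd + Enter (Mac), or with the Run button in the top right of the script editor.
The `palmerpenguins` package is loaded by `library()`. The `()` after the name mark a function call, like `sqrt(5)` as we saw earlier.
`import palmerpenguins`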
The `penguins` dataset
The `palmerpenguins` package contains the `penguins` dataset, already loaded.
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
The `penguins` dataset is a tibble with 344 rows and 8 columns (`344 x 8`).
Column types in the `penguins` dataset:
`<fct>` means ‘factor’ \(\implies\) a categorical variable
`<dbl>` means ‘double’ \(\implies\) a numeric variable
`<int>` means ‘integer’ \(\implies\) a whole number
[1] "species" "island" "bill_length_mm"
[4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
[7] "sex" "year"
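The column names printed above could come from a call such as `colnames(penguins)`. The exact call used on the slide is an assumption, since `names()` would print the same result. A minimal sketch for checking the size and column names of the dataset:

```r
library(palmerpenguins)

# Number of rows, then number of columns
dim(penguins)

# Column names as a character vector
colnames(penguins)
```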
Copy the column names after the code, then comment them out by adding `#` at the start of each line.
Commenting Multiple Lines
Ctrl + Shift + C (Win/Linux)
Cmd + Shift + C (Mac)
Remember the `()` after the function name.
dplyr
`dplyr` \(\rightarrow\) functions for data exploration and manipulation; part of the `tidyverse`
Load it after the `library(palmerpenguins)` line, then explore the `penguins` dataset with `dplyr`.
## The `glimpse()` function is provided by dplyr
## Can be very helpful with large column numbers
glimpse(penguins)
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
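Like `glimpse()`, the base R functions `head()` and `tail()` only print their result unless it is assigned. A small sketch of the difference, where the object name `first_rows` is purely illustrative:

```r
library(palmerpenguins)

# Printed to the console only; nothing is stored
head(penguins)
tail(penguins, n = 3)

# Assigned with <-, so the result is kept as a new object
first_rows <- head(penguins, n = 5)
nrow(first_rows)
```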
Nothing is stored unless it is assigned with `<-`: `head()`, `tail()`, and `glimpse()` all just print to the console.
`dplyr` provides some Excel-like functions:
`arrange()` will sort the data
`filter()` will filter the data
Filtering relies on logical tests
| Symbol | Description |
|---|---|
| `==` | Is exactly Equal To |
| `>` / `<` | Is Greater/Less Than |
| `>=` / `<=` | Is Greater/Less Than or Equal To |
| `!=` | Is Not Equal To |
| `is.na()` | Is Missing Value |
| `%in%` | Is in a set of possible values |

`!` is interpreted as NOT
## Subset the data to those from the Island of Dream
filter(penguins, island == "Dream")
## Subset the data to those NOT from the Island of Dream
filter(penguins, island != "Dream")
## Subset the penguins to those lighter than 4000g
filter(penguins, body_mass_g < 4000)
## Find the penguins from Dream that are heavier than 4000g
filter(penguins, island == "Dream", body_mass_g > 4000)
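The examples above cover `==`, `!=`, `<` and `>`. The remaining tests (`%in%`, `is.na()` and `!`) follow the same pattern, and `arrange()` can then sort the filtered rows. A short sketch, with all object names chosen only for illustration:

```r
library(dplyr)
library(palmerpenguins)

# %in% keeps rows whose value is in a set of possible values
two_islands <- filter(penguins, island %in% c("Dream", "Biscoe"))

# is.na() finds missing values; ! turns the test into NOT missing
missing_mass <- filter(penguins, is.na(body_mass_g))
known_mass <- filter(penguins, !is.na(body_mass_g))

# arrange() sorts ascending by default; desc() reverses the order
heaviest_first <- arrange(known_mass, desc(body_mass_g))
```

Every row either has a missing body mass or does not, so `missing_mass` and `known_mass` together account for all rows of `penguins`.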
`filter()` returns the rows that match the given criteria
`slice()` can be used to return rows by position
Positions are passed to `R` as a vector
In `R` we can form a vector by combining values with the function `c()`
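As a minimal sketch of the idea above, a vector of positions built with `c()` can be handed to `slice()`. The name `positions` is just an illustrative choice:

```r
library(dplyr)
library(palmerpenguins)

# Combine three row positions into one vector
positions <- c(1, 3, 5)

# Return the rows at those positions
picked <- slice(penguins, positions)

# The shorthand 1:5 builds the vector 1, 2, 3, 4, 5
first_five <- slice(penguins, 1:5)
```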
`filter()` and `slice()` can be used to return rows
`select()` can be used to return columns, using names or position
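A brief sketch of both styles of column selection. The particular columns are arbitrary examples, and the minus sign for dropping a column is standard `select()` behaviour rather than something shown in the text above:

```r
library(dplyr)
library(palmerpenguins)

# Select columns by name
by_name <- select(penguins, species, bill_length_mm)

# Select the same columns by position (1st and 3rd)
by_position <- select(penguins, 1, 3)

# A minus sign drops a column instead of keeping it
without_year <- select(penguins, -year)
```

Selecting by name tends to be more robust, since positions change whenever columns are added or reordered.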
`dplyr` provides some helper functions to make selecting columns easier
`starts_with()`, `ends_with()` and `contains()` are very useful!
`everything()` is also surprisingly useful
`any_of()` and `all_of()` are a bit more advanced
# Relocate is a newer addition to dplyr and can also be used to reorder columns
# The arguments .before and .after can be used to specify where to place columns
# Here we're moving columns with an underscore to after the year column
relocate(penguins, contains("_"), .after = year)
# This time, we're moving the sex and year columns to 'before' the bill columns
relocate(penguins, sex, year, .before = starts_with("bill"))
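The same helper functions also work inside `select()`. A small sketch, with object names chosen only for illustration:

```r
library(dplyr)
library(palmerpenguins)

# Columns whose names start with "bill"
bill_cols <- select(penguins, starts_with("bill"))

# Columns whose names end with "_mm"
mm_cols <- select(penguins, ends_with("_mm"))

# Columns whose names contain "length"
length_cols <- select(penguins, contains("length"))

# everything() means "all remaining columns", so this moves year to the front
year_first <- select(penguins, year, everything())
```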
mutate()
`mutate()` is used to modify existing columns or create new ones
Use `filter()` to find all female penguins
Use `filter()` to find all penguins where `sex` is missing (`NA`)
Use `select()` to return the species, island, and year columns
Place the `year` column after `island` and remove the `sex` column
Create the column `bill_ratio` by dividing bill length by depth
`dplyr` also provides functions to summarise data
`count()` and `summarise()` are the most common
# Count the number of penguins by species
count(penguins, species)
# Count the number of penguins by species and island
count(penguins, species, island)
# If we change the order of the columns, we get a different order in our results
count(penguins, island, species)
# The argument `sort = TRUE` will sort the results
count(penguins, species, island, sort = TRUE)
summarise()
Grouping is controlled with `.by` (several columns can be combined with `c()`)
To reproduce `count()`, we can call `n()` as part of `summarise()`
`.by` has beefed up some earlier functions
The `slice()` variants are also useful for summarising data: `slice_min()`, `slice_max()`
`.by` has become `by` here 🤷🏻
`n = 1` says return only one penguin per species
`arrange()`, `filter()`
`select()` + helper functions
`slice()`, `slice_max()`, `slice_min()`
`count()`, `summarise()`
`slice_head()`, `slice_tail()`, `slice_sample()`
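The grouped summaries described above might look something like the following sketch. It assumes dplyr 1.1.0 or later (needed for `.by` and `by`). The summary columns (`n`, `mean_mass`) and the use of `with_ties = FALSE` are illustrative choices, not taken from the original slides:

```r
library(dplyr)
library(palmerpenguins)

# One row per species: the group size via n(), plus a mean that ignores NAs
mass_summary <- summarise(
  penguins,
  n = n(),
  mean_mass = mean(body_mass_g, na.rm = TRUE),
  .by = species
)

# Grouping by two columns at once, combined with c()
by_species_island <- summarise(penguins, n = n(), .by = c(species, island))

# slice_max() uses `by` rather than `.by`; n = 1 keeps the heaviest penguin
# per species, and with_ties = FALSE guarantees exactly one row per group
heaviest <- slice_max(penguins, body_mass_g, n = 1, by = species, with_ties = FALSE)
```

Without `with_ties = FALSE`, `slice_max()` would return every penguin tied for the top value, which could be more than one row per species.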
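`dplyr` was written to parallel some SQL functions
Other packages may also provide a `filter` or `select` function
Writing `dplyr::select()` instead of `select()` makes sure the `dplyr` version is called; the same goes for `dplyr::filter()`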
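A short sketch of the namespaced form. Base R's `stats` package also has a function called `filter()`, which is one well-known source of this kind of clash; that detail is general R knowledge rather than something stated in the original slides:

```r
library(palmerpenguins)

# The package::function form works even without library(dplyr)
# and always calls the dplyr version
dream <- dplyr::filter(penguins, island == "Dream")
dream_cols <- dplyr::select(dream, species, island)
```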