library(tidyverse)
ASI: Introduction to R
September 3, 2025
R
Environment
We’ll cover:
SQL-
and Excel-
like functions in dplyr
tidyr
R
approach was to create a new object at each step
data_raw
, data_sorted
, data_subset
|>
R
ran an assembly line making toy carsmy_penguins
for this session|>
)[1] "species" "island" "bill_length_mm"
[4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
[7] "sex" "year"
tibble
using [,]
just like a data.frame
dplyr
provides two separate functions for thisselect()
allows for selecting columns by position or nameslice()
allows for cutting out rows by position only
filter()
(coming soon) allows for returning rows by a key valuestarts_with()
, ends_with()
contains()
, matches()
all_of()
, any_of()
everything()
where()
any_of()
and all_of()
take character vectorsdplyr
dplyr
the relevant functons are arrange()
and filter()
## Show the penguins in order of weight. This will be ascending by default
my_penguins |>
arrange(body_mass_g)
# A tibble: 333 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Chinstrap Dream 46.9 16.6 192 2700
2 Adelie Biscoe 36.5 16.6 181 2850
3 Adelie Biscoe 36.4 17.1 184 2850
4 Adelie Biscoe 34.5 18.1 187 2900
5 Adelie Dream 33.1 16.1 178 2900
6 Adelie Torgers… 38.6 17 188 2900
7 Chinstrap Dream 43.2 16.6 187 2900
8 Adelie Biscoe 37.9 18.6 193 2925
9 Adelie Dream 37 16.9 185 3000
10 Adelie Dream 37.3 16.8 192 3000
# ℹ 323 more rows
# ℹ 2 more variables: sex <chr>, year <dbl>
dplyr
desc()
# A tibble: 333 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Gentoo Biscoe 49.2 15.2 221 6300
2 Gentoo Biscoe 59.6 17 230 6050
3 Gentoo Biscoe 51.1 16.3 220 6000
4 Gentoo Biscoe 48.8 16.2 222 6000
5 Gentoo Biscoe 45.2 16.4 223 5950
6 Gentoo Biscoe 49.8 15.9 229 5950
7 Gentoo Biscoe 48.4 14.6 213 5850
8 Gentoo Biscoe 49.3 15.7 217 5850
9 Gentoo Biscoe 55.1 16 230 5850
10 Gentoo Biscoe 49.5 16.2 229 5800
# ℹ 323 more rows
# ℹ 2 more variables: sex <chr>, year <dbl>
filter()
# A tibble: 167 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.2 19.6 195 4675
2 Adelie Torgersen 34.6 21.1 198 4400
3 Adelie Torgersen 42.5 20.7 197 4500
4 Adelie Torgersen 46 21.5 194 4200
5 Adelie Dream 39.2 21.1 196 4150
6 Adelie Dream 39.8 19.1 184 4650
7 Adelie Dream 44.1 19.7 196 4400
8 Adelie Dream 39.6 18.8 190 4600
9 Adelie Dream 42.3 21.2 191 4150
10 Adelie Biscoe 40.1 18.9 188 4300
# ℹ 157 more rows
# ℹ 2 more variables: sex <chr>, year <dbl>
filter()
relies on logical tests being performed on the column of interestSymbol | Meaning |
---|---|
== |
Is equal to |
!= |
Is NOT equal to |
>= / <= |
Is greater/less than or equal to |
> / < |
Is greater/less than |
%in% |
Is IN a set of values |
is.na() |
Is a missing value (i.e. NA ) |
!
means NOT## Show only the Chinstrap penguins heavier than 4kg
my_penguins |>
filter(species == "Chinstrap", body_mass_g > 4000)
# A tibble: 15 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Chinstrap Dream 46 18.9 195 4150
2 Chinstrap Dream 52 18.1 201 4050
3 Chinstrap Dream 50.5 19.6 201 4050
4 Chinstrap Dream 49.2 18.2 195 4400
5 Chinstrap Dream 52 19 197 4150
6 Chinstrap Dream 52.8 20 205 4550
7 Chinstrap Dream 54.2 20.8 201 4300
8 Chinstrap Dream 51 18.8 203 4100
9 Chinstrap Dream 52 20.7 210 4800
10 Chinstrap Dream 53.5 19.9 205 4500
11 Chinstrap Dream 50.8 18.5 201 4450
12 Chinstrap Dream 49 19.6 212 4300
13 Chinstrap Dream 50.7 19.7 203 4050
14 Chinstrap Dream 49.3 19.9 203 4050
15 Chinstrap Dream 50.8 19 210 4100
# ℹ 2 more variables: sex <chr>, year <dbl>
## Show the Adelie or Chinstrap penguins
my_penguins |>
filter(species %in% c("Adelie", "Chinstrap"))
# A tibble: 214 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen 36.7 19.3 193 3450
5 Adelie Torgersen 39.3 20.6 190 3650
6 Adelie Torgersen 38.9 17.8 181 3625
7 Adelie Torgersen 39.2 19.6 195 4675
8 Adelie Torgersen 41.1 17.6 182 3200
9 Adelie Torgersen 38.6 21.2 191 3800
10 Adelie Torgersen 34.6 21.1 198 4400
# ℹ 204 more rows
# ℹ 2 more variables: sex <chr>, year <dbl>
## Alternatively, exclude the Gentoo penguins. This does the same as previously
my_penguins |>
filter(species != "Gentoo")
# A tibble: 214 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen 36.7 19.3 193 3450
5 Adelie Torgersen 39.3 20.6 190 3650
6 Adelie Torgersen 38.9 17.8 181 3625
7 Adelie Torgersen 39.2 19.6 195 4675
8 Adelie Torgersen 41.1 17.6 182 3200
9 Adelie Torgersen 38.6 21.2 191 3800
10 Adelie Torgersen 34.6 21.1 198 4400
# ℹ 204 more rows
# ℹ 2 more variables: sex <chr>, year <dbl>
str_detect()
can also play a role here## Match the patterns Ade or Chin to show only the Adelie or Chinstrap penguins
my_penguins |>
filter(str_detect(species, "Ade|Chin"))
# A tibble: 214 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen 36.7 19.3 193 3450
5 Adelie Torgersen 39.3 20.6 190 3650
6 Adelie Torgersen 38.9 17.8 181 3625
7 Adelie Torgersen 39.2 19.6 195 4675
8 Adelie Torgersen 41.1 17.6 182 3200
9 Adelie Torgersen 38.6 21.2 191 3800
10 Adelie Torgersen 34.6 21.1 198 4400
# ℹ 204 more rows
# ℹ 2 more variables: sex <chr>, year <dbl>
## Match the patterns Ade or Chin to show only the Adelie or Chinstrap penguins
## Show the results in decreasing order of weight
my_penguins |>
filter(str_detect(species, "Ade|Chin")) |>
arrange(desc(body_mass_g))
# A tibble: 214 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Chinstrap Dream 52 20.7 210 4800
2 Adelie Biscoe 43.2 19 197 4775
3 Adelie Biscoe 41 20 203 4725
4 Adelie Torgers… 42.9 17.6 196 4700
5 Adelie Torgers… 39.2 19.6 195 4675
6 Adelie Dream 39.8 19.1 184 4650
7 Adelie Dream 39.6 18.8 190 4600
8 Adelie Biscoe 45.6 20.3 191 4600
9 Chinstrap Dream 52.8 20 205 4550
10 Adelie Torgers… 42.5 20.7 197 4500
# ℹ 204 more rows
# ℹ 2 more variables: sex <chr>, year <dbl>
count()
## Create a barplot with error bars
my_penguins |>
summarise(
mn_weight = mean(body_mass_g),
sd_weight = sd(body_mass_g),
.by = species
) |>
ggplot(
aes(species, mn_weight, fill = species)
) +
geom_col() +
geom_errorbar(
aes(
ymin = mn_weight - sd_weight,
ymax = mn_weight + sd_weight
),
width = 0.3
) +
scale_fill_brewer(palette = "Set1") +
theme_bw()
mutate()
can be used to:
my_penguins |>
mutate(
species = fct(species),
island = fct(island),
## The default order for `fct()` will be in order of appearance
## Set levels manually to place female first
sex = fct(sex, levels = c("female", "male")),
## `fct()` requires a character vector
## We can also pipe internally!!!
year = year |> as.character() |> fct()
)
Have we changed the original data?
No, we’ve just changed it on the fly as we inspect it
my_penguins <- read_csv("data/my_penguins.csv") |>
mutate(
species = fct(species),
island = fct(island),
## The default order for `fct()` will be in order of appearance
## Set levels manually to place female first
sex = fct(sex, levels = c("female", "male")),
## `fct()` requires a character vector
## We can also pipe internally!!!
year = year |> as.character() |> fct()
)
tidyr
tidyr
tidyr
pivot_longer()
and pivot_wider()
unite()
and separate()
tidyselect
helper ends_with()
we saw earlier will work here toomy_penguins |>
summarise(
mn_weight = mean(body_mass_g),
## This is a fitted value so standard error may be more appropriate
se_weight = sd(body_mass_g) / sqrt(n()),
.by = c(species, island, sex)
) |>
## Use `ends_with("weight")` to pivot the appropriate columns
pivot_wider(names_from = sex, values_from = ends_with("weight"))
unite()
tidyselect
helpers to select the columns to “unite”my_penguins |>
summarise(
mn_weight = mean(body_mass_g),
se_weight = sd(body_mass_g) / sqrt(n()),
.by = c(species, island, sex)
) |>
pivot_wider(names_from = sex, values_from = ends_with("weight")) |>
## Include the underscore for the male columns, but not for the female columns
unite(male, ends_with("_male"), sep = " +/- ") |>
unite(female, ends_with("female"), sep = " +/- ")
Are we happy with this? No. We need to round the values
across()
is perfect for this situation
mutate()
or summarise()
tidyselect
helpersacross(.cols, .fns)
across(ends_with("male"), round)
within a mutate()
my_penguins |>
summarise(
mn_weight = mean(body_mass_g),
## This is a fitted value so standard error may be more appropriate
se_weight = sd(body_mass_g) / sqrt(n()),
.by = c(species, island, sex)
) |>
## Use `ends_with("weight")` to pivot the appropriate columns
pivot_wider(names_from = sex, values_from = ends_with("weight")) |>
## Apply the function `round()` to all four columns ending with male
## This will default to zero digits, which is fine for this table
mutate(across(ends_with("male"), round)) |>
unite(male, ends_with("_male"), sep = " +/- ") |>
unite(female, ends_with("female"), sep = " +/- ")
pivot_longer()
## Place all measurements taken in 'mm' into the same column
my_penguins |>
pivot_longer(ends_with("mm"), names_to = "measurement", values_to = "mm")
# A tibble: 999 × 7
species island body_mass_g sex year measurement mm
<fct> <fct> <dbl> <fct> <fct> <chr> <dbl>
1 Adelie Torgersen 3750 male 2007 bill_length_mm 39.1
2 Adelie Torgersen 3750 male 2007 bill_depth_mm 18.7
3 Adelie Torgersen 3750 male 2007 flipper_length_mm 181
4 Adelie Torgersen 3800 female 2007 bill_length_mm 39.5
5 Adelie Torgersen 3800 female 2007 bill_depth_mm 17.4
6 Adelie Torgersen 3800 female 2007 flipper_length_mm 186
7 Adelie Torgersen 3250 female 2007 bill_length_mm 40.3
8 Adelie Torgersen 3250 female 2007 bill_depth_mm 18
9 Adelie Torgersen 3250 female 2007 flipper_length_mm 195
10 Adelie Torgersen 3450 female 2007 bill_length_mm 36.7
# ℹ 989 more rows
## Place all measurements taken in 'mm' into the same column
my_penguins |>
pivot_longer(ends_with("mm"), names_to = "measurement", values_to = "mm") |>
## Now separate the values into body part & measurement type
separate(measurement, c("body_part", "measurement_type", "junk"), sep = "_") |>
## We can get rid of the junk column by using the `-` symbol before the column name
dplyr::select(-junk)
# A tibble: 999 × 8
species island body_mass_g sex year body_part measurement_type mm
<fct> <fct> <dbl> <fct> <fct> <chr> <chr> <dbl>
1 Adelie Torgersen 3750 male 2007 bill length 39.1
2 Adelie Torgersen 3750 male 2007 bill depth 18.7
3 Adelie Torgersen 3750 male 2007 flipper length 181
4 Adelie Torgersen 3800 female 2007 bill length 39.5
5 Adelie Torgersen 3800 female 2007 bill depth 17.4
6 Adelie Torgersen 3800 female 2007 flipper length 186
7 Adelie Torgersen 3250 female 2007 bill length 40.3
8 Adelie Torgersen 3250 female 2007 bill depth 18
9 Adelie Torgersen 3250 female 2007 flipper length 195
10 Adelie Torgersen 3450 female 2007 bill length 36.7
# ℹ 989 more rows
palmerpenguins
my_penguins
had all NA
values removed for convenience# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
dplyr
are often based on SQL
functions
dplyr
was first released in 2014 1select()
, filter()
etc
dplyr
function gives a strange error select()
, filter()
, slice()
, count()
, n()
dplyr
using the namespace
dplyr::select()
, dplyr::filter()
, dplyr::count()
etcnamespace
is the formal title for all function names within a packageconflicted
also enables setting of which function to use