library(tidyverse)
Spreadsheet-Like Data
ASI: Introduction to R
Working With
Spreadsheet-Like Data
Section Setup
- Clear your
R
Environment - Create a new script called ‘SpreadsheetLikeData.R’
- Load the tidyverse
We’ll cover:
SQL-
andExcel-
like functions indplyr
- Pivot Tables and Manipulation using
tidyr
- Chaining function together using the pipe operator
The Pipe Operator
- In real-world analyses we often perform multiple steps
- The classic
R
approach was to create a new object at each step- Leads to congested environments with multiple similar objects
- e.g.
data_raw
,data_sorted
,data_subset
- In R v4.1.0 the pipe operator was introduced:
|>
- Takes the output of one function and connects it to the input of another
- An assembly line \(\implies\) like taking the output of one process and sending straight to the input of another process
Returning To The Penguins
- We’ll use
my_penguins
for this session
<- read_csv("data/my_penguins.csv") my_penguins
- This is a
tibble()
(i.e. a conveniently wrappeddata.frame
) - We can easily get key attributes about these objects
dim(my_penguins)
nrow(my_penguins)
ncol(my_penguins)
colnames(my_penguins)
Using the Pipe Operator
- Once we move beyond 2 piped functions, it’s very common to write each one on a separate line
- We could also perform these operations the pipe (
|>
)
|> colnames() my_penguins
[1] "species" "island" "bill_length_mm"
[4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
[7] "sex" "year"
- This approach combines nicely with
stringr
|>
my_penguins colnames() |>
str_subset(pattern = "bill")
[1] "bill_length_mm" "bill_depth_mm"
Subsetting a Data Frame
- We can subset a
tibble
using[,]
just like adata.frame
1:3, c("species", "island", "body_mass_g")] my_penguins[
# A tibble: 3 × 3
species island body_mass_g
<chr> <chr> <dbl>
1 Adelie Torgersen 3750
2 Adelie Torgersen 3800
3 Adelie Torgersen 3250
Excel-Like Functions in dplyr
- Two common tasks in Excel are 1) sorting a column and 2) applying a filter
- In
dplyr
the relevant functons arearrange()
andfilter()
## Show the penguins in order of weight. This will be ascending by default
|>
my_penguins arrange(body_mass_g)
# A tibble: 333 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Chinstrap Dream 46.9 16.6 192 2700
2 Adelie Biscoe 36.5 16.6 181 2850
3 Adelie Biscoe 36.4 17.1 184 2850
4 Adelie Biscoe 34.5 18.1 187 2900
5 Adelie Dream 33.1 16.1 178 2900
6 Adelie Torgers… 38.6 17 188 2900
7 Chinstrap Dream 43.2 16.6 187 2900
8 Adelie Biscoe 37.9 18.6 193 2925
9 Adelie Dream 37 16.9 185 3000
10 Adelie Dream 37.3 16.8 192 3000
# ℹ 323 more rows
# ℹ 2 more variables: sex <chr>, year <dbl>
Applying a Filter
- We can subset using
filter()
## Show only the penguins heavier than 4kg
|>
my_penguins filter(body_mass_g > 4000)
# A tibble: 167 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.2 19.6 195 4675
2 Adelie Torgersen 34.6 21.1 198 4400
3 Adelie Torgersen 42.5 20.7 197 4500
4 Adelie Torgersen 46 21.5 194 4200
5 Adelie Dream 39.2 21.1 196 4150
6 Adelie Dream 39.8 19.1 184 4650
7 Adelie Dream 44.1 19.7 196 4400
8 Adelie Dream 39.6 18.8 190 4600
9 Adelie Dream 42.3 21.2 191 4150
10 Adelie Biscoe 40.1 18.9 188 4300
# ℹ 157 more rows
# ℹ 2 more variables: sex <chr>, year <dbl>
Combining Functions
## Match the patterns Ade or Chin to show only the Adelie or Chinstrap penguins
## Show the results in decreasing order of weight
|>
my_penguins filter(str_detect(species, "Ade|Chin")) |>
arrange(desc(body_mass_g))
# A tibble: 214 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Chinstrap Dream 52 20.7 210 4800
2 Adelie Biscoe 43.2 19 197 4775
3 Adelie Biscoe 41 20 203 4725
4 Adelie Torgers… 42.9 17.6 196 4700
5 Adelie Torgers… 39.2 19.6 195 4675
6 Adelie Dream 39.8 19.1 184 4650
7 Adelie Dream 39.6 18.8 190 4600
8 Adelie Biscoe 45.6 20.3 191 4600
9 Chinstrap Dream 52.8 20 205 4550
10 Adelie Torgers… 42.5 20.7 197 4500
# ℹ 204 more rows
# ℹ 2 more variables: sex <chr>, year <dbl>
Summarising Data
- Creating tables of counts is simple \(\implies\) just pass the column names to
count()
|>
my_penguins count(species, island)
# A tibble: 5 × 3
species island n
<chr> <chr> <int>
1 Adelie Biscoe 44
2 Adelie Dream 55
3 Adelie Torgersen 47
4 Chinstrap Dream 68
5 Gentoo Biscoe 119
Modifying Columns
- So far we’ve only selected, arranged, filtered and summarised existing columns
mutate()
can be used to:- Modify existing columns, or
- Add a new column
- Do we think any of these columns are categorical
- We could turn these into factors!!!
Reshaping Data with tidyr
Using tidyr
- The data structures so far have been great for plotting
- Generally we’ve seen data in long form
- Each value type has it’s own column
- Summary tables can be a little different \(\implies\)
tidyr
- Pivoting Data with
pivot_longer()
andpivot_wider()
- Merging/Separating Columns with
unite()
andseparate()
- Pivoting Data with
- Earlier we found the mean & sd of the weight by species and island
- Let’s include sex then show results in seprate columns
Changing From Long To Wide Form
## Calculate the mean weight by species, island and sex
|>
my_penguins summarise(
mn_weight = mean(body_mass_g),
.by = c(species, island, sex)
)
## Calculate the mean weight by species, island and sex
## Then place values for male & female in separate columns
|>
my_penguins summarise(
mn_weight = mean(body_mass_g),
.by = c(species, island, sex)
|>
) pivot_wider(names_from = sex, values_from = mn_weight)
- This is a great summary table already \(\implies\) how do we include SD?
Combining Columns
- Now we have the mean & standard error
- How do we combine these for a nice printed table \(\implies\)
unite()
- The first argument is the name of the new column
- Then use
tidyselect
helpers to select the columns to “unite”
my_penguins |>
summarise(
mn_weight = mean(body_mass_g),
se_weight = sd(body_mass_g) / sqrt(n()),
.by = c(species, island, sex)
) |>
pivot_wider(names_from = sex, values_from = ends_with("weight")) |>
## Include the underscore for the male columns, but not for the female columns
unite(male, ends_with("_male"), sep = " +/- ") |>
unite(female, ends_with("female"), sep = " +/- ")
Are we happy with this? No. We need to round the values
Applying Functions To Multiple Columns
- Sometimes we need to apply the same function to multiple columns
- The function
across()
is perfect for this situation- Can be applied inside
mutate()
orsummarise()
- Can be applied inside
- First we specify the columns using the
tidyselect
helpers - Then we provide the function to apply
across(.cols, .fns)
Shifting From Wide to Long Form
- The same principles apply when using
pivot_longer()
- Won’t explore in as much detail
## Place all measurements taken in 'mm' into the same column
|>
my_penguins pivot_longer(ends_with("mm"), names_to = "measurement", values_to = "mm")
# A tibble: 999 × 7
species island body_mass_g sex year measurement mm
<fct> <fct> <dbl> <fct> <fct> <chr> <dbl>
1 Adelie Torgersen 3750 male 2007 bill_length_mm 39.1
2 Adelie Torgersen 3750 male 2007 bill_depth_mm 18.7
3 Adelie Torgersen 3750 male 2007 flipper_length_mm 181
4 Adelie Torgersen 3800 female 2007 bill_length_mm 39.5
5 Adelie Torgersen 3800 female 2007 bill_depth_mm 17.4
6 Adelie Torgersen 3800 female 2007 flipper_length_mm 186
7 Adelie Torgersen 3250 female 2007 bill_length_mm 40.3
8 Adelie Torgersen 3250 female 2007 bill_depth_mm 18
9 Adelie Torgersen 3250 female 2007 flipper_length_mm 195
10 Adelie Torgersen 3450 female 2007 bill_length_mm 36.7
# ℹ 989 more rows
Separating Values Into Multiple Columns
## Place all measurements taken in 'mm' into the same column
my_penguins |>
pivot_longer(ends_with("mm"), names_to = "measurement", values_to = "mm") |>
## Now separate the values into body part & measurement type
separate(measurement, c("body_part", "measurement_type", "junk"), sep = "_") |>
## We can get rid of the junk column by using the `-` symbol before the column name
dplyr::select(-junk)
# A tibble: 999 × 8
species island body_mass_g sex year body_part measurement_type mm
<fct> <fct> <dbl> <fct> <fct> <chr> <chr> <dbl>
1 Adelie Torgersen 3750 male 2007 bill length 39.1
2 Adelie Torgersen 3750 male 2007 bill depth 18.7
3 Adelie Torgersen 3750 male 2007 flipper length 181
4 Adelie Torgersen 3800 female 2007 bill length 39.5
5 Adelie Torgersen 3800 female 2007 bill depth 17.4
6 Adelie Torgersen 3800 female 2007 flipper length 186
7 Adelie Torgersen 3250 female 2007 bill length 40.3
8 Adelie Torgersen 3250 female 2007 bill depth 18
9 Adelie Torgersen 3250 female 2007 flipper length 195
10 Adelie Torgersen 3450 female 2007 bill length 36.7
# ℹ 989 more rows
Closing Comments
The Palmer Penguins
- The data we’ve been working with comes from the package
palmerpenguins
- https://allisonhorst.github.io/palmerpenguins/index.html
my_penguins
had allNA
values removed for convenience
library(palmerpenguins)
penguins
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
Function Conflicts
- Functions in
dplyr
are often based onSQL
functionsdplyr
was first released in 2014 1
- Some other packages already contained a
select()
,filter()
etc- Especially those with a SQL back-end
Footnotes
https://hadley.github.io/25-tidyverse-history/index-preview.html#key-innovations↩︎