library(tidyverse)
ASI: Introduction to R
September 3, 2025
Wrangling text is a common task in R
stringr and forcats are both part of the tidyverse, alongside ggplot
[1] "APPLE PIE" "BANANA SPLIT" "CHERRY TART" "APPLE CRUMBLE"
[5] "BANANA BREAD"
[1] "Apple Pie" "Banana Split" "Cherry Tart" "Apple Crumble"
[5] "Banana Bread"
[1] "Apple pie" "Banana split" "Cherry tart" "Apple crumble"
[5] "Banana bread"
str_to_lower() won’t have any effect here...
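The printed results above are consistent with the case-conversion functions in stringr. The code below is a minimal sketch of how they could be produced; the vector treats is an assumption reconstructed from the output, not taken from the original slides.

```r
library(stringr)  # also attached by library(tidyverse)

# Assumed input, reconstructed from the printed output
treats <- c("apple pie", "banana split", "cherry tart",
            "apple crumble", "banana bread")

upper    <- str_to_upper(treats)     # every letter in upper case
title    <- str_to_title(treats)     # first letter of each word in upper case
sentence <- str_to_sentence(treats)  # only the first letter of each string in upper case
lower    <- str_to_lower(treats)     # input is already lower case, so nothing changes

print(upper)
print(title)
print(sentence)
print(identical(lower, treats))
```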
in the last 3 positions
[1] FALSE TRUE FALSE FALSE TRUE
[1] "apple-pie" "banana-split" "cherry-tart" "apple-crumble"
[5] "banana-bread"
[1] "upple pie" "bunana split" "cherry turt" "upple crumble"
[5] "bunana bread"
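The two results above look like str_replace() at work: only the first match in each string is changed. As a sketch (the vector treats and the exact patterns are assumptions inferred from the output), this compares it with the _all version:

```r
library(stringr)

# Assumed input, reconstructed from the printed output
treats <- c("apple pie", "banana split", "cherry tart",
            "apple crumble", "banana bread")

# str_replace() changes only the first match in each string
hyphenated <- str_replace(treats, " ", "-")
first_a    <- str_replace(treats, "a", "u")

# str_replace_all() changes every match
every_a    <- str_replace_all(treats, "a", "u")

print(hyphenated)
print(first_a)
print(every_a)
```

Comparing first_a with every_a shows why "banana" becomes "bunana" in one case and "bununu" in the other.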
Several functions have _all versions (e.g. str_replace_all(), str_extract_all()) that act on every match, not just the first
str_extract_all() produces an R object known as a list
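To see why a list is needed, note that each string can contain a different number of matches. A small sketch, again assuming the treats vector used in the earlier output:

```r
library(stringr)

# Assumed input, reconstructed from the printed output
treats <- c("apple pie", "banana split", "cherry tart",
            "apple crumble", "banana bread")

# One list element per input string, each holding all matches for that string
vowels <- str_extract_all(treats, "[aeiou]")

print(class(vowels))    # the container is a list
print(vowels[[1]])      # matches for the first string, as a character vector
print(lengths(vowels))  # number of matches found in each string

# str_extract() returns only the first match, so a plain character vector suffices
first_vowel <- str_extract(treats, "[aeiou]")
print(first_vowel)
```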
Regular expressions (regex) exist in most languages (e.g. python, bash etc.)
R does have some unique syntax
Sets of characters are specified using []
[aeiou] would match any vowel
[abc] would match either a, b or c
[A-Z] for any uppercase letter
[a-z] for any lowercase letter
[0-9] for any digit
[:alpha:] matches any alphabetic character
[:alnum:] matches any alpha-numeric character
See ?regex for details
[1] "ppl p" "bnn splt" "chrry trt" "ppl crmbl" "bnn brd"
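The output above is what removing every vowel would give. The sketch below shows one way to reproduce it and tries a few of the other character sets; treats and the vector x are illustrative assumptions.

```r
library(stringr)

# Assumed input, reconstructed from the printed output
treats <- c("apple pie", "banana split", "cherry tart",
            "apple crumble", "banana bread")

# Remove every character belonging to the set [aeiou]
no_vowels <- str_remove_all(treats, "[aeiou]")
print(no_vowels)

# Hypothetical strings for trying other sets
x <- c("Room 12B", "level 3", "ASI")
has_upper <- str_detect(x, "[A-Z]")  # any uppercase letter?
has_digit <- str_detect(x, "[0-9]")  # any digit?

# The double-bracket form [[:alpha:]] is also valid in base R regex
first_word <- str_extract(x, "[[:alpha:]]+")

print(has_upper)
print(has_digit)
print(first_word)
```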
The regex wildcard is .
* has a different meaning here: it is a quantifier, not a wildcard
+ matches one or more times
* matches zero or more times
.+ \(\implies\) one or more wildcards
() “captures” the pattern
In R we return captures using \\1, \\2 etc
## Capture each word, then return in the opposite order
str_replace_all(treats, "([a-z]+) ([a-z]+)", "\\2 \\1")
[1] "pie apple" "split banana" "tart cherry" "crumble apple"
[5] "bread banana"
## Just return one of the words amongst other text
str_replace_all(treats, "([a-z]+) ([a-z]+)", "I'd like \\2 for dessert")
[1] "I'd like pie for dessert" "I'd like split for dessert"
[3] "I'd like tart for dessert" "I'd like crumble for dessert"
[5] "I'd like bread for dessert"
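The capture patterns above rely on + to match a whole word. As a small sketch of how the wildcard and the two quantifiers differ, using a hypothetical vector x:

```r
library(stringr)

# Hypothetical test strings
x <- c("ct", "cat", "caat", "c-t")

one_any   <- str_detect(x, "c.t")   # exactly one character of any kind between c and t
zero_plus <- str_detect(x, "ca*t")  # zero or more "a" between c and t
one_plus  <- str_detect(x, "ca+t")  # one or more "a" between c and t
any_plus  <- str_detect(x, "c.+t")  # one or more characters of any kind between c and t

print(one_any)
print(zero_plus)
print(one_plus)
print(any_plus)
```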
^ anchors a pattern to the start of a string
[1] TRUE TRUE TRUE TRUE TRUE
[1] TRUE FALSE FALSE TRUE FALSE
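The two logical vectors above match what an unanchored and an anchored search for "a" would return. A sketch under that assumption (treats is reconstructed, and the $ example is an addition for comparison):

```r
library(stringr)

# Assumed input, reconstructed from the printed output
treats <- c("apple pie", "banana split", "cherry tart",
            "apple crumble", "banana bread")

contains_a <- str_detect(treats, "a")   # "a" anywhere in the string
starts_a   <- str_detect(treats, "^a")  # "a" only at the start
ends_e     <- str_detect(treats, "e$")  # "e" only at the end

print(contains_a)
print(starts_a)
print(ends_e)
```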
^ inside [] negates the characters inside []
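Because ^ means different things inside and outside [], a short comparison may help. This is only a sketch, assuming the same treats vector as the earlier output:

```r
library(stringr)

# Assumed input, reconstructed from the printed output
treats <- c("apple pie", "banana split", "cherry tart",
            "apple crumble", "banana bread")

# [^aeiou] matches anything that is NOT a vowel (spaces included),
# so removing those matches leaves only the vowels
only_vowels <- str_remove_all(treats, "[^aeiou]")

# The first ^ is an anchor, the second ^ is a negation:
# "does the string start with something other than a vowel?"
starts_non_vowel <- str_detect(treats, "^[^aeiou]")

print(only_vowels)
print(starts_non_vowel)
```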
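The set of special characters (^, $, ., +, *, [, ], (, )) is very powerful.
Sometimes we need to match one of these literally, e.g. a . within a file name
Special characters are escaped using \\
\\1 returns captures instead of the value 1
To match . exactly, we use \\.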
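A short sketch of why the escape matters, using hypothetical file names. fixed() is shown as an alternative that treats the whole pattern literally.

```r
library(stringr)

# Hypothetical file names
files <- c("report.csv", "report_csv.txt", "notes.final.csv")

# Unescaped: . is a wildcard, so "_csv" also matches
wildcard_dot <- str_detect(files, ".csv")

# Escaped: only a literal . followed by csv matches
literal_dot <- str_detect(files, "\\.csv")

# fixed() switches off regex interpretation altogether
fixed_dot <- str_detect(files, fixed(".csv"))

# Escaped dot plus the $ anchor: swap the extension only at the end of the name
renamed <- str_replace(files, "\\.csv$", ".tsv")

print(wildcard_dot)
print(literal_dot)
print(fixed_dot)
print(renamed)
```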
Replace the 0 at the beginning of these phone numbers with +61
Note
Most challenges can be solved more than one way. The right way is the one that works!
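As an illustration of that point, here are three possible approaches to the phone number challenge. The numbers in phones are made up for this sketch, since the originals are not shown here, and this is not necessarily the intended solution.

```r
library(stringr)

# Hypothetical phone numbers, each starting with 0
phones <- c("0412 345 678", "08 8313 0000", "0400 000 000")

# Way 1: anchor the 0 to the start so that no other 0 is touched
way1 <- str_replace(phones, "^0", "+61")

# Way 2: drop the first character and paste the prefix on
# (this assumes every number really does start with 0)
way2 <- str_c("+61", str_sub(phones, 2))

# Way 3: capture everything after the leading 0 and return it with \\1
way3 <- str_replace(phones, "^0(.+)", "+61\\1")

print(way1)
print(identical(way1, way2))
print(identical(way1, way3))
```

Way 1 is the most defensive of the three, because a number that does not start with 0 is simply left unchanged.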
Factors in R with forcats
library(tidyverse)
tibble
as.factor() will set categories (i.e. levels) in alpha-numeric order
[1] High Med Low
Levels: High Low Med
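The output above shows the levels sorted as High, Low, Med, which is rarely the order wanted. A base R sketch, assuming a simple character vector x rather than the tibble column used in the slides:

```r
# Assumed input matching the printed factor
x <- c("High", "Med", "Low")

# Default behaviour: levels are sorted alpha-numerically
f_default <- as.factor(x)
print(levels(f_default))

# Setting the levels explicitly gives a meaningful order
f_manual <- factor(x, levels = c("Low", "Med", "High"))
print(levels(f_manual))

# The underlying integer codes follow the level order
print(as.integer(f_default))
print(as.integer(f_manual))
```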
forcats
as.factor() and factor() are in the package base
forcats provides as_factor() and fct()
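A brief comparison of the base and forcats constructors, as a sketch on an assumed character vector. Note that fct() needs forcats 1.0.0 or later.

```r
library(forcats)  # also attached by library(tidyverse)

# Assumed input
x <- c("High", "Med", "Low")

base_levels      <- levels(as.factor(x))  # sorted alpha-numerically
as_factor_levels <- levels(as_factor(x))  # order of first appearance
fct_levels       <- levels(fct(x))        # order of first appearance

# fct() also accepts explicit levels (requires forcats >= 1.0.0)
explicit_levels  <- levels(fct(x, levels = c("Low", "Med", "High")))

print(base_levels)
print(as_factor_levels)
print(fct_levels)
print(explicit_levels)
```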
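fct_infreq() orders levels by how often they occur
fct_rev() reverses the order of the levels
fct_lump(), fct_lump_n(), fct_lump_prop() lump infrequent levels together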
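A small sketch of these functions on made-up data. The factor f and its counts are hypothetical, chosen so that no two levels are tied.

```r
library(forcats)

# Hypothetical data: med x4, high x3, low x2, extreme x1
f <- factor(rep(c("low", "med", "high", "extreme"), times = c(2, 4, 3, 1)))
print(levels(f))  # default order is alpha-numeric

# Most frequent level first
by_freq <- fct_infreq(f)
print(levels(by_freq))

# Reverse that order, e.g. for a horizontal bar chart
by_freq_rev <- fct_rev(by_freq)
print(levels(by_freq_rev))

# Keep the 2 most frequent levels and lump the rest into "Other"
lumped <- fct_lump_n(f, n = 2)
print(table(lumped))
```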
forcats
NA values can be set to a specific level using fct_na_value_to_level()
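A minimal sketch on a hypothetical factor; the level name "Missing" is an arbitrary choice, and the function requires forcats 1.0.0 or later.

```r
library(forcats)

# Hypothetical factor with two missing values
f <- factor(c("High", NA, "Low", NA))
print(levels(f))  # NA is not a level by default

# Turn the missing values into an explicit level
f_explicit <- fct_na_value_to_level(f, level = "Missing")

print(levels(f_explicit))
print(table(f_explicit))
```

Making the missing values explicit means they show up in tables and plots instead of being silently dropped.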
stringr functions can be used to tidy levels:
fct_relabel(f, .fun)
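As a sketch of that idea, the function passed as .fun receives the vector of levels and returns tidied ones. The factor f and the helper tidy_level() are hypothetical.

```r
library(forcats)
library(stringr)

# Hypothetical factor with untidy level names
f <- factor(c("high_risk", "low_risk", "med_risk", "low_risk"))

# Hypothetical helper: swap underscores for spaces, then use title case
tidy_level <- function(lvl) {
  str_to_title(str_replace_all(lvl, "_", " "))
}

f_tidy <- fct_relabel(f, tidy_level)

print(levels(f_tidy))
print(as.character(f_tidy))
```

Only the labels change; the order of the levels and the values they are attached to stay the same.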