library(tidyverse)
Working With Text
RAdelaide 2024
Text Strings
Text Manipulation
- Start a new R script:
text.R
- Next, create the vector we’ll mess around with
<- c("Hi Mum", "Hi Mother", "Hello Parent1") x
Text Manipulation
- Working with
character
vectors - One of the most common and regular tasks
- Cleaning up column names
- Cleaning up data
- Tidying up text on plots
. . .
- Particularly relevant when working with IDs
- May appear as
Run 1
in one file andRun_001
in another
- May appear as
- Data providers often have their ID formats and edit the ones provided
- Need to identify & extract/modify
Regular Expressions
- We’re mostly familiar with words
- Regular Expressions (
regexp
) are incredibly powerful tools in this space regexp
syntax is not unique toR
R
does have a few unique “quirks” though
- Regular Expressions (
. . .
- Will progress to categorical data \(\implies\) factors
Text Manipulation
- The package
stringr
contains functions for text manipulation - Key Functions:
str_detect()
str_remove()
str_extract()
str_replace()
- Alternatives to
grepl()
,grep()
,gsub()
etc. frombase
stringr::str_detect()
str_detect()
returns a logical vector- Same length as the input vector
str_detect(string = x, pattern = "Mum")
stringr::str_detect()
- In the above we used basic words for the search pattern
- Regular expressions use a
.
as a wild card*
has different meaning to many other contexts
str_detect(x, "Mu")
str_detect(x, "M.")
. . .
- The wild-card value
.
obviously needed to followM
in this search
stringr::str_detect()
- Alternative individual characters can be specified within
[]
str_detect(x, "M[ou]")
- Here either an
o
oru
needed to follow theM
. . .
- Alternative sequences of letters can also be specified using
|
|
is common across many languages forOR
str_detect(x, "Mother|Parent")
stringr::str_detect()
- We can also anchor patterns to:
- The start of a stringr (
^
) - The end of a string (
$
)
- The start of a stringr (
str_detect(x, "r")
str_detect(x, "r$")
str_detect(x, "^Hi ")
stringr::str_view()
- We can check our matches in detail using
str_view()
str_view(x, "r")
str_view(x, "r$")
str_view(x, "M[ou]")
str_view(x, "M[ou]", match = NA, html = TRUE)
stringr::str_extract()
- We can use
str_extract()
to extract patterns
str_extract(string = x, pattern = "Hi")
. . .
- The same approach to alternative sequences will also work
str_extract(x, "Mother|Parent")
- If no pattern is found
NA
is returned- This can be helpful if no matches are found
stringr::str_extract()
- Wildcards can also be used
str_extract(x, "Hi M.")
str_extract(x, "Hi [MP].")
. . .
- We can be extremely flexible with characters in the
[]
- In the following, we’re asking for any upper-case letter
str_extract(x, "Hi [A-Z].")
stringr::str_extract()
- We can extend the wildcard using
+
- This means ’repeat the previous pattern 1 or more times`
- .` is a wildcard \(\implies\) repeat the wildcard (not the match) 1 or more times
str_extract(x, "Hi [A-Z].")
str_extract(x, "H.+ [A-Z].")
. . .
- We could restrict the wildcard to be only lower-case letters
str_extract(x, "H.+ [A-Z][a-z]+")
stringr::str_extract()
- Or just be even more general using
[:alpha:]
str_extract(x, "H.+ [:alpha:]+")
- The complete list of classes is at
?base::regex
. . .
- We can also specify the exact number of times for a match
str_extract(x, "H.+ [:alpha:]{3}")
str_extract(x, "H.+ [:alpha:]{4}")
stringr::str_extract_all()
str_extract()
will only return the first match
str_extract(x, "[Hh].")
. . .
str_extract_all()
returns all matches
str_extract_all(x, "[Hh].")
. . .
- Note that now we have a list of the same length as
x
- Each element contains all matches within the initial string
Will talk in detail about lists tomorrow
stringr::str_remove()
- We can use the above tricks to remove patterns within our text
str_remove(x, "Hi ")
str_remove(x, "[aeiou]")
. . .
- Notice that only the first match was removed
str_remove_all()
will remove all occurences
str_remove_all(x, "[aeiou]")
. . .
Very useful for removing file suffixes etc
stringr::str_replace()
str_replace()
is used for extracting/modifying text strings- Even more powerful than
str_extract()
- Even more powerful than
str_replace(x, pattern = "Mum", replacement = "Dad")
- Searching the
string
“Hi Mum” for thepattern
“Mum”, and - Replacing the first instance of “Mum” with “Dad”
stringr::str_replace()
- Wildcards and character sets work in the exact same manner
str_replace(x, "M[a-z]", "Da")
str_replace(x, "[MP].{2}", "Dad")
str_replace(x, "[MP].+", "Dad")
stringr::str_replace()
- The use of capturing patterns makes this extremely flexible
- We can capture words/phrases/patterns using
(pattern)
- Captured patterns are able to be returned in numeric order of capture
- In the following, we capture only one pattern
str_replace(x, "H.+ (.+)", "\\1")
. . .
- Now let’s capture two patterns
str_replace(x, "(H.+) ([:alpha:]+)", "\\2! \\1!")
stringr::str_replace()
- To tidy up the final digit use
*
- Like
+
except the match is zero or more times
- Like
str_replace(x, "(H.+) ([:alpha:]+)[0-9]*", "\\2! \\1!")
stringr::str_replace()
str_replace()
only replaces the first match in a stringstr_replace_all()
replaces all matches
str_replace(x, "[Mm]", "b")
str_replace_all(x, "[Mm]", "b")
More Helpful Functions
str_remove(x, "Hi ")
str_count(x, "[Mm]")
str_length(x)
str_to_lower(x)
str_to_upper(x)
str_split_fixed(x, pattern = " ", n = 2)
str_wrap(x, width = 8)
str_starts(x, "Hi")
str_ends(x, "[:alpha:]")
str_flatten(x, collapse = "; ")
str_trunc(x, width = 7)
str_to_title("a bad example")
. . .
- Pseudo-numeric strings are also handled well
str_pad(c("1", "10", "100"), width = 3, pad = "0")
str_sort(c("1", "10", "2"))
str_sort(c("1", "10", "2"), numeric = TRUE)
Additional Tools and Tricks
- The function
paste()
is a very useful one- The default separator is
" "
paste0()
has the default separator as""
- The default separator is
paste(x, "How are you?")
paste(x, "How are you?", sep = ". ")
paste0(x, "!")
paste(x, collapse = "! ")
Additional Tools and Tricks
- The package
glue
has revolutionised text manipulation- We can pass R objects or function calls to the middle of a text string
- We do need to be careful with quotation marks here
library(glue)
glue("When they answered, I said '{x}!'")
glue("I call them {str_remove(x, 'H.+ ')}")
glue_collapse(letters, sep = ", ", last = " & ")
. . .
- Output is of class
glue
- Coerces back to
character
- Plays very well with advanced
tidyverse
syntax (e.g.rlang
)
- Coerces back to
Working With Strings
- Is an incredibly common and important part of working with R
- Extract sample IDs from file names
- Pull key information from columns
- Remove prefixes/suffixes
- Correct data entry errors
- Format for pretty plotting
Factors
Factors
A common data type in statistics is a categorical variable (i.e. a factor
)
- Can appear to be a
character
vector/column- Can easily trip an unsuspecting analyst up
- Data will be a set of common groups/categories
<- c("Dog", "Dog", "Cat", "Dog", "Bird", "Bird") pet_vec
- This is a
character
vector
Factors
- We can simply coerce this to a vector of factors
- Categories will automatically be assigned alphabetically using
as.factor()
<- as.factor(pet_vec)
pet_factors pet_factors
We can manually set these categories as levels
using factor()
<- factor(pet_vec, levels = c("Dog", "Cat", "Bird"))
pet_factors pet_factors
Factors
- These are actually stored as integers
- Each integer corresponds to a
level
str(pet_factors)
as.integer(pet_factors)
as.character(pet_factors)
A Potential Pitfall
What would happen if we think a factor
is a character
, and we use it to select values from a vector
/matrix
/data.frame
?
. . .
names(pet_vec) <- pet_vec
pet_vec pet_vec[pet_factors]
. . .
read_csv()
and otherreadr
functions always parse text as acharacter
- Older versions of
read.csv()
parsed text to factors by default - Changed with R \(\geq\) v4.0.0
- Older versions of
- If I want a
factor
, I explicitly make afactor
- During statistical analysis
character
vectors are always coerced
- During statistical analysis
The package forcats
forcats
is a part of the coretidyverse
- Specifically for wrangling
factors
- Also plays very nicely with
stringr
- Specifically for wrangling
Some Handy Tricks
fct_inorder()
sets categories in the order they appear- Sort your
data.frame
then applyfct_inorder()
for nice structured plots
- Sort your
fct_inorder(pet_vec)
. . .
fct_infreq()
sets categories by their frequency
fct_infreq(pet_vec)
Some Handy Tricks
- Collapse categories with fewer than
n
entries
fct_lump_min(pet_vec, min = 3)
. . .
- Collapse categories with fewer than
p
entries
fct_lump_prop(pet_vec, prop = 0.3)
. . .
- Reverse the order (will automatically coerce to a factor)
fct_rev(pet_vec)