library(tidyverse)
Working With Text
RAdelaide 2025
Text Strings
Text Manipulation
- Start a new R script: text.R
- Next, create the vector we’ll mess around with
## Create a character vector for this session
x <- c("Hi Mum", "Hi Mother", "Hello Maternal Parent")
Regular Expressions
- We’re mostly familiar with words
- Regular Expressions (regexp) are incredibly powerful tools in this space
  - regexp syntax is not unique to R
  - R does have a few unique “quirks” though
- Will progress to categorical data \(\implies\) factors
Text Manipulation
- The package stringr contains functions for text manipulation
  - Key Functions: str_detect(), str_remove(), str_extract(), str_replace()
  - Alternatives to grepl(), grep(), gsub() etc. from base
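As a rough comparison of the stringr functions with their base alternatives, the sketch below runs the same simple patterns through both. It assumes the vector x from earlier in the session; the object names (detect_stringr, detect_base and so on) are only illustrative.

```r
library(stringr)

x <- c("Hi Mum", "Hi Mother", "Hello Maternal Parent")

## stringr takes the string first, then the pattern
detect_stringr <- str_detect(x, "Mat")
## base R takes the pattern first, then the string
detect_base <- grepl("Mat", x)

## str_replace_all() and gsub() both replace every match
replace_stringr <- str_replace_all(x, "H", "h")
replace_base <- gsub("H", "h", x)

## For simple patterns like these the results agree
identical(detect_stringr, detect_base)
identical(replace_stringr, replace_base)
```

The consistent string-first argument order is one reason the stringr versions tend to work smoothly with pipes.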
stringr::str_detect()
- str_detect() returns a logical vector \(\implies\) same length as the input vector
## Return a logical vector where 'x' matches the pattern 'Mat'
str_detect(string = x, pattern = "Mat")
- How would we search for either “Mat” or “Mot”?
- We can pass alternative sets of letters in square brackets []
## Use the alternative letters a|o between the M & t
str_detect(string = x, pattern = "M[ao]t")
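Because the result is a logical vector of the same length as the input, it can be used directly to subset or count. A minimal sketch, assuming the same vector x (the name is_match is illustrative):

```r
library(stringr)

x <- c("Hi Mum", "Hi Mother", "Hello Maternal Parent")

## One TRUE/FALSE per element of x
is_match <- str_detect(x, "M[ao]t")
is_match           # FALSE TRUE TRUE

## Keep only the matching elements
x[is_match]        # "Hi Mother" "Hello Maternal Parent"

## Count the matching elements (TRUE counts as 1)
sum(is_match)      # 2

## str_subset() wraps the detect-then-subset step
str_subset(x, "M[ao]t")
```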
stringr::str_remove()
- To remove words or patterns
str_remove(x, "M")
str_remove(x, "Hi ")
str_remove(x, " ")
- Why did that last one only remove the first space?
stringr::str_remove_all()
- str_remove() will only remove the first match
- str_remove_all() will remove all matches
str_remove_all(x, " ")
- Using regex syntax \(\implies\) pass sets of letters using [] syntax
## Remove all vowels
str_remove_all(x, "[aeiou]")
## Remove only the first vowel
str_remove(x, "[aeiou]")
stringr::str_extract()
- In regular expressions we can extend a match using + \(\implies\) match one or more characters
str_extract(x, "H[a-z]+")
- This will match the first H and then all following lower-case letters
  - The match stops at the whitespace \(\implies\) not in the set [a-z]
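To see how the contents of the set control where a match stops, the sketch below widens the set step by step. It assumes the same vector x; the object names are illustrative.

```r
library(stringr)

x <- c("Hi Mum", "Hi Mother", "Hello Maternal Parent")

## Lower-case letters only: stops at the first space
lower_only <- str_extract(x, "H[a-z]+")
lower_only     # "Hi" "Hi" "Hello"

## Add a space to the set: now stops at the upper-case M
with_space <- str_extract(x, "H[a-z ]+")
with_space     # "Hi " "Hi " "Hello "

## Add upper-case letters too: the match runs to the end of each string
with_upper <- str_extract(x, "H[A-Za-z ]+")
with_upper     # "Hi Mum" "Hi Mother" "Hello Maternal Parent"
```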
stringr::str_extract_all()
- str_extract() will only return the first match
str_extract(x, "[Hh].")
- str_extract_all() returns all matches
str_extract_all(x, "[Hh].")
- Note that now we have a list of the same length as x
  - Each element contains all matches within the initial string
- Will talk in detail about lists tomorrow
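Until lists are covered properly, one way to inspect this output is to count the matches per string, or to ask for a matrix instead of a list. A small sketch assuming the same vector x (the names matches and match_matrix are illustrative):

```r
library(stringr)

x <- c("Hi Mum", "Hi Mother", "Hello Maternal Parent")

## A list with one element per input string
matches <- str_extract_all(x, "[Hh].")
length(matches)    # 3, the same length as x

## Number of matches found in each string
lengths(matches)   # 1 2 1

## simplify = TRUE returns a character matrix, padded with ""
match_matrix <- str_extract_all(x, "[Hh].", simplify = TRUE)
match_matrix
```

Only “Hi Mother” contains a second match (the “he” inside “Mother”), which is why the lengths differ.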
Regular Expressions
Regular Expressions
- We’re mostly familiar with matching words
- regex allows more powerful matching
  - Can include wildcards (.)
  - Can specify sets of values ([a-z])
    - [A-Z] for upper-case
    - [0-9] for numbers
    - [:alnum:] represents all alpha-numeric characters (written inside a set, i.e. [[:alnum:]])
  - Can extend matches using + for one or more
    - None or more can be specified using *
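The pieces of syntax listed above can be tried one at a time. This is a minimal sketch assuming the same vector x; the vector samples is a made-up example, not part of the session data.

```r
library(stringr)

x <- c("Hi Mum", "Hi Mother", "Hello Maternal Parent")

## Wildcard: M followed by any single character
wildcard <- str_extract(x, "M.")
wildcard       # "Mu" "Mo" "Ma"

## * allows zero or more 'o', so a bare M still matches
zero_or_more <- str_extract(x, "Mo*")
zero_or_more   # "M" "Mo" "M"

## + requires at least one 'o', so some strings have no match
one_or_more <- str_extract(x, "Mo+")
one_or_more    # NA "Mo" NA

## Named classes go inside a set: [[:alnum:]]
first_word <- str_extract(x, "[[:alnum:]]+")
first_word     # "Hi" "Hi" "Hello"

## Hypothetical sample names, used to show the [0-9] set
samples <- c("sample12", "ctrl7")
str_extract(samples, "[0-9]+")   # "12" "7"
```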
stringr::str_view()
- We can check our matches in detail using str_view()
str_view(x, "r")
str_view(x, "r$")
str_view(x, "M[aeiou]t")
str_view(x, "^[^ ]+", match = NA, html = TRUE)
stringr::str_replace()
- str_replace() is used for extracting/modifying text strings
  - Even more powerful than str_extract()
str_replace(x, pattern = "Mum", replacement = "Dad")
- Searching the string “Hi Mum” for the pattern “Mum”, and
- Replacing the first instance of “Mum” with “Dad”
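As with str_remove(), only the first match is replaced unless the _all version is used. The sketch below shows the difference, plus one way str_replace() can act like an extraction: captured groups in round brackets are re-used in the replacement as \\1 and \\2. Capture groups are not introduced in these slides, so treat that last call as an optional extra; the object names are illustrative.

```r
library(stringr)

x <- c("Hi Mum", "Hi Mother", "Hello Maternal Parent")

## Only the first vowel in each string is replaced
first_only <- str_replace(x, "[aeiou]", "_")
first_only   # "H_ Mum" "H_ Mother" "H_llo Maternal Parent"

## Every vowel is replaced
all_vowels <- str_replace_all(x, "[aeiou]", "_")
all_vowels   # "H_ M_m" "H_ M_th_r" "H_ll_ M_t_rn_l P_r_nt"

## Capture the greeting (\\1) and the rest (\\2), then swap them
swapped <- str_replace(x, "^(H[a-z]+) (.+)$", "\\2, \\1")
swapped      # "Mum, Hi" "Mother, Hi" "Maternal Parent, Hello"
```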
Brief Summary
- str_detect() \(\implies\) logical vector
- str_remove() / str_remove_all() \(\implies\) remove matching patterns
- str_extract() \(\implies\) extract matching patterns
- str_replace() / str_replace_all() \(\implies\) modify a character vector
Additional Functions
More Helpful Functions
str_count(x, "[Mm]")
str_length(x)
str_to_lower(x)
str_to_upper(x)
str_split_fixed(x, pattern = " ", n = 2)
str_wrap(x, width = 8)
str_starts(x, "Hi")
str_ends(x, "[rt]")
str_flatten(x, collapse = "; ")
str_trunc(x, width = 7)
str_to_title("a bad example")
Additional Tools and Tricks
- The function paste() is a very useful one
  - The default separator is " "
  - paste0() has the default separator as ""
paste(x, "How are you?")
paste(x, "How are you?", sep = ". ")
paste0(x, "!")
paste(x, collapse = "! ")
Working With Strings
- Is an incredibly common and important part of working with R
- Extract sample IDs from file names
- Pull key information from columns
- Remove prefixes/suffixes
- Correct data entry errors
- Format for pretty plotting
Challenges: Slide 1
- Given a vector of Transcript IDs with versions, remove the version number
ids <- c("ENST00000376207.10", "ENST00000376199.7")
- Add the ‘chr’ prefix to these chromosomes
chr <- c(1:22, "X", "Y", "M")
- Pull the chromosome out of these cytogenetic bands
cyto <- c("Xp11.23", "11q2.3", "2p7.1")
- Change these phone numbers to start with +61 instead of 0
phones <- c("0499123456", "0498760432")
Challenges: Slide 2
- Remove the suffix “.bam” from these filenames
bams <- c("rna_bamboo1.bam", "rna_rice1.bam", "rna_wheat1.bam")
- Correct the responses to be consistent (choose the format)
response <- c("Y", "yes", "No", "no")
- Correct these recorded values to be consistently M/F or Male/Female
sex <- c("M", "male", "femal", "Female")
Factors
Factors
A common data type in statistics is a categorical variable (i.e. a factor)
- Can appear to be a character vector/column
  - Can easily trip an unsuspecting data scientist up
- Data will be a set of common groups/categories
pet_vec <- c("Dog", "Dog", "Cat", "Dog", "Bird", "Bird")
- This is a character vector
A Potential Pitfall
What would happen if we think a factor is a character, and we use it to select values from a vector/matrix/data.frame?
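pet_factors <- as.factor(pet_vec) # Assumed definition: pet_factors is not created anywhere in this text
names(pet_vec) <- pet_vec # Set the names as equal to the values
pet_vec[as.character(pet_factors)]
pet_vec[pet_factors]
pet_vec[as.integer(pet_factors)]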
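The reason for the surprise is that a factor is stored as integer codes plus a set of levels, and subsetting with a factor uses the codes. The sketch below spells this out. It assumes pet_factors was created with as.factor(pet_vec), which is not shown in this text; by_name and by_code are illustrative names.

```r
pet_vec <- c("Dog", "Dog", "Cat", "Dog", "Bird", "Bird")
## Assumption: this is how pet_factors was defined
pet_factors <- as.factor(pet_vec)
names(pet_vec) <- pet_vec

## Levels are sorted alpha-numerically by default
levels(pet_factors)       # "Bird" "Cat" "Dog"

## Each value is stored as the position of its level
as.integer(pet_factors)   # 3 3 2 3 1 1

## Selecting by name returns the values we expect
by_name <- unname(pet_vec[as.character(pet_factors)])
by_name   # "Dog" "Dog" "Cat" "Dog" "Bird" "Bird"

## Selecting with the factor itself uses positions 3, 3, 2, 3, 1, 1
by_code <- unname(pet_vec[pet_factors])
by_code   # "Cat" "Cat" "Dog" "Cat" "Dog" "Dog"
```

No error or warning is produced in the second case, which is what makes this mistake easy to miss.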
- read_csv() and other readr functions always parse text as a character
  - Older versions of read.csv() parsed text to factors by default
    - Changed with R \(\geq\) v4.0.0
- If I want a factor, I explicitly make a factor
  - During statistical analysis character vectors are always coerced
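A small sketch of the difference, using base read.csv() on some made-up inline text rather than a real file. The data and object names are hypothetical, and the first call only returns a character column on R 4.0.0 or later.

```r
## Hypothetical CSV content, supplied as text instead of a file
csv_text <- "pet,age\nDog,3\nCat,5\nDog,1"

## Default behaviour in R >= 4.0.0: text stays as character
pets_default <- read.csv(text = csv_text)
class(pets_default$pet)

## The old default can still be requested explicitly
pets_old <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(pets_old$pet)       # "factor"

## Making the factor explicitly gives control over the level order
pet_explicit <- factor(pets_default$pet, levels = c("Dog", "Cat"))
levels(pet_explicit)      # "Dog" "Cat"
```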
The package forcats
- forcats is a part of the core tidyverse
  - Specifically for wrangling factors
  - Also plays very nicely with stringr
- as.factor() and factor(levels = ...) are base functions
  - Most forcats functions start with fct_ or use _
- as_factor() parallels as.factor()
  - But uses the order of appearance, not alpha-numeric sorting
- fct() replicates factor() with stricter error handling
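The differences described above can be checked directly. This sketch assumes forcats version 1.0.0 or later (where fct() is available) and re-creates pet_vec so it runs on its own; the other object names are illustrative.

```r
library(forcats)

pet_vec <- c("Dog", "Dog", "Cat", "Dog", "Bird", "Bird")

## base: levels are sorted alpha-numerically
base_levels <- levels(as.factor(pet_vec))
base_levels      # "Bird" "Cat" "Dog"

## forcats: levels follow the order of appearance
forcats_levels <- levels(as_factor(pet_vec))
forcats_levels   # "Dog" "Cat" "Bird"

## factor() silently converts values missing from 'levels' to NA
base_result <- factor(pet_vec, levels = c("Dog", "Cat"))
sum(is.na(base_result))   # 2, both "Bird" values are lost

## fct() raises an error instead; tryCatch() is only here to capture it
fct_result <- tryCatch(
  fct(pet_vec, levels = c("Dog", "Cat")),
  error = function(e) "error"
)
fct_result       # "error"
```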
Some Handy Tricks
- fct_inorder() sets categories in the order they appear
  - Sort your data.frame then apply fct_inorder() for nice structured plots
fct_inorder(pet_vec)
- fct_infreq() sets categories by their frequency
fct_infreq(pet_vec)
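The “sort, then fct_inorder()” trick might look like the sketch below. The summary table pets is made up for illustration, and the column names are assumptions.

```r
library(dplyr)
library(forcats)

pet_vec <- c("Dog", "Dog", "Cat", "Dog", "Bird", "Bird")

## fct_infreq(): most common category first
freq_levels <- levels(fct_infreq(pet_vec))
freq_levels      # "Dog" "Bird" "Cat"

## Hypothetical summary table of counts per pet
pets <- data.frame(pet = c("Dog", "Cat", "Bird"), n = c(3, 1, 2))

## Sort the rows, then lock that order in as the factor levels
pets_sorted <- pets %>%
  arrange(desc(n)) %>%
  mutate(pet = fct_inorder(pet))
levels(pets_sorted$pet)   # "Dog" "Bird" "Cat"
```

Plotting functions generally draw categories in level order, so the sorted order carries through to the axis.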