Working With Text

RAdelaide 2025

Dr Stevie Pederson

Black Ochre Data Labs
The Kids Research Institute Australia

July 9, 2025

Text Strings

Text Manipulation

  • Start a new R script: text.R
library(tidyverse)
  • Next, create the vector we’ll mess around with
## Create a character vector for this session
x <- c("Hi Mum", "Hi Mother", "Hello Maternal Parent")

Text Manipulation

  • Working with character vectors
  • One of the most common and regular tasks
    • Cleaning up column names
    • Cleaning up data
    • Tidying up text on plots
  • Particularly relevant when working with IDs
    • May appear as Run 1 in one file and Run_001 in another
  • Data providers often have their ID formats and edit the ones provided
    • Need to identify & extract/modify

Regular Expressions

  • We’re mostly familiar with words
    • Regular Expressions (regexp) are incredibly powerful tools in this space
    • regexp syntax is not unique to R
    • R does have a few unique “quirks” though
  • Will progress to categorical data \(\implies\) factors

Text Manipulation

  • The package stringr contains functions for text manipulation
  • Key Functions:
    • str_detect()
    • str_remove()
    • str_extract()
    • str_replace()
  • Alternatives to grepl(), grep(), gsub() etc. from base

stringr::str_detect()

  • str_detect() returns a logical vector \(\implies\) same length as the input vector
## Return a logical vector where 'x' matches the pattern 'Mat'
str_detect(string = x, pattern = "Mat")
  • How would we search for either “Mat” or “Mot”?
  • We can pass alternative sets of letters in square brackets []
## Use the alternative letters a|o between the M & t
str_detect(string = x, pattern = "M[ao]t")

stringr::str_detect()

  • When using regular expressions . functions as a wildcard
    • * has a different meaning
## Return a logical vector where 'x' matches the pattern 'M?t'
str_detect(string = x, pattern = "M.t")

stringr::str_remove()

  • To remove words or patterns
str_remove(x, "M")
str_remove(x, "Hi ")
str_remove(x, " ")
  • Why did that last one only remove the first space?

stringr::str_remove_all()

  • str_remove() will only remove the first match
  • str_remove_all() will remove all matches
str_remove_all(x, " ")
  • Using regex syntax \(\implies\) pass sets of letters using [] syntax
## Remove all vowels
str_remove_all(x, "[aeiou]")
## Remove only the first vowel
str_remove(x, "[aeiou]")

stringr::str_remove_all()

  • These allow us to specify wild cards more carefully
str_remove_all(x, "M[aeiou]t")
str_remove_all(x, "[ae]r")
  • Beyond removing prefixes/suffixes, removing strings has limited use
  • str_extract() can be more useful
str_extract(x, "H[aeiou]")
str_extract(x, "H[a-z]")
  • These look very similar \(\implies\) second opens possibilities

stringr::str_extract()

  • In regular expressions we can extend a match using +
    \(\implies\) match one or more characters
str_extract(x, "H[a-z]+")
  • This will match the first H and then all following lower-case letters
    • The match stops at the whitespace \(\implies\) not in the set [a-z]

stringr::str_extract()

  • Repeat the match starting with “M”
str_extract(x, "M[a-z]+")
  • If we change the lower-case set to a . \(\implies\) match anything
str_extract(x, "M.+")

stringr::str_extract()

  • We can also specify the exact number of times for a match
str_extract(x, "H.+ M[a-z]{2}")
str_extract(x, "H.+ M[a-z]{3}")
  • The second pattern expects a 4 letter word starting with M
    • Returns NA if no match

stringr::str_extract_all()

  • str_extract() will only return the first match
str_extract(x, "[Hh].")
  • str_extract_all() returns all matches
str_extract_all(x, "[Hh].")
  • Note that now we have a list of the same length as x
    • Each element contains all matches within the initial string

Regular Expressions

Regular Expressions

  • We’re mostly familiar with matching words
    • regex allows more powerful matching
  • Can include wildcards (.)
  • Can specify sets of values ([a-z])
    • [A-Z] for upper-case
    • [0-9] for numbers
    • [:alnum:] represents all alpha-numeric characters
  • Can extend matches using + for one or more
    • None or more can be specified using *

Regular Expressions

  • In logical testing, the symbol | means OR
  • Can also be incorporated into patterns
## Extract the pattern 'Hi' or 'Hello'
str_extract(x, "Hi|Hello")
## Extract the pattern 'Mum', 'Mother' or 'Maternal'
str_extract(x, "Mum|Mother|Maternal")

Regular Expressions

  • Also allows for patterns to be anchored
    • ^ anchors a match to the start
    • $ anchors a match to the end
## Extract the first word
str_extract(x, "^[:alnum:]+")
## Or the last word
str_extract(x, "[:alnum:]+$")

Regular Expressions

  • To thoroughly confuse everyone: ^ has a second meaning
    • Inside [] it means not
## Match a pattern at the end which doesn't contain a space
str_extract(x, "[^ ]+$")
  • Regular expressions are fun to write
    \(\implies\)horrible to read back!

stringr::str_view()

  • We can check our matches in detail using str_view()
str_view(x, "r")
str_view(x, "r$")
str_view(x, "M[aeiou]t")
str_view(x, "^[^ ]+", match = NA, html = TRUE)

stringr::str_replace()

  • str_replace() is used for extracting/modifying text strings
    • Even more powerful than str_extract()
str_replace(x, pattern = "Mum", replacement = "Dad")
  1. Searching the string “Hi Mum” for the pattern “Mum”, and
  2. Replacing the first instance of “Mum” with “Dad”

stringr::str_replace()

  • Wildcards and character sets work in the exact same manner
str_replace(x, "M[a-z]", "Da")
str_replace(x, "M.{2}", "Dad")
str_replace(x, "M.+", "Dad")

stringr::str_replace()

  • The use of capturing patterns makes this extremely flexible
  • We can capture words/phrases/patterns using (pattern) inside braces
    • Captured patterns are able to be returned in numeric order of capture
    • In the following, we capture only one pattern
str_replace(x, "H.+ (.+)", "\\1")
  • Now let’s capture two patterns
str_replace(x, "(H.+) (M.+)", "\\2! \\1!")

stringr::str_replace()

  • str_replace() only replaces the first match in a string
  • str_replace_all() replaces all matches
str_replace(x, "[Mm]", "b")
str_replace_all(x, "[Mm]", "b")
str_replace_all(x, "[a-z]", "*")

Brief Summary

  • str_detect() \(\implies\) logical vector
  • str_remove() / str_remove_all() \(\implies\) remove matching patterns
  • str_extract() \(\implies\) extract matching patterns
  • str_replace() / str_replace_all() \(\implies\) modify a character vector

Brief Summary

  • All regex based operations
  • Can all be piped
c("M", "F", "MAle", "Female") |> str_extract("^[MF]")
  • Regular Expressions are very powerful
  • Horrible to read
  • Exist in all languages \(\implies\) not specific to R

Additional Functions

More Helpful Functions

str_count(x, "[Mm]")
str_length(x)
str_to_lower(x)
str_to_upper(x)
str_split_fixed(x, pattern = " ", n = 2)
str_wrap(x, width = 8)
str_starts(x, "Hi")
str_ends(x, "[rt]")
str_flatten(x, collapse = "; ")
str_trunc(x, width = 7)
str_to_title("a bad example")

More Helpful Functions

  • Pseudo-numeric strings are also handled well
str_pad(c("1", "10", "100"), width = 3, pad = "0")
str_sort(c("1", "10", "2"))
str_sort(c("1", "10", "2"), numeric = TRUE)

Additional Tools and Tricks

  • The function paste() is a very useful one
    • The default separator is " "
    • paste0() has the default separator as ""
paste(x, "How are you?")
paste(x, "How are you?", sep = ". ")
paste0(x, "!")
paste(x, collapse = "! ")

Additional Tools and Tricks

  • The package glue has revolutionised text manipulation
    • We can pass R objects or function calls to the middle of a text string
    • We do need to be careful with quotation marks here
library(glue)
glue("When they answered, I said '{x}!'")
glue("I call them {str_remove(x, 'H.+ ')}")
glue_collapse(letters, sep = ", ", last = " & ")
  • Output is of class glue
    • Coerces back to character
    • Plays very well with advanced tidyverse syntax (e.g. rlang)

Working With Strings

  • Is an incredibly common and important part of working with R
  • Extract sample IDs from file names
  • Pull key information from columns
  • Remove prefixes/suffixes
  • Correct data entry errors
  • Format for pretty plotting

Challenges: Slide 1

  1. Given a vector of Transcript IDs with versions, remove the version number?
ids <- c("ENST00000376207.10", "ENST00000376199.7")
  1. Add the ‘chr’ prefix to these chromosomes
chr <- c(1:22, "X", "Y", "M")
  1. Pull the chromosome out of these cytogenetic bands
cyto <- c("Xp11.23", "11q2.3", "2p7.1")
  1. Change these phone numbers to start with +61 instead of 0
phones <- c("0499123456", "0498760432")

Challenges: Slide 2

  1. Remove the suffix “.bam” from these filenames
bams <- c("rna_bamboo1.bam", "rna_rice1.bam", "rna_wheat1.bam")
  1. Correct the responses to be consistent (choose the format)
response <- c("Y", "yes", "No", "no")
  1. Correct these recorded values to be consistently M/F or Male/Female
sex <- c("M", "male", "femal", "Female")

Factors

Factors

A common data type in statistics is a categorical variable (i.e. a factor)

  • Can appear to be a character vector/column
    • Can easily trip an unsuspecting data scientist up
  • Data will be a set of common groups/categories
pet_vec <- c("Dog", "Dog", "Cat", "Dog", "Bird", "Bird")
  • This is a character vector

Factors

  • We can simply coerce this to a vector of factors
  • Categories will automatically be assigned alphabetically using as.factor()
pet_factors <- as.factor(pet_vec)
pet_factors

We can manually set these categories as levels using factor()

pet_factors <- factor(pet_vec, levels = c("Dog", "Cat", "Bird"))
pet_factors

Factors

  • These are actually stored as integers
  • Each integer corresponds to a level
str(pet_factors)
as.integer(pet_factors)
as.character(pet_factors)

The package forcats

  • forcats is a part of the core tidyverse
    • Specifically for wrangling factors
    • Also plays very nicely with stringr
  • as.factor() and factor(levels = ...) are base functions
  • Most forcats functions start with fct_ or use _
  • as_factor() parallels as.factor()
    • But uses the order of appearance, not alpha-numeric sorting
  • fct() replicates factor() with stricter error handling

Some Handy Tricks

  • fct_inorder() sets categories in the order they appear
    • Sort your data.frame then apply fct_inorder() for nice structured plots
fct_inorder(pet_vec)
  • fct_infreq() sets categories by their frequency
fct_infreq(pet_vec)

Some Handy Tricks

  • Collapse categories with fewer than n entries
fct_lump_min(pet_vec, min = 3)
  • Collapse categories with fewer than p entries
fct_lump_prop(pet_vec, prop = 0.3)
  • Reverse the order (will automatically coerce to a factor)
fct_rev(pet_vec)

Some Handy Tricks

  • Relabelling factors can take advantage of stringr
pet_vec |>
  as_factor() |>
  fct_relabel(.fun = str_to_lower)
  • We’ll learn more about tailoring the functions (i.e. inline functions) soon

Some Handy Tricks

  • We can combine multiple factors using fct_cross()
sz <- c("big", "small", "small", "big", "small", "tiny") |>
  factor(levels = c("tiny", "small", "big"))
pet_vec |>
  as_factor() |>
  fct_cross(sz)