Working With Text

RAdelaide 2025

Dr Stevie Pederson

Black Ochre Data Labs
The Kids Research Institute Australia

July 9, 2025

Text Strings

Text Manipulation

Start a new R script: text.R

library(tidyverse)

Next, create the vector we’ll mess around with

## Create a character vector for this session
x <- c("Hi Mum", "Hi Mother", "Hello Maternal Parent")

Text Manipulation

Working with character vectors
One of the most common and regular tasks
- Cleaning up column names
- Cleaning up data
- Tidying up text on plots

Particularly relevant when working with IDs
- May appear as Run 1 in one file and Run_001 in another
Data providers often have their own ID formats and may edit the ones provided
- Need to identify & extract/modify
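
As a rough sketch of the ID problem above, the following reconciles two made-up ID styles by extracting the run number and zero-padding it. The vectors `ids_a` and `ids_b`, and the choice of Run_001 as the target format, are assumptions for illustration only; the functions used are covered later in this session.

```r
library(stringr)

## Hypothetical IDs for the same runs, as supplied in two different files
ids_a <- c("Run 1", "Run 2", "Run 10")
ids_b <- c("Run_001", "Run_002", "Run_010")

## Extract the digits, pad them to width 3, then rebuild in a single format
run_num <- str_extract(ids_a, "[0-9]+")
ids_a_fixed <- paste0("Run_", str_pad(run_num, width = 3, pad = "0"))
ids_a_fixed

## Check that both files now use identical IDs
identical(ids_a_fixed, ids_b)
```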

Regular Expressions

We’re mostly familiar with matching exact words
- Regular Expressions (regexp) are incredibly powerful tools in this space
- regexp syntax is not unique to R
- R does have a few unique “quirks” though

Will progress to categorical data $\implies$ factors

Text Manipulation

The package stringr contains functions for text manipulation
Key Functions:
- str_detect()
- str_remove()
- str_extract()
- str_replace()
Alternatives to grepl(), grep(), gsub() etc. from base

`stringr::str_detect()`

str_detect() returns a logical vector $\implies$ same length as the input vector

## Return a logical vector where 'x' matches the pattern 'Mat'
str_detect(string = x, pattern = "Mat")

How would we search for either “Mat” or “Mot”?
We can pass alternative sets of letters in square brackets []

## Use the alternative letters a|o between the M & t
str_detect(string = x, pattern = "M[ao]t")

`stringr::str_detect()`

In regular expressions, . functions as a wildcard matching any single character
- * has a different meaning (zero or more of the preceding element)

## Return a logical vector where 'x' contains 'M', any single character, then 't'
str_detect(string = x, pattern = "M.t")
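
To make the difference between . and * concrete, here is a minimal sketch using a made-up vector `y` (not part of the session data). It also shows one way to match a literal full stop by escaping it.

```r
library(stringr)

## Hypothetical test strings
y <- c("Mat", "Mt", "M.t", "Mount")

## '.' matches exactly one character of any kind
one_char <- str_detect(y, "M.t")
one_char

## '.*' matches zero or more characters of any kind
any_chars <- str_detect(y, "M.*t")
any_chars

## Escape the dot with a double backslash to match a literal full stop
literal_dot <- str_detect(y, "M\\.t")
literal_dot
```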

`stringr::str_remove()`

To remove words or patterns

str_remove(x, "M")
str_remove(x, "Hi ")
str_remove(x, " ")

Why did that last one only remove the first space?

`stringr::str_remove_all()`

str_remove() will only remove the first match
str_remove_all() will remove all matches

str_remove_all(x, " ")

Using regex syntax $\implies$ pass sets of letters using [] syntax

## Remove all vowels
str_remove_all(x, "[aeiou]")
## Remove only the first vowel
str_remove(x, "[aeiou]")

`stringr::str_remove_all()`

Character sets allow us to specify wildcards more carefully

str_remove_all(x, "M[aeiou]t")
str_remove_all(x, "[ae]r")

Beyond removing prefixes/suffixes, removing strings has limited use
str_extract() can be more useful

str_extract(x, "H[aeiou]")
str_extract(x, "H[a-z]")

These look very similar $\implies$ the second opens up more possibilities

`stringr::str_extract()`

In regular expressions we can extend a match using +
$\implies$ match the preceding character or set one or more times

str_extract(x, "H[a-z]+")

This will match the first H and then all following lower-case letters
- The match stops at the whitespace $\implies$ not in the set [a-z]

`stringr::str_extract()`

Repeat the match starting with “M”

str_extract(x, "M[a-z]+")

If we change the lower-case set to a . $\implies$ match anything

str_extract(x, "M.+")

`stringr::str_extract()`

We can also specify the exact number of times for a match

str_extract(x, "H.+ M[a-z]{2}")
str_extract(x, "H.+ M[a-z]{3}")

The second pattern requires an M followed by three lower-case letters, i.e. a word of at least 4 letters starting with M
- Returns NA if no match
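
Counts in curly braces can also be given as ranges. The following sketch reuses the session vector `x` and compares an exact count, an open-ended minimum and a bounded range; the specific counts are arbitrary choices for illustration.

```r
library(stringr)
x <- c("Hi Mum", "Hi Mother", "Hello Maternal Parent")

## {2}: exactly two lower-case letters after the M
exact_two <- str_extract(x, "M[a-z]{2}")
exact_two

## {3,}: three or more lower-case letters after the M
three_plus <- str_extract(x, "M[a-z]{3,}")
three_plus

## {2,4}: between two and four lower-case letters after the M
two_to_four <- str_extract(x, "M[a-z]{2,4}")
two_to_four
```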

`stringr::str_extract_all()`

str_extract() will only return the first match

str_extract(x, "[Hh].")

str_extract_all() returns all matches

str_extract_all(x, "[Hh].")

Note that now we have a list of the same length as x
- Each element contains all matches within the initial string
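
Because the result is a list, a couple of base R helpers are handy for inspecting it. This is only a sketch of one possible way to summarise the matches, reusing the session vector `x`.

```r
library(stringr)
x <- c("Hi Mum", "Hi Mother", "Hello Maternal Parent")

matches <- str_extract_all(x, "[Hh].")

## Number of matches found within each element of x
n_matches <- lengths(matches)
n_matches

## Flatten every match into a single character vector
all_matches <- unlist(matches)
all_matches
```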

Regular Expressions

We’re mostly familiar with matching words
- regex allows more powerful matching
Can include wildcards (.)
Can specify sets of values ([a-z])
- [A-Z] for upper-case
- [0-9] for numbers
- [:alnum:] represents all alpha-numeric characters
Can extend matches using + for one or more
- Zero or more can be specified using *

Regular Expressions

In logical testing, the symbol | means OR
Can also be incorporated into patterns

## Extract the pattern 'Hi' or 'Hello'
str_extract(x, "Hi|Hello")
## Extract the pattern 'Mum', 'Mother' or 'Maternal'
str_extract(x, "Mum|Mother|Maternal")

Regular Expressions

Also allows for patterns to be anchored
- ^ anchors a match to the start
- $ anchors a match to the end

## Extract the first word
str_extract(x, "^[:alnum:]+")
## Or the last word
str_extract(x, "[:alnum:]+$")

Regular Expressions

To thoroughly confuse everyone: ^ has a second meaning
- Inside [] it means not

## Match a pattern at the end which doesn't contain a space
str_extract(x, "[^ ]+$")
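
Combining anchors with a negated set is a common idiom. As a sketch, the made-up file paths below (an assumption, not part of the session data) are split at the first and last forward slash.

```r
library(stringr)

## Hypothetical file paths
paths <- c("data/raw/sample1.csv", "results/fit.rds")

## One or more non-slash characters at the end: the file name
file_names <- str_extract(paths, "[^/]+$")
file_names

## One or more non-slash characters at the start: the top-level directory
top_dirs <- str_extract(paths, "^[^/]+")
top_dirs
```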

Regular expressions are fun to write
$\implies$ horrible to read back!

`stringr::str_view()`

We can check our matches in detail using str_view()

str_view(x, "r")
str_view(x, "r$")
str_view(x, "M[aeiou]t")
str_view(x, "^[^ ]+", match = NA, html = TRUE)

`stringr::str_replace()`

str_replace() is used for extracting/modifying text strings
- Even more powerful than str_extract()

str_replace(x, pattern = "Mum", replacement = "Dad")

Searching the string “Hi Mum” for the pattern “Mum”, and
Replacing the first instance of “Mum” with “Dad”

`stringr::str_replace()`

Wildcards and character sets work in the exact same manner

str_replace(x, "M[a-z]", "Da")
str_replace(x, "M.{2}", "Dad")
str_replace(x, "M.+", "Dad")

`stringr::str_replace()`

The use of capturing patterns makes this extremely flexible
We can capture words/phrases/patterns by wrapping them in parentheses: (pattern)
- Captured patterns can be returned in the replacement by their numeric order of capture (\\1, \\2, ...)
- In the following, we capture only one pattern

str_replace(x, "H.+ (.+)", "\\1")

Now let’s capture two patterns

str_replace(x, "(H.+) (M.+)", "\\2! \\1!")
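
The same idea can reorder parts of a string. In this sketch the vector `nm` is a made-up example of names recorded as 'Surname, Given'; the two captured parts are swapped in the replacement.

```r
library(stringr)

## Hypothetical names recorded as 'Surname, Given'
nm <- c("Curie, Marie", "Franklin, Rosalind")

## Capture everything before the comma (\\1) and everything after ', ' (\\2),
## then return them in the opposite order
swapped <- str_replace(nm, "^([^,]+), (.+)$", "\\2 \\1")
swapped
```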

`stringr::str_replace()`

str_replace() only replaces the first match in a string
str_replace_all() replaces all matches

str_replace(x, "[Mm]", "b")
str_replace_all(x, "[Mm]", "b")
str_replace_all(x, "[a-z]", "*")

Brief Summary

str_detect() $\implies$ logical vector
str_remove() / str_remove_all() $\implies$ remove matching patterns
str_extract() $\implies$ extract matching patterns
str_replace() / str_replace_all() $\implies$ modify a character vector

Brief Summary

All regex based operations
Can all be piped

c("M", "F", "MAle", "Female") |> str_extract("^[MF]")

Regular Expressions are very powerful
Horrible to read
Exist in most programming languages $\implies$ not specific to R
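
Since each function takes the string as its first argument, several steps can be chained with the pipe. The file name below is hypothetical, and `str_trim()` (which strips leading and trailing whitespace) is a stringr function not otherwise covered in these slides.

```r
library(stringr)

## A hypothetical, messy file name
raw <- "  RNA_Sample-01.fastq.gz "

clean <- raw |>
  str_trim() |>                      # drop surrounding whitespace
  str_remove("\\.fastq\\.gz$") |>    # drop the suffix (dots escaped)
  str_to_lower() |>                  # standardise the case
  str_replace("-", "_")              # use a single separator
clean
```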

Additional Functions

More Helpful Functions

str_count(x, "[Mm]")
str_length(x)
str_to_lower(x)
str_to_upper(x)
str_split_fixed(x, pattern = " ", n = 2)
str_wrap(x, width = 8)
str_starts(x, "Hi")
str_ends(x, "[rt]")
str_flatten(x, collapse = "; ")
str_trunc(x, width = 7)
str_to_title("a bad example")

More Helpful Functions

Pseudo-numeric strings are also handled well

str_pad(c("1", "10", "100"), width = 3, pad = "0")
str_sort(c("1", "10", "2"))
str_sort(c("1", "10", "2"), numeric = TRUE)
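
The numeric = TRUE option also helps when numbers are embedded within text. A small sketch, using made-up sample names:

```r
library(stringr)

## Hypothetical sample names with an embedded number
samples <- c("sample10", "sample2", "sample1")

## Default sorting compares character by character
sorted_default <- str_sort(samples)
sorted_default

## numeric = TRUE sorts runs of digits by their numeric value
sorted_numeric <- str_sort(samples, numeric = TRUE)
sorted_numeric
```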

Additional Tools and Tricks

The function paste() is a very useful one
- The default separator is " "
- paste0() has the default separator as ""

paste(x, "How are you?")
paste(x, "How are you?", sep = ". ")
paste0(x, "!")
paste(x, collapse = "! ")
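
The arguments sep and collapse do different jobs, and the sketch below tries to show this with made-up labels: sep joins the arguments element by element, whilst collapse joins the resulting vector into a single string.

```r
## sep joins the arguments element-wise, returning a vector of length 3
labels <- paste("Sample", 1:3, sep = "_")
labels

## collapse then joins that vector into one string
one_string <- paste("Sample", 1:3, sep = "_", collapse = ", ")
one_string
```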

Additional Tools and Tricks

The package glue has revolutionised text manipulation
- We can pass R objects or function calls to the middle of a text string
- We do need to be careful with quotation marks here

library(glue)
glue("When they answered, I said '{x}!'")
glue("I call them {str_remove(x, 'H.+ ')}")
glue_collapse(letters, sep = ", ", last = " & ")

Output is of class glue
- Coerces back to character
- Plays very well with advanced tidyverse syntax (e.g. rlang)
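
As a further sketch of passing function calls inside the braces, the example below builds a one-line summary of the session vector `x`. The wording of the message is arbitrary.

```r
library(stringr)
library(glue)
x <- c("Hi Mum", "Hi Mother", "Hello Maternal Parent")

## Any R expression inside {} is evaluated and inserted into the string
n <- length(x)
msg <- glue("There are {n} greetings; the longest has {max(str_length(x))} characters")
msg

## Coerce back to a plain character vector if required
as.character(msg)
```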

Working With Strings

Is an incredibly common and important part of working with R
Extract sample IDs from file names
Pull key information from columns
Remove prefixes/suffixes
Correct data entry errors
Format for pretty plotting

Challenges: Slide 1

Given a vector of Transcript IDs with versions, remove the version number

ids <- c("ENST00000376207.10", "ENST00000376199.7")

Add the ‘chr’ prefix to these chromosomes

chr <- c(1:22, "X", "Y", "M")

Pull the chromosome out of these cytogenetic bands

cyto <- c("Xp11.23", "11q2.3", "2p7.1")

Change these phone numbers to start with +61 instead of 0

phones <- c("0499123456", "0498760432")

Challenges: Slide 2

Remove the suffix “.bam” from these filenames

bams <- c("rna_bamboo1.bam", "rna_rice1.bam", "rna_wheat1.bam")

Correct the responses to be consistent (choose the format)

response <- c("Y", "yes", "No", "no")

Correct these recorded values to be consistently M/F or Male/Female

sex <- c("M", "male", "femal", "Female")

Factors

A common data type in statistics is a categorical variable (i.e. a factor)

Can appear to be a character vector/column
- Can easily trip an unsuspecting data scientist up
Data will be a set of common groups/categories

pet_vec <- c("Dog", "Dog", "Cat", "Dog", "Bird", "Bird")

This is a character vector

Factors

We can simply coerce this to a factor
as.factor() will automatically assign the categories (levels) in alphabetical order

pet_factors <- as.factor(pet_vec)
pet_factors

We can manually set these categories as levels using factor()

pet_factors <- factor(pet_vec, levels = c("Dog", "Cat", "Bird"))
pet_factors

Factors

These are actually stored as integers
Each integer corresponds to a level

str(pet_factors)
as.integer(pet_factors)
as.character(pet_factors)
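
The integer storage is where factors most often trip people up. In this sketch, the made-up vector `dose` holds numbers that were read in as text: calling as.integer() directly returns the level codes, not the values.

```r
## Hypothetical numeric values that were read in as text, then made a factor
dose <- factor(c("10", "5", "20"))

## Levels are sorted as text, not as numbers
levels(dose)

## as.integer() returns the position of each value within the levels
codes <- as.integer(dose)
codes

## Convert to character first to recover the actual values
values <- as.integer(as.character(dose))
values
```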

The package `forcats`

forcats is a part of the core tidyverse
- Specifically for wrangling factors
- Also plays very nicely with stringr

as.factor() and factor(levels = ...) are base functions
Most forcats functions start with fct_ or use _

as_factor() parallels as.factor()
- But uses the order of appearance, not alpha-numeric sorting
fct() replicates factor() with stricter error handling
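
A short comparison may help here. The sketch below assumes forcats 1.0.0 or later (where fct() is available) and deliberately omits "Bird" from the levels to show the difference in error handling.

```r
library(forcats)
pet_vec <- c("Dog", "Dog", "Cat", "Dog", "Bird", "Bird")

## Base R sorts the levels; forcats keeps the order of appearance
base_levels <- levels(as.factor(pet_vec))
forcats_levels <- levels(as_factor(pet_vec))
base_levels
forcats_levels

## factor() silently turns values missing from 'levels' into NA
loose <- factor(pet_vec, levels = c("Dog", "Cat"))
sum(is.na(loose))

## fct() throws an error instead; caught here so the script keeps running
strict <- tryCatch(
  fct(pet_vec, levels = c("Dog", "Cat")),
  error = function(e) "error"
)
strict
```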

Some Handy Tricks

fct_inorder() sets categories in the order they appear
- Sort your data.frame then apply fct_inorder() for nice structured plots

fct_inorder(pet_vec)

fct_infreq() sets categories by their frequency

fct_infreq(pet_vec)
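
The "sort then apply fct_inorder()" tip might look like the following sketch. The data frame `pets` and its weights are invented for illustration, and base R's order() is used for the sorting to keep the example self-contained.

```r
library(forcats)

## Hypothetical data: one row per pet with a made-up weight in kg
pets <- data.frame(
  pet = c("Bird", "Cat", "Dog"),
  weight = c(0.1, 4, 20),
  stringsAsFactors = FALSE
)

## Sort from heaviest to lightest, then fix that order as the factor levels
pets <- pets[order(-pets$weight), ]
pets$pet <- fct_inorder(pets$pet)
levels(pets$pet)
```

Any plot using pet as a discrete axis would then follow this order and not the alphabetical default.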

Some Handy Tricks

Collapse categories with fewer than n entries

fct_lump_min(pet_vec, min = 3)

Collapse categories that make up no more than a proportion p of all entries

fct_lump_prop(pet_vec, prop = 0.3)

Reverse the order (will automatically coerce to a factor)

fct_rev(pet_vec)

Some Handy Tricks

Relabelling factors can take advantage of stringr

pet_vec |>
  as_factor() |>
  fct_relabel(.fun = str_to_lower)

We’ll learn more about tailoring the functions (i.e. inline functions) soon

Some Handy Tricks

We can combine multiple factors using fct_cross()

sz <- c("big", "small", "small", "big", "small", "tiny") |>
  factor(levels = c("tiny", "small", "big"))
pet_vec |>
  as_factor() |>
  fct_cross(sz)
