Working With Text

RAdelaide 2025

Author: Dr Stevie Pederson
Affiliation: Black Ochre Data Labs, The Kids Research Institute Australia
Published: July 9, 2025

Regular Expressions

  • We’re mostly familiar with matching words
    • regex allows more powerful matching
  • Can include wildcards (.)
  • Can specify sets of values ([a-z])
    • [A-Z] for upper-case
    • [0-9] for numbers
    • [:alnum:] represents all alpha-numeric characters
      (stringr accepts this form directly; base R regex functions need [[:alnum:]])
  • Can extend matches using + for one or more repeats
    • Zero or more repeats can be specified using *
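
As a minimal sketch of how the wildcard, sets and quantifiers above behave, the following uses a small made-up vector (`words` is an assumption for illustration, not part of the original slides) and `str_detect()` / `str_extract()` from stringr.

```r
library(stringr)

# Hypothetical example vector, invented for this sketch
words <- c("cat", "cot", "coat", "ct", "Cat9")

# '.' matches any single character between 'c' and 't'
any_char <- str_detect(words, "c.t")
# A set matches exactly one character taken from the set
one_of_set <- str_detect(words, "c[ao]t")
# '+' asks for one or more characters from the set
one_or_more <- str_detect(words, "c[ao]+t")
# '*' asks for zero or more, so "ct" now matches as well
zero_or_more <- str_detect(words, "c[ao]*t")
# Sets of digits and upper-case letters
digits <- str_extract(words, "[0-9]+")
upper <- str_extract(words, "[A-Z]")
```

Only "Cat9" contains a digit or an upper-case letter, so `digits` and `upper` are `NA` everywhere else.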

Regular Expressions

  • In logical testing, the symbol | means OR
  • Can also be incorporated into patterns
## Extract the pattern 'Hi' or 'Hello'
str_extract(x, "Hi|Hello")
## Extract the pattern 'Mum', 'Mother' or 'Maternal'
str_extract(x, "Mum|Mother|Maternal")
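
The slides do not show how `x` is defined, so the sketch below assumes a small vector of greetings purely for illustration. It shows what alternation returns when a string matches one option, another option, or none, and how parentheses can limit the alternation to part of a pattern.

```r
library(stringr)

# Assumed example data; the original definition of x is not shown
x <- c("Hi Mum", "Hello Mother", "Goodbye Dad")

# Either greeting; strings with no match return NA
greeting <- str_extract(x, "Hi|Hello")
# Any of three alternatives
parent <- str_extract(x, "Mum|Mother|Maternal")
# Parentheses restrict the alternation to the part after 'M'
grouped <- str_extract(x, "M(um|other)")
```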

Regular Expressions

  • Also allows for patterns to be anchored
    • ^ anchors a match to the start
    • $ anchors a match to the end
## Extract the first word
str_extract(x, "^[:alnum:]+")
## Or the last word
str_extract(x, "[:alnum:]+$")
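
A short sketch of anchoring, again assuming an illustrative `x` since its definition is not shown in the slides. The last line is a base R equivalent, included to show that base R needs the doubled form `[[:alnum:]]` where stringr accepts `[:alnum:]`.

```r
library(stringr)

# Assumed example data; the original definition of x is not shown
x <- c("Hi Mum", "Hello Mother")

# '^' ties the match to the start of each string
first_word <- str_extract(x, "^[:alnum:]+")
# '$' ties the match to the end of each string
last_word <- str_extract(x, "[:alnum:]+$")
# Anchors also work with plain text: does the string start with "Hi"?
starts_hi <- str_detect(x, "^Hi")
# Base R equivalent of the first extraction, using [[:alnum:]]
first_word_base <- regmatches(x, regexpr("^[[:alnum:]]+", x))
```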

Regular Expressions

  • To thoroughly confuse everyone: ^ has a second meaning
    • As the first character inside [] it means not
## Match a pattern at the end which doesn't contain a space
str_extract(x, "[^ ]+$")
  • Regular expressions are fun to write
    \(\implies\) horrible to read back!
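
To make the two meanings of `^` concrete, here is a small sketch with an assumed `x` (not taken from the slides). The first pattern uses `^` only as negation, the second uses it both as an anchor and as negation.

```r
library(stringr)

# Assumed example data; the original definition of x is not shown
x <- c("Hi Mum", "Hello Mother")

# Negation only: a run of non-space characters at the end of the string
no_space_end <- str_extract(x, "[^ ]+$")
# Both meanings: anchored to the start (^), then a run of non-vowels ([^aeiou])
leading_consonants <- str_extract(x, "^[^aeiou]+")
```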

stringr::str_view()

  • We can check our matches in detail using str_view()
str_view(x, "r")
str_view(x, "r$")
str_view(x, "M[aeiou]t")
str_view(x, "^[^ ]+", match = NA, html = TRUE)

stringr::str_replace()

  • str_replace() is used for extracting/modifying text strings
    • Even more powerful than str_extract()
str_replace(x, pattern = "Mum", replacement = "Dad")
  • This call works by:
  1. Searching the string “Hi Mum” for the pattern “Mum”, and
  2. Replacing the first instance of “Mum” with “Dad”

stringr::str_replace()

  • Wildcards and character sets work in exactly the same way as when matching
str_replace(x, "M[a-z]", "Da")
str_replace(x, "M.{2}", "Dad")
str_replace(x, "M.+", "Dad")
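
The three patterns above replace different amounts of text. The sketch below assumes a two-element `x` (the slides do not show its definition) so that the difference between one character, exactly two characters, and "everything that follows" is visible.

```r
library(stringr)

# Assumed example data; the original definition of x is not shown
x <- c("Hi Mum", "Hi Mother")

# 'M' plus one lower-case letter is replaced; the rest of the word remains
one_letter <- str_replace(x, "M[a-z]", "Da")
# 'M' plus exactly two characters is replaced
two_chars <- str_replace(x, "M.{2}", "Dad")
# 'M' plus everything after it is replaced, because '+' is greedy
to_end <- str_replace(x, "M.+", "Dad")
```

With this input, only the greedy `M.+` pattern gives "Hi Dad" for both strings; the shorter patterns leave the tail of "Mother" behind.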

stringr::str_replace()

  • The use of capturing patterns makes this extremely flexible
  • We can capture words/phrases/patterns by wrapping them in parentheses: (pattern)
    • Captured patterns can be referred to in the replacement by number, in order of capture (\\1, \\2, ...)
    • In the following, we capture only one pattern
str_replace(x, "H.+ (.+)", "\\1")
  • Now let’s capture two patterns
str_replace(x, "(H.+) (M.+)", "\\2! \\1!")
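
As a sketch of what the captures contain, the example below assumes an illustrative `x` (not shown in the slides) and uses `str_match()`, which returns the full match and each captured group as columns of a character matrix.

```r
library(stringr)

# Assumed example data; the original definition of x is not shown
x <- c("Hi Mum", "Hello Mother")

# One capture: keep only the text after the space
kept <- str_replace(x, "H.+ (.+)", "\\1")
# Two captures: swap their order and add punctuation
swapped <- str_replace(x, "(H.+) (M.+)", "\\2! \\1!")
# str_match() returns the full match in column 1, then one column per capture
captures <- str_match(x, "(H.+) (M.+)")
first_capture <- captures[, 2]
second_capture <- captures[, 3]
```

Inspecting captures with `str_match()` before writing the replacement can make it easier to check that each numbered group holds what was intended.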

stringr::str_replace()

  • str_replace() only replaces the first match in a string
  • str_replace_all() replaces all matches
str_replace(x, "[Mm]", "b")
str_replace_all(x, "[Mm]", "b")
str_replace_all(x, "[a-z]", "*")
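
The difference between the two functions only shows up when a string contains more than one match. The sketch below assumes an `x` where the first element has two matches for `[Mm]` and the second has one.

```r
library(stringr)

# Assumed example data; the original definition of x is not shown
x <- c("Hi Mum", "Hello Mother")

# Only the first 'M' or 'm' in each string is replaced
first_only <- str_replace(x, "[Mm]", "b")
# Every 'M' or 'm' is replaced
every_match <- str_replace_all(x, "[Mm]", "b")
# Every lower-case letter is masked; upper-case letters and spaces remain
masked <- str_replace_all(x, "[a-z]", "*")
```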

Brief Summary

  • str_detect() \(\implies\) logical vector
  • str_remove() / str_remove_all() \(\implies\) remove matching patterns
  • str_extract() \(\implies\) extract matching patterns
  • str_replace() / str_replace_all() \(\implies\) modify a character vector
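
The four families of functions can be compared side by side on one input. This is a minimal sketch using an assumed vector `x`; the third element has no match, which shows how each function handles that case.

```r
library(stringr)

# Assumed example data, invented for this comparison
x <- c("Hi Mum", "Hello Mother", "Bye")

# Logical vector: does each string contain an 'M'?
has_m <- str_detect(x, "M")
# Remove the greeting and the space that follows it
no_greeting <- str_remove(x, "H[a-z]+ ")
# Extract the word starting with 'M'; NA where there is no match
m_word <- str_extract(x, "M[a-z]+")
# Replace the word starting with 'M'; unmatched strings are returned unchanged
replaced <- str_replace(x, "M[a-z]+", "Dad")
```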

Brief Summary

  • All are regex-based operations
  • All can be piped
c("M", "F", "MAle", "Female") |> str_extract("^[MF]")
  • Regular expressions are very powerful
  • But horrible to read
  • Exist in most programming languages \(\implies\) not specific to R
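
Because these functions take the character vector as their first argument, they chain naturally with the native pipe (R 4.1 or later). The sketch below extends the piped example with a hypothetical messy column, `sex`, adding a lower-case entry and a missing value that are not in the original slides.

```r
library(stringr)

# Hypothetical messy column; "male" and NA are added for illustration
sex <- c("M", "F", "MAle", "Female", "male", NA)

# Standardise case first, then keep the leading M or F
sex_clean <- sex |>
  str_to_upper() |>
  str_extract("^[MF]")
```

Missing values pass through both steps as `NA`, so they remain visible in the cleaned vector.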