Working With Text

RAdelaide 2024

Dr Stevie Pederson

Black Ochre Data Labs
Telethon Kids Institute

July 9, 2024

Text Strings

Text Manipulation

  • Start a new R script: text.R
library(tidyverse)
  • Next, create the vector we’ll mess around with
x <- c("Hi Mum", "Hi Mother", "Hello Parent1")

Text Manipulation

  • Working with character vectors
  • One of the most common and regular tasks
    • Cleaning up column names
    • Cleaning up data
    • Tidying up text on plots
  • Particularly relevant when working with IDs
    • May appear as Run 1 in one file and Run_001 in another
  • Data providers often have their ID formats and edit the ones provided
    • Need to identify & extract/modify

Regular Expressions

  • We’re mostly familiar with words
    • Regular Expressions (regexp) are incredibly powerful tools in this space
    • regexp syntax is not unique to R
    • R does have a few unique “quirks” though
  • Will progress to categorical data \(\implies\) factors

Text Manipulation

  • The package stringr contains functions for text manipulation
  • Key Functions:
    • str_detect()
    • str_remove()
    • str_extract()
    • str_replace()
  • Alternatives to grepl(), grep(), gsub() etc. from base

stringr::str_detect()

  • str_detect() returns a logical vector
    • Same length as the input vector
str_detect(string = x, pattern = "Mum")

stringr::str_detect()

  • In the above we used basic words for the search pattern
  • Regular expressions use a . as a wild card
    • * has different meaning to many other contexts
str_detect(x, "Mu")
str_detect(x, "M.")
  • The wild-card value . obviously needed to follow M in this search

stringr::str_detect()

  • Alternative individual characters can be specified within []
str_detect(x, "M[ou]")
  • Here either an o or u needed to follow the M
  • Alternative sequences of letters can also be specified using |
    • | is common across many languages for OR
str_detect(x, "Mother|Parent")

stringr::str_detect()

  • We can also anchor patterns to:
    • The start of a stringr (^)
    • The end of a string ($)
str_detect(x, "r")
str_detect(x, "r$")
str_detect(x, "^Hi ")

stringr::str_view()

  • We can check our matches in detail using str_view()
str_view(x, "r")
str_view(x, "r$")
str_view(x, "M[ou]")
str_view(x, "M[ou]", match = NA, html = TRUE)

stringr::str_extract()

  • We can use str_extract() to extract patterns
str_extract(string = x, pattern = "Hi")
  • The same approach to alternative sequences will also work
str_extract(x, "Mother|Parent")
  • If no pattern is found NA is returned
    • This can be helpful if no matches are found

stringr::str_extract()

  • Wildcards can also be used
str_extract(x, "Hi M.")
str_extract(x, "Hi [MP].")
  • We can be extremely flexible with characters in the []
    • In the following, we’re asking for any upper-case letter
str_extract(x, "Hi [A-Z].")

stringr::str_extract()

  • We can extend the wildcard using +
    • This means ’repeat the previous pattern 1 or more times`
    • .` is a wildcard \(\implies\) repeat the wildcard (not the match) 1 or more times
str_extract(x, "Hi [A-Z].")
str_extract(x, "H.+ [A-Z].")
  • We could restrict the wildcard to be only lower-case letters
str_extract(x, "H.+ [A-Z][a-z]+")

stringr::str_extract()

  • Or just be even more general using [:alpha:]
str_extract(x, "H.+ [:alpha:]+")
  • The complete list of classes is at ?base::regex
  • We can also specify the exact number of times for a match
str_extract(x, "H.+ [:alpha:]{3}")
str_extract(x, "H.+ [:alpha:]{4}")

stringr::str_extract_all()

  • str_extract() will only return the first match
str_extract(x, "[Hh].")
  • str_extract_all() returns all matches
str_extract_all(x, "[Hh].")
  • Note that now we have a list of the same length as x
    • Each element contains all matches within the initial string

stringr::str_remove()

  • We can use the above tricks to remove patterns within our text
str_remove(x, "Hi ")
str_remove(x, "[aeiou]")
  • Notice that only the first match was removed
    • str_remove_all() will remove all occurences
str_remove_all(x, "[aeiou]")

Very useful for removing file suffixes etc

stringr::str_replace()

  • str_replace() is used for extracting/modifying text strings
    • Even more powerful than str_extract()
str_replace(x, pattern = "Mum", replacement = "Dad")
  1. Searching the string “Hi Mum” for the pattern “Mum”, and
  2. Replacing the first instance of “Mum” with “Dad”

stringr::str_replace()

  • Wildcards and character sets work in the exact same manner
str_replace(x, "M[a-z]", "Da")
str_replace(x, "[MP].{2}", "Dad")
str_replace(x, "[MP].+", "Dad")

stringr::str_replace()

  • The use of capturing patterns makes this extremely flexible
  • We can capture words/phrases/patterns using (pattern)
    • Captured patterns are able to be returned in numeric order of capture
    • In the following, we capture only one pattern
str_replace(x, "H.+ (.+)", "\\1")
  • Now let’s capture two patterns
str_replace(x, "(H.+) ([:alpha:]+)", "\\2! \\1!")

stringr::str_replace()

  • To tidy up the final digit use *
    • Like + except the match is zero or more times
str_replace(x, "(H.+) ([:alpha:]+)[0-9]*", "\\2! \\1!")

stringr::str_replace()

  • str_replace() only replaces the first match in a string
  • str_replace_all() replaces all matches
str_replace(x, "[Mm]", "b")
str_replace_all(x, "[Mm]", "b")

More Helpful Functions

str_remove(x, "Hi ")
str_count(x, "[Mm]")
str_length(x)
str_to_lower(x)
str_to_upper(x)
str_split_fixed(x, pattern = " ", n = 2)
str_wrap(x, width = 8)
str_starts(x, "Hi")
str_ends(x, "[:alpha:]")
str_flatten(x, collapse = "; ")
str_trunc(x, width = 7)
str_to_title("a bad example")
  • Pseudo-numeric strings are also handled well
str_pad(c("1", "10", "100"), width = 3, pad = "0")
str_sort(c("1", "10", "2"))
str_sort(c("1", "10", "2"), numeric = TRUE)

Additional Tools and Tricks

  • The function paste() is a very useful one
    • The default separator is " "
    • paste0() has the default separator as ""
paste(x, "How are you?")
paste(x, "How are you?", sep = ". ")
paste0(x, "!")
paste(x, collapse = "! ")

Additional Tools and Tricks

  • The package glue has revolutionised text manipulation
    • We can pass R objects or function calls to the middle of a text string
    • We do need to be careful with quotation marks here
library(glue)
glue("When they answered, I said '{x}!'")
glue("I call them {str_remove(x, 'H.+ ')}")
glue_collapse(letters, sep = ", ", last = " & ")
  • Output is of class glue
    • Coerces back to character
    • Plays very well with advanced tidyverse syntax (e.g. rlang)

Working With Strings

  • Is an incredibly common and important part of working with R
  • Extract sample IDs from file names
  • Pull key information from columns
  • Remove prefixes/suffixes
  • Correct data entry errors
  • Format for pretty plotting

Factors

Factors

A common data type in statistics is a categorical variable (i.e. a factor)

  • Can appear to be a character vector/column
    • Can easily trip an unsuspecting analyst up
  • Data will be a set of common groups/categories
pet_vec <- c("Dog", "Dog", "Cat", "Dog", "Bird", "Bird")
  • This is a character vector

Factors

  • We can simply coerce this to a vector of factors
  • Categories will automatically be assigned alphabetically using as.factor()
pet_factors <- as.factor(pet_vec)
pet_factors

We can manually set these categories as levels using factor()

pet_factors <- factor(pet_vec, levels = c("Dog", "Cat", "Bird"))
pet_factors

Factors

  • These are actually stored as integers
  • Each integer corresponds to a level
str(pet_factors)
as.integer(pet_factors)
as.character(pet_factors)

The package forcats

  • forcats is a part of the core tidyverse
    • Specifically for wrangling factors
    • Also plays very nicely with stringr

Some Handy Tricks

  • fct_inorder() sets categories in the order they appear
    • Sort your data.frame then apply fct_inorder() for nice structured plots
fct_inorder(pet_vec)
  • fct_infreq() sets categories by their frequency
fct_infreq(pet_vec)

Some Handy Tricks

  • Collapse categories with fewer than n entries
fct_lump_min(pet_vec, min = 3)
  • Collapse categories with fewer than p entries
fct_lump_prop(pet_vec, prop = 0.3)
  • Reverse the order (will automatically coerce to a factor)
fct_rev(pet_vec)