Working With Text

RAdelaide 2024

Author

Affiliation

Dr Stevie Pederson

Black Ochre Data Labs
Telethon Kids Institute

Published

July 9, 2024

Text Strings

Text Manipulation

Start a new R script: text.R

library(tidyverse)

Next, create the vector we’ll mess around with

x <- c("Hi Mum", "Hi Mother", "Hello Parent1")

Text Manipulation

Working with character vectors
One of the most common and regular tasks
- Cleaning up column names
- Cleaning up data
- Tidying up text on plots

. . .

Particularly relevant when working with IDs
- May appear as Run 1 in one file and Run_001 in another
Data providers often have their ID formats and edit the ones provided
- Need to identify & extract/modify

Regular Expressions

We’re mostly familiar with words
- Regular Expressions (regexp) are incredibly powerful tools in this space
- regexp syntax is not unique to R
- R does have a few unique “quirks” though

. . .

Will progress to categorical data $\implies$ factors

Text Manipulation

The package stringr contains functions for text manipulation
Key Functions:
- str_detect()
- str_remove()
- str_extract()
- str_replace()
Alternatives to grepl(), grep(), gsub() etc. from base

`stringr::str_detect()`

str_detect() returns a logical vector
- Same length as the input vector

str_detect(string = x, pattern = "Mum")

`stringr::str_detect()`

In the above we used basic words for the search pattern
Regular expressions use a . as a wild card
- * has different meaning to many other contexts

str_detect(x, "Mu")
str_detect(x, "M.")

. . .

The wild-card value . obviously needed to follow M in this search

`stringr::str_detect()`

Alternative individual characters can be specified within []

str_detect(x, "M[ou]")

Here either an o or u needed to follow the M

. . .

Alternative sequences of letters can also be specified using |
- | is common across many languages for OR

str_detect(x, "Mother|Parent")

`stringr::str_detect()`

We can also anchor patterns to:
- The start of a stringr (^)
- The end of a string ($)

str_detect(x, "r")
str_detect(x, "r$")
str_detect(x, "^Hi ")

`stringr::str_view()`

We can check our matches in detail using str_view()

str_view(x, "r")
str_view(x, "r$")
str_view(x, "M[ou]")
str_view(x, "M[ou]", match = NA, html = TRUE)

`stringr::str_extract()`

We can use str_extract() to extract patterns

str_extract(string = x, pattern = "Hi")

. . .

The same approach to alternative sequences will also work

str_extract(x, "Mother|Parent")

If no pattern is found NA is returned
- This can be helpful if no matches are found

`stringr::str_extract()`

Wildcards can also be used

str_extract(x, "Hi M.")
str_extract(x, "Hi [MP].")

. . .

We can be extremely flexible with characters in the []
- In the following, we’re asking for any upper-case letter

str_extract(x, "Hi [A-Z].")

`stringr::str_extract()`

We can extend the wildcard using +
- This means ’repeat the previous pattern 1 or more times`
- .` is a wildcard $\implies$ repeat the wildcard (not the match) 1 or more times

str_extract(x, "Hi [A-Z].")
str_extract(x, "H.+ [A-Z].")

. . .

We could restrict the wildcard to be only lower-case letters

str_extract(x, "H.+ [A-Z][a-z]+")

`stringr::str_extract()`

Or just be even more general using [:alpha:]

str_extract(x, "H.+ [:alpha:]+")

The complete list of classes is at ?base::regex

. . .

We can also specify the exact number of times for a match

str_extract(x, "H.+ [:alpha:]{3}")
str_extract(x, "H.+ [:alpha:]{4}")

`stringr::str_extract_all()`

str_extract() will only return the first match

str_extract(x, "[Hh].")

. . .

str_extract_all() returns all matches

str_extract_all(x, "[Hh].")

. . .

Note that now we have a list of the same length as x
- Each element contains all matches within the initial string

Will talk in detail about lists tomorrow

`stringr::str_remove()`

We can use the above tricks to remove patterns within our text

str_remove(x, "Hi ")
str_remove(x, "[aeiou]")

. . .

Notice that only the first match was removed
- str_remove_all() will remove all occurences

str_remove_all(x, "[aeiou]")

. . .

Very useful for removing file suffixes etc

`stringr::str_replace()`

str_replace() is used for extracting/modifying text strings
- Even more powerful than str_extract()

str_replace(x, pattern = "Mum", replacement = "Dad")

Searching the string “Hi Mum” for the pattern “Mum”, and
Replacing the first instance of “Mum” with “Dad”

`stringr::str_replace()`

Wildcards and character sets work in the exact same manner

str_replace(x, "M[a-z]", "Da")
str_replace(x, "[MP].{2}", "Dad")
str_replace(x, "[MP].+", "Dad")

`stringr::str_replace()`

The use of capturing patterns makes this extremely flexible
We can capture words/phrases/patterns using (pattern)
- Captured patterns are able to be returned in numeric order of capture
- In the following, we capture only one pattern

str_replace(x, "H.+ (.+)", "\\1")

. . .

Now let’s capture two patterns

str_replace(x, "(H.+) ([:alpha:]+)", "\\2! \\1!")

`stringr::str_replace()`

To tidy up the final digit use *
- Like + except the match is zero or more times

str_replace(x, "(H.+) ([:alpha:]+)[0-9]*", "\\2! \\1!")

`stringr::str_replace()`

str_replace() only replaces the first match in a string
str_replace_all() replaces all matches

str_replace(x, "[Mm]", "b")
str_replace_all(x, "[Mm]", "b")

More Helpful Functions

str_remove(x, "Hi ")
str_count(x, "[Mm]")
str_length(x)
str_to_lower(x)
str_to_upper(x)
str_split_fixed(x, pattern = " ", n = 2)
str_wrap(x, width = 8)
str_starts(x, "Hi")
str_ends(x, "[:alpha:]")
str_flatten(x, collapse = "; ")
str_trunc(x, width = 7)
str_to_title("a bad example")

. . .

Pseudo-numeric strings are also handled well

str_pad(c("1", "10", "100"), width = 3, pad = "0")
str_sort(c("1", "10", "2"))
str_sort(c("1", "10", "2"), numeric = TRUE)

Additional Tools and Tricks

The function paste() is a very useful one
- The default separator is " "
- paste0() has the default separator as ""

paste(x, "How are you?")
paste(x, "How are you?", sep = ". ")
paste0(x, "!")
paste(x, collapse = "! ")

Additional Tools and Tricks

The package glue has revolutionised text manipulation
- We can pass R objects or function calls to the middle of a text string
- We do need to be careful with quotation marks here

library(glue)
glue("When they answered, I said '{x}!'")
glue("I call them {str_remove(x, 'H.+ ')}")
glue_collapse(letters, sep = ", ", last = " & ")

. . .

Output is of class glue
- Coerces back to character
- Plays very well with advanced tidyverse syntax (e.g. rlang)

Working With Strings

Is an incredibly common and important part of working with R
Extract sample IDs from file names
Pull key information from columns
Remove prefixes/suffixes
Correct data entry errors
Format for pretty plotting

Factors

A common data type in statistics is a categorical variable (i.e. a factor)

Can appear to be a character vector/column
- Can easily trip an unsuspecting analyst up
Data will be a set of common groups/categories

pet_vec <- c("Dog", "Dog", "Cat", "Dog", "Bird", "Bird")

This is a character vector

Factors

We can simply coerce this to a vector of factors
Categories will automatically be assigned alphabetically using as.factor()

pet_factors <- as.factor(pet_vec)
pet_factors

We can manually set these categories as levels using factor()

pet_factors <- factor(pet_vec, levels = c("Dog", "Cat", "Bird"))
pet_factors

Factors

These are actually stored as integers
Each integer corresponds to a level

str(pet_factors)
as.integer(pet_factors)
as.character(pet_factors)

A Potential Pitfall

What would happen if we think a factor is a character, and we use it to select values from a vector/matrix/data.frame?

. . .

names(pet_vec) <- pet_vec
pet_vec
pet_vec[pet_factors]

. . .

read_csv() and other readr functions always parse text as a character
- Older versions of read.csv() parsed text to factors by default
- Changed with R $\geq$ v4.0.0
If I want a factor, I explicitly make a factor
- During statistical analysis character vectors are always coerced

The package `forcats`

forcats is a part of the core tidyverse
- Specifically for wrangling factors
- Also plays very nicely with stringr

Some Handy Tricks

fct_inorder() sets categories in the order they appear
- Sort your data.frame then apply fct_inorder() for nice structured plots

fct_inorder(pet_vec)

. . .

fct_infreq() sets categories by their frequency

fct_infreq(pet_vec)

Some Handy Tricks

Collapse categories with fewer than n entries

fct_lump_min(pet_vec, min = 3)

. . .

Collapse categories with fewer than p entries

fct_lump_prop(pet_vec, prop = 0.3)

. . .

Reverse the order (will automatically coerce to a factor)

fct_rev(pet_vec)

Text Strings

Text Manipulation

Text Manipulation

Regular Expressions

Text Manipulation

stringr::str_detect()

stringr::str_detect()

stringr::str_detect()

stringr::str_detect()

stringr::str_view()

stringr::str_extract()

stringr::str_extract()

stringr::str_extract()

stringr::str_extract()

stringr::str_extract_all()

stringr::str_remove()

stringr::str_replace()

stringr::str_replace()

stringr::str_replace()

stringr::str_replace()

stringr::str_replace()

More Helpful Functions

Additional Tools and Tricks

Additional Tools and Tricks

Working With Strings

Factors

Factors

Factors

Factors

A Potential Pitfall

The package forcats

Some Handy Tricks

Some Handy Tricks

`stringr::str_detect()`

`stringr::str_detect()`

`stringr::str_detect()`

`stringr::str_detect()`

`stringr::str_view()`

`stringr::str_extract()`

`stringr::str_extract()`

`stringr::str_extract()`

`stringr::str_extract()`

`stringr::str_extract_all()`

`stringr::str_remove()`

`stringr::str_replace()`

`stringr::str_replace()`

`stringr::str_replace()`

`stringr::str_replace()`

`stringr::str_replace()`

The package `forcats`