Working With Text

ASI: Introduction to R

Dr Stevie Pederson

Black Ochre Data Labs
The Kids Research Institute Australia

September 3, 2025

Text Strings

Text Manipulation

Wrangling text is a common task when using R

Renaming columns for better axis/legend labels in ggplot
- Change to title case
- Remove underscores & replace with spaces

Correcting data entry errors
- “Y”, “yes”, “No”

Extracting key information from filenames
- “passage01_treat.bam”, “passage01_control.bam”

Session Outline

Basic string manipulation using stringr
- Is loaded with the tidyverse
Brief introduction to regular expressions
Categorical variables using forcats
- Also loaded with the tidyverse

Text Manipulation

Start a new R script: text.R

library(tidyverse)

Next, create the vector we’ll mess around with

## Create a character vector for this session
treats <- c("apple pie", "banana split", "cherry tart", "apple crumble", "banana bread")

Key Utility Functions

Changing case is common and straightforward

## Convert every character to upper-case
str_to_upper(treats)

[1] "APPLE PIE"     "BANANA SPLIT"  "CHERRY TART"   "APPLE CRUMBLE"
[5] "BANANA BREAD"

## Convert the first letter of every word to upper-case
str_to_title(treats)

[1] "Apple Pie"     "Banana Split"  "Cherry Tart"   "Apple Crumble"
[5] "Banana Bread"

## Convert the first letter of the first word to upper-case
str_to_sentence(treats)

[1] "Apple pie"     "Banana split"  "Cherry tart"   "Apple crumble"
[5] "Banana bread"

str_to_lower() won’t have any effect here, as every value is already lower-case
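
str_to_lower() becomes useful when case is inconsistent, as with the data entry errors mentioned earlier. A minimal sketch, assuming a hypothetical vector of survey responses (the values are invented for illustration):

```r
library(stringr)

# Hypothetical survey responses entered with inconsistent case
responses <- c("Yes", "YES", "no", "No")

# Convert everything to lower-case so equivalent answers become identical
clean <- str_to_lower(responses)
clean

# Only two distinct values remain
unique(clean)
```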

Key Utility Functions

Really long strings can be truncated
- The given width sets the maximum length
- Any string exceeding this is shortened and will have ... in the last 3 positions

str_trunc(treats, 10)

[1] "apple pie"  "banana ..." "cherry ..." "apple c..." "banana ..."

Line breaks in R are encoded with "\n"
Can wrap axis labels at a maximum length

str_wrap(treats, 10)

[1] "apple pie"      "banana\nsplit"  "cherry\ntart"   "apple\ncrumble"
[5] "banana\nbread"

Key Utility Functions

We can simply count the number of characters

str_length(treats)

[1]  9 12 11 13 12

Padding strings can be super-helpful when dealing with numbers

1:10

 [1]  1  2  3  4  5  6  7  8  9 10

str_pad(1:10, width = 2, pad = "0")

 [1] "01" "02" "03" "04" "05" "06" "07" "08" "09" "10"
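
One reason padding helps is that text is sorted character by character, not by numeric value. A small sketch using made-up sample numbers:

```r
library(stringr)

# Hypothetical sample numbers, deliberately out of order
nums <- c(10, 2, 1)

# As plain text, "10" sorts before "2"
unpadded <- sort(as.character(nums))
unpadded

# After padding on the left (the default side), text order matches numeric order
padded <- sort(str_pad(nums, width = 2, pad = "0"))
padded
```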

Pattern Detection

## Find which values match a given pattern
str_detect(treats, "nana")

[1] FALSE  TRUE FALSE FALSE  TRUE

str_detect() contains a negate argument
Flips the results to those NOT matching the pattern

## Find which values DON'T match a given pattern
str_detect(treats, "nana", negate = TRUE)

[1]  TRUE FALSE  TRUE  TRUE FALSE
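
Because str_detect() returns a logical vector, it can be used wherever TRUE/FALSE values are expected, for example inside filter(). A sketch, assuming a small tibble where the price column is invented purely for illustration:

```r
library(stringr)
library(dplyr)
library(tibble)

treats <- c("apple pie", "banana split", "cherry tart", "apple crumble", "banana bread")

# Hypothetical prices, made up for this example
desserts <- tibble(treat = treats, price = c(6, 8, 7, 6.5, 5))

# Keep only the rows where treat contains "nana"
banana_only <- filter(desserts, str_detect(treat, "nana"))
banana_only
```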

Subsetting By Pattern

We can also subset elements which match our pattern

str_subset(treats, "apple")

[1] "apple pie"     "apple crumble"

negate is also an argument for str_subset()

str_subset(treats, "apple", negate = TRUE)

[1] "banana split" "cherry tart"  "banana bread"

Extracting Patterns

We can extract patterns from each value

str_extract(treats, "apple")

[1] "apple" NA      NA      "apple" NA

Or we can simply remove them
- NB: We’ve also removed the space after apple here

str_remove(treats, "apple ")

[1] "pie"          "banana split" "cherry tart"  "crumble"      "banana bread"

Replacing Patterns

## Replace the space with a dash
str_replace(treats, pattern = " ", replacement = "-")

[1] "apple-pie"     "banana-split"  "cherry-tart"   "apple-crumble"
[5] "banana-bread"

## Replace an `a` with `u`
str_replace(treats, pattern = "a", replacement = "u")

[1] "upple pie"     "bunana split"  "cherry turt"   "upple crumble"
[5] "bunana bread"

Note this only replaced the first occurrence

## Replace all `a`s with `u`s
str_replace_all(treats, pattern = "a", replacement = "u")

[1] "upple pie"     "bununu split"  "cherry turt"   "upple crumble"
[5] "bununu breud"
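
Combining str_replace_all() with str_to_title() covers the column-renaming task from the start of the session. A minimal sketch, assuming some hypothetical snake-case column names:

```r
library(stringr)

# Hypothetical column names, invented for illustration
cols <- c("bill_length_mm", "body_mass_g")

# Replace every underscore with a space, then change to title case
labels <- str_to_title(str_replace_all(cols, "_", " "))
labels
```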

Using `_all` versions

str_remove(treats, "a")

[1] "pple pie"     "bnana split"  "cherry trt"   "pple crumble" "bnana bread"

str_remove_all(treats, "a")

[1] "pple pie"     "bnn split"    "cherry trt"   "pple crumble" "bnn bred"

Using `_all` versions

str_extract_all() produces an R object known as a list
- A bit trickier to work with

str_extract(treats, "na")

[1] NA   "na" NA   NA   "na"

str_extract_all(treats, "na")

[[1]]
character(0)

[[2]]
[1] "na" "na"

[[3]]
character(0)

[[4]]
character(0)

[[5]]
[1] "na" "na"
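
A few ways of working with the list are sketched below. These are suggestions only; which is most convenient depends on the task.

```r
library(stringr)

treats <- c("apple pie", "banana split", "cherry tart", "apple crumble", "banana bread")
matches <- str_extract_all(treats, "na")

# lengths() returns the number of matches held in each list element
n_matches <- lengths(matches)
n_matches

# str_count() gives the same counts without creating a list
str_count(treats, "na")

# Double square brackets pull out a single element of the list
matches[[2]]
```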

Regular Expressions

Regular Expressions allow more powerful pattern matching. We can:
- match sets of characters
- include wildcards
- capture multiple patterns and return in any order

Regular expressions (regex) exist in most languages (e.g. Python, bash etc)
R does have some unique syntax
Too complex for our time-frame $\implies$ just a brief introduction

Sets Of Characters

Sets can be specified using []
- [aeiou] would match any vowel
- [abc] would match either a, b or c

Sets can include ranges
- [A-Z] for any uppercase letter
- [a-z] for any lowercase letter
- [0-9] for any digit

Sets Of Characters

Predefined sets also exist
- [:alpha:] matches any alphabetic character
- [:alnum:] matches any alpha-numeric character
All defined at ?regex
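
A short sketch of the predefined sets in use. Here they are written inside an outer pair of square brackets, which is the form described at ?regex. The values in x are hypothetical.

```r
library(stringr)

# Hypothetical sample labels mixing letters and digits
x <- c("passage01", "passage12")

# Remove every digit, leaving only the letters
letters_only <- str_remove_all(x, "[[:digit:]]")
letters_only

# Remove every alphabetic character, leaving only the digits
digits_only <- str_remove_all(x, "[[:alpha:]]")
digits_only
```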

Sets Of Characters

## Remove all vowels using [aeiou]
str_remove_all(treats, pattern = "[aeiou]")

[1] "ppl p"     "bnn splt"  "chrry trt" "ppl crmbl" "bnn brd"

## Replace all vowels with a dash
str_replace_all(treats, pattern = "[aeiou]", replacement = "-")

[1] "-ppl- p--"     "b-n-n- spl-t"  "ch-rry t-rt"   "-ppl- cr-mbl-"
[5] "b-n-n- br--d"

Wildcards

Unlike many common-use situations, the regex wildcard is .
- * has a different meaning

## Extract the letter a followed by anything
str_extract(treats, "a.")

[1] "ap" "an" "ar" "ap" "an"

Extending Matches

Matches can be extended by adding +
- Matches the previous character or set one or more times
- * matches zero or more times
If using .+ $\implies$ one or more wildcards

## Extract a followed by a wildcard, one or more times
str_extract(treats, "a.+")

[1] "apple pie"     "anana split"   "art"           "apple crumble"
[5] "anana bread"

## Extract the first word from each treat
str_extract(treats, "[a-z]+")

[1] "apple"  "banana" "cherry" "apple"  "banana"
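
The difference between + and * is easiest to see side by side. A small sketch using the same treats vector:

```r
library(stringr)

treats <- c("apple pie", "banana split", "cherry tart", "apple crumble", "banana bread")

# a followed by one or more p: an a with no p after it is not a match
one_or_more <- str_extract(treats, "ap+")
one_or_more

# a followed by zero or more p: an a on its own still matches
zero_or_more <- str_extract(treats, "ap*")
zero_or_more
```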

Capturing Patterns

Placing a pattern inside () “captures” the pattern
- Can be returned in the replacement
Captured patterns are given numbers $\implies$ the first capture is 1
- In R we return captures using \\1, \\2 etc

## Capture each word, then return in the opposite order
str_replace_all(treats, "([a-z]+) ([a-z]+)", "\\2 \\1")

[1] "pie apple"     "split banana"  "tart cherry"   "crumble apple"
[5] "bread banana"

## Just return one of the words amongst other text
str_replace_all(treats, "([a-z]+) ([a-z]+)", "I'd like \\2 for dessert")

[1] "I'd like pie for dessert"     "I'd like split for dessert"  
[3] "I'd like tart for dessert"    "I'd like crumble for dessert"
[5] "I'd like bread for dessert"
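
Captures are not limited to two. As a sketch, three captures can rearrange a date, assuming the hypothetical dates below are written as year-month-day:

```r
library(stringr)

# Hypothetical dates in year-month-day format
dates <- c("2025-09-03", "2024-12-25")

# Capture year, month and day, then return them as day/month/year
rearranged <- str_replace(dates, "([0-9]+)-([0-9]+)-([0-9]+)", "\\3/\\2/\\1")
rearranged
```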

Anchoring Patterns

A pattern can be anchored to the start of a string using ^

## Match any string containing an `a`
str_detect(treats, "a")

[1] TRUE TRUE TRUE TRUE TRUE

## Ensure the `a` is the first character
str_detect(treats, "^a")

[1]  TRUE FALSE FALSE  TRUE FALSE

A pattern can also be anchored to the end of a string using $

## Match any string containing an `e`
str_detect(treats, "e")

[1]  TRUE FALSE  TRUE  TRUE  TRUE

## Ensure the `e` is the last character
str_detect(treats, "e$")

[1]  TRUE FALSE FALSE  TRUE FALSE
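
Both anchors can be used together so the pattern must match the entire string. A brief sketch with a made-up vector:

```r
library(stringr)

# Hypothetical values, each containing "apple" in a different position
x <- c("apple", "pineapple", "apple pie")

# Anchored to the start only
starts <- str_detect(x, "^apple")
starts

# Anchored to the end only
ends <- str_detect(x, "apple$")
ends

# Anchored at both ends: the whole string must be "apple"
exact <- str_detect(x, "^apple$")
exact
```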

Excluding Characters

To make things confusing, placing ^ as the first character inside [] negates the set, matching any character NOT listed

## Remove all vowels
str_remove_all(treats, "[aeiou]")

[1] "ppl p"     "bnn splt"  "chrry trt" "ppl crmbl" "bnn brd"

## Remove everything that isn't a vowel
str_remove_all(treats, "[^aeiou]")

[1] "aeie"  "aaai"  "ea"    "aeue"  "aaaea"
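
Negated sets also work with ranges, which is handy for stripping punctuation and spaces. A sketch using an invented label:

```r
library(stringr)

# Hypothetical messy label, made up for illustration
label <- "Sample #12 (rep-3)"

# Remove everything that isn't a letter or a digit
tidy_label <- str_remove_all(label, "[^a-zA-Z0-9]")
tidy_label
```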

Escaping Characters

The use of special characters (e.g. ^, $, ., +, *, [, ], (, )) is very powerful

What if we need to exactly match one of them?
- Most commonly we need to match a . within a file name

We escape the default meaning of a character using \\
- This is how \\1 returns captures instead of the value 1
- To match . exactly, we use \\.
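
The effect of escaping can be sketched with the file names from the start of the session. The vector tricky is hypothetical and only there to show what an unescaped . would also match.

```r
library(stringr)

files <- c("passage01_treat.bam", "passage01_control.bam")

# Hypothetical names: the second has no . before bam
tricky <- c("a.bam", "abam")

# Unescaped, . is a wildcard so both names match
wildcard <- str_detect(tricky, ".bam$")
wildcard

# Escaped, only a literal . followed by bam at the end matches
literal <- str_detect(tricky, "\\.bam$")
literal

# Remove the file extension
stems <- str_remove(files, "\\.bam$")
stems
```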

Challenges

Replace the 0 at the beginning of these phone numbers with +61

phones <- c("0499123456", "0498760432")

Remove the transcript version numbers from the following

ids <- c("ENST00000376207.10", "ENST00000376199.7")

Note

Most challenges can be solved more than one way. The right way is the one that works!

Categorical Variables

Factors

Categorical Variables are called factors in R
Well handled by the package forcats
- Is loaded with library(tidyverse)

Can look like text strings but are subtly different
- Multiple repeated values $\implies$ categories
- e.g. The islands in the penguins dataset

Factors

Character vectors display values surrounded by quotation marks when printed
Factors display values without quotation marks
- In a tibble neither shows quotation marks: check the column type (<chr> or <fct>)

dose <- c("High", "Med", "Low")
dose

[1] "High" "Med"  "Low"

as.factor(dose)

[1] High Med  Low 
Levels: High Low Med

tibble(
  dose = dose, 
  factor = as.factor(dose)
)

# A tibble: 3 × 2
  dose  factor
  <chr> <fct> 
1 High  High  
2 Med   Med   
3 Low   Low

Factors

as.factor() will set categories (i.e. levels) in alpha-numeric order
- Character vectors will be coerced to factors automatically when plotting

## Notice the category levels are in order
## High Low Med
as.factor(dose)

[1] High Med  Low 
Levels: High Low Med

We can set manually using factor()

factor(dose, levels = c("Low", "Med", "High"))

[1] High Med  Low 
Levels: Low Med High
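
The level order is more than cosmetic, as it controls how values are sorted. A short sketch using base R only:

```r
dose <- c("High", "Med", "Low")
dose_fct <- factor(dose, levels = c("Low", "Med", "High"))

# Levels are stored in the order we specified
levels(dose_fct)

# Each value is stored as an integer pointing at its level
as.integer(dose_fct)

# Sorting the character vector is alphabetical
sort(dose)

# Sorting the factor follows the level order
sorted_fct <- as.character(sort(dose_fct))
sorted_fct
```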

Using `forcats`

as.factor() and factor() are in the package base
forcats provides as_factor() and fct()
- Very similar, but can differ in automatic ordering, e.g. as_factor() orders the levels of a character vector by first appearance

Can easily set levels by frequency: fct_infreq()
- Or in reverse: fct_rev()

Low frequency categories can be merged:
- fct_lump(), fct_lump_n(), fct_lump_prop()
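
A minimal sketch of these functions, assuming a small made-up vector of island names (not the real penguins data):

```r
library(forcats)

# Hypothetical observations: Dream x 3, Torgersen x 2, Biscoe x 1
islands <- factor(c("Dream", "Biscoe", "Dream", "Torgersen", "Dream", "Torgersen"))

# The default levels are alphabetical
levels(islands)

# Order levels from most to least frequent
by_freq <- fct_infreq(islands)
levels(by_freq)

# Reverse that order
levels(fct_rev(by_freq))

# Keep the single most frequent level and merge the rest into "Other"
lumped <- fct_lump_n(islands, n = 1)
levels(lumped)
```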

Using `forcats`

NA values can be set to a specific level
- fct_na_value_to_level()

stringr functions can be used to tidy levels:
- fct_relabel(f, .fun)
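
Both functions can be sketched on a small invented factor. The level names and the label "Missing" are hypothetical, and fct_na_value_to_level() assumes forcats 1.0.0 or later is installed.

```r
library(forcats)
library(stringr)

# Hypothetical treatment groups including a missing value
group <- factor(c("high_dose", "low_dose", NA, "low_dose"))

# Apply stringr functions to the levels: underscores to spaces, then title case
tidy <- fct_relabel(group, function(x) str_to_title(str_replace_all(x, "_", " ")))
levels(tidy)

# Turn NA values into an explicit level
explicit <- fct_na_value_to_level(tidy, level = "Missing")
levels(explicit)
```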
