Working With Text

ASI: Introduction to R

Dr Stevie Pederson

Black Ochre Data Labs
The Kids Research Institute Australia

September 3, 2025

Text Strings

Text Manipulation

Wrangling text is a common task when using R

  • Renaming columns for better axis/legend labels in ggplot
    • Change to title case
    • Remove underscores & replace with spaces
  • Correcting data entry errors
    • “Y”, “yes”, “No”
  • Extracting key information from filenames
    • “passage01_treat.bam”, “passage01_control.bam”

Session Outline

  • Basic string manipulation using stringr
    • Is loaded with the tidyverse
  • Brief introduction to regular expressions
  • Categorical variables using forcats
    • Also loaded with the tidyverse

Text Manipulation

  • Start a new R script: text.R
library(tidyverse)
  • Next, create the vector we’ll mess around with
## Create a character vector for this session
treats <- c("apple pie", "banana split", "cherry tart", "apple crumble", "banana bread")

Key Utility Functions

  • Changing case is common and straightforward
## Convert every character to upper-case
str_to_upper(treats)
[1] "APPLE PIE"     "BANANA SPLIT"  "CHERRY TART"   "APPLE CRUMBLE"
[5] "BANANA BREAD" 
## Convert the first letter of every word to upper-case
str_to_title(treats)
[1] "Apple Pie"     "Banana Split"  "Cherry Tart"   "Apple Crumble"
[5] "Banana Bread" 
## Convert the first letter of the first word to upper-case
str_to_sentence(treats)
[1] "Apple pie"     "Banana split"  "Cherry tart"   "Apple crumble"
[5] "Banana bread" 
  • str_to_lower() won’t have any effect here, as every value is already lower-case

Key Utility Functions

  • Really long strings can be truncated
    • Strings longer than the given width are cut to exactly that width
    • Any truncated string will have ... in the last 3 positions
str_trunc(treats, 10)
[1] "apple pie"  "banana ..." "cherry ..." "apple c..." "banana ..."
  • Line breaks in R are encoded with "\n"
  • Can wrap axis labels at a maximum length
str_wrap(treats, 10)
[1] "apple pie"      "banana\nsplit"  "cherry\ntart"   "apple\ncrumble"
[5] "banana\nbread" 

Key Utility Functions

  • We can simply count the number of characters
str_length(treats)
[1]  9 12 11 13 12
  • Padding strings can be super-helpful when dealing with numbers
1:10
 [1]  1  2  3  4  5  6  7  8  9 10
str_pad(1:10, width = 2, pad = "0")
 [1] "01" "02" "03" "04" "05" "06" "07" "08" "09" "10"
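
As a minimal sketch of why padding helps, zero-padded numbers can be pasted into sample names of a consistent width. The passage numbers and the "passage" prefix below are made up for illustration.

```r
library(stringr)

## Hypothetical passage numbers, for illustration only
passage <- c(1L, 2L, 10L)

## Pad to two digits, then build names of equal width
file_ids <- paste0("passage", str_pad(passage, width = 2, pad = "0"))
file_ids
```

Names of equal width tend to sort in the expected numeric order when treated as text.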

Pattern Detection

## Find which values match a given pattern
str_detect(treats, "nana")
[1] FALSE  TRUE FALSE FALSE  TRUE
  • str_detect() has a negate argument
  • Flips the results to those NOT matching the pattern
## Find which values DON'T match a given pattern
str_detect(treats, "nana", negate = TRUE)
[1]  TRUE FALSE  TRUE  TRUE FALSE
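
The logical vector returned by str_detect() behaves like any other logical vector in R, so it can be counted or used for subsetting. A small sketch, re-creating the treats vector so it runs on its own:

```r
library(stringr)

treats <- c("apple pie", "banana split", "cherry tart", "apple crumble", "banana bread")

## Store the TRUE/FALSE result
has_nana <- str_detect(treats, "nana")

## Count the matches (TRUE counts as 1)
sum(has_nana)

## Keep only the matching values
treats[has_nana]
```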

Subsetting By Pattern

  • We can also subset elements which match our pattern
str_subset(treats, "apple")
[1] "apple pie"     "apple crumble"
  • negate is also an argument for str_subset()
str_subset(treats, "apple", negate = TRUE)
[1] "banana split" "cherry tart"  "banana bread"

Extracting Patterns

  • We can extract patterns from each value
str_extract(treats, "apple")
[1] "apple" NA      NA      "apple" NA     
  • Or we can simply remove them
    • NB: We’ve also removed the space after apple here
str_remove(treats, "apple ")
[1] "pie"          "banana split" "cherry tart"  "crumble"      "banana bread"

Replacing Patterns

## Replace the space with a dash
str_replace(treats, pattern = " ", replacement = "-")
[1] "apple-pie"     "banana-split"  "cherry-tart"   "apple-crumble"
[5] "banana-bread" 


## Replace an `a` with `u`
str_replace(treats, pattern = "a", replacement = "u")
[1] "upple pie"     "bunana split"  "cherry turt"   "upple crumble"
[5] "bunana bread" 
  • Note this only replaced the first occurrence
## Replace all `a`s with `u`s
str_replace_all(treats, pattern = "a", replacement = "u")
[1] "upple pie"     "bununu split"  "cherry turt"   "upple crumble"
[5] "bununu breud" 
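
Replacement combines naturally with the case functions when tidying column names for plot labels. This is only a sketch, and the column names are hypothetical.

```r
library(stringr)

## Hypothetical column names, for illustration only
col_names <- c("bill_length", "body_mass")

## Swap every underscore for a space, then convert to title case
axis_labels <- str_to_title(str_replace_all(col_names, "_", " "))
axis_labels
```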

Using _all versions

str_remove(treats, "a")
[1] "pple pie"     "bnana split"  "cherry trt"   "pple crumble" "bnana bread" 
str_remove_all(treats, "a")
[1] "pple pie"     "bnn split"    "cherry trt"   "pple crumble" "bnn bred"    

Using _all versions

  • str_extract_all() produces an R object known as a list
    • A bit trickier to work with
str_extract(treats, "na")
[1] NA   "na" NA   NA   "na"
str_extract_all(treats, "na")
[[1]]
character(0)

[[2]]
[1] "na" "na"

[[3]]
character(0)

[[4]]
character(0)

[[5]]
[1] "na" "na"
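
One simple way to work with the list is to count how many matches each element holds. The sketch below uses lengths() from base R and compares it with str_count(), which returns the counts directly.

```r
library(stringr)

treats <- c("apple pie", "banana split", "cherry tart", "apple crumble", "banana bread")

## One list element per string
na_list <- str_extract_all(treats, "na")

## Number of matches inside each list element
n_matches <- lengths(na_list)
n_matches

## str_count() skips the list and counts the matches directly
str_count(treats, "na")
```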

Regular Expressions

Regular Expressions

  • Regular Expressions allow more powerful pattern matching. We can:
    • match sets of characters
    • include wildcards
    • capture multiple patterns and return in any order
  • Regular expressions exist in most languages (e.g. Python, bash)
  • R does have some unique syntax
  • Too complex for our time-frame \(\implies\) just a brief introduction

Sets Of Characters

  • Sets can be specified using []
    • [aeiou] would match any vowel
    • [abc] would match either a, b or c
  • Sets can include ranges
    • [A-Z] for any uppercase letter
    • [a-z] for any lowercase letter
    • [0-9] for any digit

Sets Of Characters

  • Predefined sets also exist
    • [:alpha:] matches any alphabetic character
    • [:alnum:] matches any alpha-numeric character
  • All defined at ?regex
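
A short sketch of the predefined sets in use. Here they are written inside an extra pair of square brackets, which is the form described at ?regex. The file name is the example from the start of the session, and [:digit:] is another of the sets listed at ?regex.

```r
library(stringr)

file <- "passage01_treat.bam"

## Remove every alphabetic character
no_letters <- str_remove_all(file, "[[:alpha:]]")
no_letters

## Remove every digit
no_digits <- str_remove_all(file, "[[:digit:]]")
no_digits

## Remove every letter and digit, leaving only the punctuation
leftovers <- str_remove_all(file, "[[:alnum:]]")
leftovers
```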

Sets Of Characters

## Remove all vowels using [aeiou]
str_remove_all(treats, pattern = "[aeiou]")
[1] "ppl p"     "bnn splt"  "chrry trt" "ppl crmbl" "bnn brd"  


## Replace all vowels with a dash
str_replace_all(treats, pattern = "[aeiou]", replacement = "-")
[1] "-ppl- p--"     "b-n-n- spl-t"  "ch-rry t-rt"   "-ppl- cr-mbl-"
[5] "b-n-n- br--d" 

Wildcards

  • Unlike many common-use situations, the regex wildcard is .
    • * has a different meaning
## Extract the letter a followed by anything
str_extract(treats, "a.")
[1] "ap" "an" "ar" "ap" "an"

Extending Matches

  • Matches can be extended by adding +
    • Matches the previous character or set one or more times
    • * matches zero or more times
  • If using .+ \(\implies\) one or more wildcards
## Extract a followed by a wildcard, one or more times
str_extract(treats, "a.+")
[1] "apple pie"     "anana split"   "art"           "apple crumble"
[5] "anana bread"  
## Extract the first word from each treat
str_extract(treats, "[a-z]+")
[1] "apple"  "banana" "cherry" "apple"  "banana"
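
The difference between + and * is easiest to see side by side. In this sketch both patterns look for an a followed by the letter p, but only + insists that at least one p is present.

```r
library(stringr)

treats <- c("apple pie", "banana split", "cherry tart", "apple crumble", "banana bread")

## a followed by one or more p: NA when no p follows any a
one_or_more <- str_extract(treats, "ap+")
one_or_more

## a followed by zero or more p: a lone a is enough to match
zero_or_more <- str_extract(treats, "ap*")
zero_or_more
```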

Capturing Patterns

  • Placing a pattern inside () “captures” the pattern
    • Can be returned in the replacement
  • Captured patterns are given numbers \(\implies\) the first capture is 1
    • In R we return captures using \\1, \\2 etc
## Capture each word, then return in the opposite order
str_replace_all(treats, "([a-z]+) ([a-z]+)", "\\2 \\1")
[1] "pie apple"     "split banana"  "tart cherry"   "crumble apple"
[5] "bread banana" 
## Just return one of the words amongst other text
str_replace_all(treats, "([a-z]+) ([a-z]+)", "I'd like \\2 for dessert")
[1] "I'd like pie for dessert"     "I'd like split for dessert"  
[3] "I'd like tart for dessert"    "I'd like crumble for dessert"
[5] "I'd like bread for dessert"  

Anchoring Patterns

  • A pattern can be anchored to the start of a string using ^
## Match any pattern with an `a`
str_detect(treats, "a")
[1] TRUE TRUE TRUE TRUE TRUE
## Ensure the `a` is the first character
str_detect(treats, "^a")
[1]  TRUE FALSE FALSE  TRUE FALSE
  • A pattern can also be anchored to the end of a string using $
## Match any pattern with an `e`
str_detect(treats, "e")
[1]  TRUE FALSE  TRUE  TRUE  TRUE
## Ensure the `e` is the last character
str_detect(treats, "e$")
[1]  TRUE FALSE FALSE  TRUE FALSE
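
Anchors combine well with sets when correcting data entry errors. The sketch below uses the responses from the start of the session and assumes that anything not starting with y or Y should be recorded as "No".

```r
library(stringr)

## Inconsistent data entry
responses <- c("Y", "yes", "No")

## Does the value start with an upper- or lower-case y?
is_yes <- str_detect(responses, "^[Yy]")

## Recode to consistent values (assumption: everything else is "No")
tidy_responses <- ifelse(is_yes, "Yes", "No")
tidy_responses
```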

Excluding Characters

  • To make things confusing, placing ^ first inside [] negates the set, matching any character NOT listed
## Remove all vowels
str_remove_all(treats, "[aeiou]")
[1] "ppl p"     "bnn splt"  "chrry trt" "ppl crmbl" "bnn brd"  
## Remove everything that isn't a vowel
str_remove_all(treats, "[^aeiou]")
[1] "aeie"  "aaai"  "ea"    "aeue"  "aaaea"
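
Negated sets also work with ranges. As a sketch, everything except the digits (or except the lower-case letters) can be stripped from one of the example file names.

```r
library(stringr)

file <- "passage01_treat.bam"

## Remove everything that isn't a digit
digits_only <- str_remove_all(file, "[^0-9]")
digits_only

## Remove everything that isn't a lower-case letter
letters_only <- str_remove_all(file, "[^a-z]")
letters_only
```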

Escaping Characters

  • The use of special characters (e.g. ^, $, ., +, *, [, ], (, )) is very powerful
  • What if we need to exactly match one of them?
    • Most commonly we need to match a . within a file name
  • We escape the default meaning of a character using \\
    • This is how \\1 returns captures instead of the value 1
    • To match . exactly, we use \\.

Challenges

  1. Replace the 0 at the beginning of these phone numbers with +61
phones <- c("0499123456", "0498760432")
  2. Remove the transcript version numbers from the following
ids <- c("ENST00000376207.10", "ENST00000376199.7")

Note

Most challenges can be solved more than one way. The right way is the one that works!

Categorical Variables

Factors

  • Categorical Variables are called factors in R
  • Well handled by the package forcats
    • Is loaded with library(tidyverse)
  • Can look like text strings but are subtly different
    • Multiple repeated values \(\implies\) categories
    • e.g. The islands in the penguins dataset

Factors

  • Character vectors display values surrounded by quotation marks when printed
  • Factors display values without quotation marks
    • Inside a tibble neither shows quotation marks, so check the column type (<chr> or <fct>)
dose <- c("High", "Med", "Low")
dose
[1] "High" "Med"  "Low" 
as.factor(dose)
[1] High Med  Low 
Levels: High Low Med
tibble(
  dose = dose, 
  factor = as.factor(dose)
)
# A tibble: 3 × 2
  dose  factor
  <chr> <fct> 
1 High  High  
2 Med   Med   
3 Low   Low   

Factors

  • as.factor() will set categories (i.e. levels) in alpha-numeric order
    • Character vectors are treated the same way (alpha-numeric order) when plotting
## Notice the category levels are in alphabetical order
## High Low Med
as.factor(dose)
[1] High Med  Low 
Levels: High Low Med
  • We can set manually using factor()
factor(dose, levels = c("Low", "Med", "High"))
[1] High Med  Low 
Levels: Low Med High
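
To see why the level order matters, here is a small sketch using only base R. It shows that a factor is stored as integers pointing at its levels, and that sorting follows the level order rather than the alphabet.

```r
dose <- c("High", "Med", "Low")

## Set the levels manually, from lowest to highest
dose_fct <- factor(dose, levels = c("Low", "Med", "High"))

## Each value is stored as the position of its level
as.integer(dose_fct)

## Sorting a factor follows the level order
sorted_dose <- sort(dose_fct)
sorted_dose
```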

Using forcats

  • as.factor() and factor() are in the package base
  • forcats provides as_factor() and fct()
    • Very similar, but can differ in automatic ordering
  • Can easily set levels by frequency: fct_infreq()
    • Or in reverse: fct_rev()
  • Low frequency categories can be merged:
    • fct_lump(), fct_lump_n(), fct_lump_prop()
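
A minimal sketch of these functions, assuming forcats version 1.0.0 or later is installed. The fruit vector is made up for illustration, with cherry appearing three times, banana twice and apple once.

```r
library(forcats)

## Hypothetical data: cherry x3, banana x2, apple x1
fruit <- factor(c("cherry", "banana", "cherry", "apple", "cherry", "banana"))

## factor() orders the levels alphabetically
levels(fruit)

## as_factor() keeps the order of first appearance
levels(as_factor(c("High", "Med", "Low")))

## Most frequent level first
by_freq <- fct_infreq(fruit)
levels(by_freq)

## Reversed, so the least frequent level comes first
levels(fct_rev(by_freq))

## Keep the single most common level and merge the rest into "Other"
lumped <- fct_lump_n(fruit, n = 1)
table(lumped)
```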

Using forcats

  • NA values can be set to a specific level
    • fct_na_value_to_level()
  • stringr functions can be used to tidy levels:
    • fct_relabel(f, .fun)
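
As a sketch of both ideas together, the factor below is a hypothetical lower-case version of some island names with one missing value. fct_na_value_to_level() is assumed to be available, which requires forcats 1.0.0 or later, and "Unknown" is an arbitrary label chosen for the example.

```r
library(forcats)
library(stringr)

## Hypothetical factor with lower-case levels and a missing value
island <- factor(c("biscoe", "dream", NA, "torgersen", "dream"))

## Apply a stringr function to the levels only
island_tidy <- fct_relabel(island, str_to_title)
levels(island_tidy)

## Turn the missing values into an explicit level
island_full <- fct_na_value_to_level(island_tidy, level = "Unknown")
table(island_full)
```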