Working With Text

ASI: Introduction to R

Author: Dr Stevie Pederson
Affiliation: Black Ochre Data Labs, The Kids Research Institute Australia

Published: September 3, 2025

Regular Expressions

  • Regular Expressions allow more powerful pattern matching. We can:
    • match sets of characters
    • include wildcards
    • capture multiple patterns and return them in any order
  • Regular expressions (regex) exist in most languages (e.g. Python, bash)
  • R does have some unique syntax
  • Too complex for our time-frame \(\implies\) just a brief introduction

Sets Of Characters

  • Sets can be specified using []
    • [aeiou] would match any vowel
    • [abc] would match either a, b or c
  • Sets can include ranges
    • [A-Z] for any uppercase letter
    • [a-z] for any lowercase letter
    • [0-9] for any digit
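As a minimal sketch of how ranges behave, the block below applies three of the sets above using stringr. The `treats` vector is an assumption, reconstructed from the outputs shown elsewhere in these slides, and `samples` is a hypothetical vector made up purely for illustration.

```r
library(stringr)

## Assumed example vector, reconstructed from the outputs in these slides
treats <- c("apple pie", "banana split", "cherry tart", "apple crumble", "banana bread")

## Hypothetical sample IDs (not part of the original slides)
samples <- c("Sample7", "sample12", "Ctrl3")

## Extract the first character in the range a-c from each treat
first_abc <- str_extract(treats, "[a-c]")
first_abc
## -> "a" "b" "c" "a" "b"

## Extract the first digit from each ID
first_digit <- str_extract(samples, "[0-9]")
first_digit
## -> "7" "1" "3"

## Remove every uppercase letter
no_upper <- str_remove_all(samples, "[A-Z]")
no_upper
## -> "ample7" "sample12" "trl3"
```

Note that `str_extract()` only returns the first match, which is why `sample12` gives `"1"` rather than `"12"`.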

Sets Of Characters

  • Predefined sets also exist
    • [:alpha:] matches any alphabetic character
    • [:alnum:] matches any alpha-numeric character
    • These belong inside a set, e.g. [[:alpha:]], which is the form base R requires
  • All are documented at ?regex
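The predefined sets can be tried out in the same way as hand-written ones. The sketch below uses the `[[:alpha:]]` form, with the class placed inside a set, and a hypothetical vector `x` that is not part of the original slides.

```r
library(stringr)

## Hypothetical strings (not part of the original slides)
x <- c("room 101", "lab-2b", "R2D2!")

## Remove all alphabetic characters
no_alpha <- str_remove_all(x, "[[:alpha:]]")
no_alpha
## -> " 101" "-2" "22!"

## Remove all alpha-numeric characters, leaving spaces & punctuation
no_alnum <- str_remove_all(x, "[[:alnum:]]")
no_alnum
## -> " " "-" "!"
```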

Sets Of Characters

## Remove all vowels using [aeiou]
str_remove_all(treats, pattern = "[aeiou]")
[1] "ppl p"     "bnn splt"  "chrry trt" "ppl crmbl" "bnn brd"  


## Replace all vowels with a dash
str_replace_all(treats, pattern = "[aeiou]", replacement = "-")
[1] "-ppl- p--"     "b-n-n- spl-t"  "ch-rry t-rt"   "-ppl- cr-mbl-"
[5] "b-n-n- br--d" 

Wildcards

  • Unlike many everyday settings (e.g. file searches, where * is used), the regex wildcard is .
    • It matches any single character
    • * has a different meaning
## Extract the letter a followed by any single character
str_extract(treats, "a.")
[1] "ap" "an" "ar" "ap" "an"

Extending Matches

  • Matches can be extended by adding +
    • + matches the previous character or set one or more times
    • * matches the previous character or set zero or more times
  • Using .+ \(\implies\) one or more wildcards, matching as many characters as possible
## Extract a followed by a wildcard, one or more times
str_extract(treats, "a.+")
[1] "apple pie"     "anana split"   "art"           "apple crumble"
[5] "anana bread"  
## Extract the first word from each treat
str_extract(treats, "[a-z]+")
[1] "apple"  "banana" "cherry" "apple"  "banana"
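The difference between + and * is easiest to see side by side. This is a small sketch, again assuming `treats` holds the five values implied by the outputs in these slides.

```r
library(stringr)

## Assumed example vector, reconstructed from the outputs in these slides
treats <- c("apple pie", "banana split", "cherry tart", "apple crumble", "banana bread")

## a followed by one or more p: no match (NA) unless at least one p follows
one_or_more <- str_extract(treats, "ap+")
one_or_more
## -> "app" NA NA "app" NA

## a followed by zero or more p: an a on its own is enough
zero_or_more <- str_extract(treats, "ap*")
zero_or_more
## -> "app" "a" "a" "app" "a"
```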

Capturing Patterns

  • Placing a pattern inside () “captures” the pattern
    • Can be returned in the replacement
  • Captured patterns are given numbers \(\implies\) the first capture is 1
    • In R we return captures using \\1, \\2 etc
## Capture each word, then return in the opposite order
str_replace_all(treats, "([a-z]+) ([a-z]+)", "\\2 \\1")
[1] "pie apple"     "split banana"  "tart cherry"   "crumble apple"
[5] "bread banana" 
## Just return one of the words amongst other text
str_replace_all(treats, "([a-z]+) ([a-z]+)", "I'd like \\2 for dessert")
[1] "I'd like pie for dessert"     "I'd like split for dessert"  
[3] "I'd like tart for dessert"    "I'd like crumble for dessert"
[5] "I'd like bread for dessert"  

Anchoring Patterns

  • A pattern can be anchored to the start of a string using ^
## Detect any string containing an `a`
str_detect(treats, "a")
[1] TRUE TRUE TRUE TRUE TRUE
## Ensure the `a` is the first character
str_detect(treats, "^a")
[1]  TRUE FALSE FALSE  TRUE FALSE
  • A pattern can also be anchored to the end of a string using $
## Detect any string containing an `e`
str_detect(treats, "e")
[1]  TRUE FALSE  TRUE  TRUE  TRUE
## Ensure the `e` is the last character
str_detect(treats, "e$")
[1]  TRUE FALSE FALSE  TRUE FALSE

Excluding Characters

  • To make things confusing, placing ^ as the first character inside [] negates the set, matching anything except the listed characters
## Remove all vowels
str_remove_all(treats, "[aeiou]")
[1] "ppl p"     "bnn splt"  "chrry trt" "ppl crmbl" "bnn brd"  
## Remove everything that isn't a vowel
str_remove_all(treats, "[^aeiou]")
[1] "aeie"  "aaai"  "ea"    "aeue"  "aaaea"

Escaping Characters

  • The use of special characters (e.g. ^, $, ., +, *, [, ], (, )) is very powerful
  • What if we need to exactly match one of them?
    • Most commonly we need to match a . within a file name
  • We escape the special meaning of a character using \\
    • The backslash is doubled because R strings need it to be escaped as well
    • The same escape is how \\1 returns a capture instead of the character 1
    • To match . exactly, we use \\.

Challenges

  1. Replace the 0 at the beginning of these phone numbers with +61
phones <- c("0499123456", "0498760432")
  2. Remove the transcript version numbers from the following
ids <- c("ENST00000376207.10", "ENST00000376199.7")
Note

Most challenges can be solved more than one way. The right way is the one that works!