library(tidyverse)
Working With Text
ASI: Introduction to R
Text Strings
Text Manipulation
Wrangling text is a common task in R
- Renaming columns for better axis/legend labels in ggplot
  - Change to title case
  - Remove underscores & replace with spaces
- Correcting data entry errors
  - “Y”, “yes”, “No”
- Extracting key information from filenames
  - “passage01_treat.bam”, “passage01_control.bam”
Session Outline
- Basic string manipulation using stringr
  - Is loaded with the tidyverse
- Brief introduction to regular expressions
- Categorical variables using forcats
  - Also loaded with the tidyverse
Text Manipulation
- Start a new R script: text.R
- Next, create the vector we’ll mess around with
## Create a character vector for this session
treats <- c("apple pie", "banana split", "cherry tart", "apple crumble", "banana bread")
Key Utility Functions
- Changing case is common and straightforward
## Convert every character to upper-case
str_to_upper(treats)
[1] "APPLE PIE" "BANANA SPLIT" "CHERRY TART" "APPLE CRUMBLE"
[5] "BANANA BREAD"
## Convert the first letter of every word to upper-case
str_to_title(treats)
[1] "Apple Pie" "Banana Split" "Cherry Tart" "Apple Crumble"
[5] "Banana Bread"
## Convert the first letter of the first word to upper-case
str_to_sentence(treats)
[1] "Apple pie" "Banana split" "Cherry tart" "Apple crumble"
[5] "Banana bread"
- str_to_lower() won’t have any effect here, as every value is already lower-case
Pattern Detection
## Find which values match a given pattern
str_detect(treats, "nana")
[1] FALSE TRUE FALSE FALSE TRUE
- str_detect() has a negate argument
  - Flips the results to those NOT matching the pattern
## Find which values DON'T match a given pattern
str_detect(treats, "nana", negate = TRUE)
[1] TRUE FALSE TRUE TRUE FALSE
Subsetting By Pattern
- We can also subset elements which match our pattern
str_subset(treats, "apple")
[1] "apple pie" "apple crumble"
- negate is also an argument for str_subset()
str_subset(treats, "apple", negate = TRUE)
[1] "banana split" "cherry tart" "banana bread"
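The link between str_detect() and str_subset() can be sketched as below. This assumes the treats vector from earlier, redefined so the block runs on its own; the desserts data frame and its treat column are made up for illustration.

```r
library(tidyverse)

treats <- c("apple pie", "banana split", "cherry tart", "apple crumble", "banana bread")

## str_subset() gives the same result as indexing with str_detect()
by_subset <- str_subset(treats, "apple")
by_detect <- treats[str_detect(treats, "apple")]
identical(by_subset, by_detect)
## [1] TRUE

## A logical vector can be summed to count the matches
n_banana <- sum(str_detect(treats, "nana"))
n_banana
## [1] 2

## str_detect() also works inside filter() on a (hypothetical) data frame
desserts <- tibble(treat = treats)
banana_rows <- filter(desserts, str_detect(treat, "nana"))
```

str_detect() tends to be the more useful of the two inside filter() or mutate(), while str_subset() is handy for plain vectors.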
Extracting Patterns
- We can extract patterns from each value
str_extract(treats, "apple")
[1] "apple" NA NA "apple" NA
- Or we can simply remove them
- NB: We’ve also removed the space after apple here
str_remove(treats, "apple ")
[1] "pie" "banana split" "cherry tart" "crumble" "banana bread"
Replacing Patterns
## Replace the space with a hyphen
str_replace(treats, pattern = " ", replacement = "-")
[1] "apple-pie" "banana-split" "cherry-tart" "apple-crumble"
[5] "banana-bread"
## Replace an `a` with `u`
str_replace(treats, pattern = "a", replacement = "u")
[1] "upple pie" "bunana split" "cherry turt" "upple crumble"
[5] "bunana bread"
- Note this only replaced the first occurrence in each value
## Replace all `a`s with `u`s
str_replace_all(treats, pattern = "a", replacement = "u")
[1] "upple pie" "bununu split" "cherry turt" "upple crumble"
[5] "bununu breud"
Using _all versions
str_remove(treats, "a")
[1] "pple pie" "bnana split" "cherry trt" "pple crumble" "bnana bread"
str_remove_all(treats, "a")
[1] "pple pie" "bnn split" "cherry trt" "pple crumble" "bnn bred"
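As a rough sketch of how these functions combine for the tasks listed at the start of the session (tidying labels and fixing data entry errors): the column names below are hypothetical, and the answers vector reuses the example values from earlier.

```r
library(tidyverse)

## Hypothetical column names, as they might appear in a spreadsheet
col_names <- c("body_mass", "flipper_length")

## Replace every underscore with a space, then change to title case
labels <- str_to_title(str_replace_all(col_names, "_", " "))
labels
## [1] "Body Mass"      "Flipper Length"

## Inconsistent data entry values
answers <- c("Y", "yes", "No")

## Make the case consistent first, then shorten the full words
clean <- str_to_lower(answers)
clean <- str_replace(clean, "yes", "y")
clean <- str_replace(clean, "no", "n")
clean
## [1] "y" "y" "n"
```

This is only one way to do it; the order of the steps matters, as converting to lower-case first means fewer patterns are needed.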
Regular Expressions
- Regular Expressions allow more powerful pattern matching. We can:
  - match sets of characters
  - include wildcards
  - capture multiple patterns and return in any order
- regex exist in most languages (e.g. python, bash etc)
  - R does have some unique syntax
- Too complex for our time-frame \(\implies\) just a brief introduction
Sets Of Characters
- Sets can be specified using []
  - [aeiou] would match any vowel
  - [abc] would match either a, b or c
- Sets can include ranges
  - [A-Z] for any uppercase letter
  - [a-z] for any lowercase letter
  - [0-9] for any digit
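A brief sketch of sets in action, assuming the treats vector from earlier (redefined so the block runs on its own). The samples vector is made up for illustration.

```r
library(tidyverse)

treats <- c("apple pie", "banana split", "cherry tart", "apple crumble", "banana bread")

## Extract the first vowel in each value
first_vowel <- str_extract(treats, "[aeiou]")
first_vowel
## [1] "a" "a" "e" "a" "a"

## Hypothetical sample names: which ones contain a digit?
samples <- c("passage01", "control")
has_digit <- str_detect(samples, "[0-9]")
has_digit
## [1]  TRUE FALSE

## Remove every upper-case letter
no_caps <- str_remove_all("Apple Pie", "[A-Z]")
no_caps
## [1] "pple ie"
```

Note that a set only ever matches a single character, no matter how many characters are listed inside the [].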
Wildcards
- Unlike many common-use situations, the regex wildcard is .
  - It matches any single character
  - * has a different meaning
## Extract the letter a followed by anything
str_extract(treats, "a.")
[1] "ap" "an" "ar" "ap" "an"
Extending Matches
- Matches can be extended by adding +
  - Matches the previous character (or set) one or more times
- * matches zero or more times
- If using .+ \(\implies\) one or more wildcards
## Extract a followed by a wildcard, one or more times
str_extract(treats, "a.+")
[1] "apple pie" "anana split" "art" "apple crumble"
[5] "anana bread"
## Extract the first word from each treat
str_extract(treats, "[a-z]+")
[1] "apple" "banana" "cherry" "apple" "banana"
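The difference between + and * is easiest to see side by side. A minimal sketch follows, where the vector x is made up for illustration and treats is redefined so the block runs on its own.

```r
library(tidyverse)

treats <- c("apple pie", "banana split", "cherry tart", "apple crumble", "banana bread")

## Made-up values with zero, one and two copies of the letter e
x <- c("brad", "bread", "breead")

## + needs at least one e between "br" and "a"
plus_match <- str_detect(x, "bre+a")
plus_match
## [1] FALSE  TRUE  TRUE

## * is also happy with no e at all
star_match <- str_detect(x, "bre*a")
star_match
## [1] TRUE TRUE TRUE

## Extract the first run of one or more p's from each treat
p_runs <- str_extract(treats, "p+")
p_runs
## [1] "pp" "p"  NA   "pp" NA
```

Both + and * grab as many characters as they can, which is why "a.+" above ran all the way to the end of each value.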
Capturing Patterns
- Placing a pattern inside () “captures” the pattern
  - Can be returned in the replacement
- Captured patterns are given numbers \(\implies\) the first capture is 1
  - In R we return captures using \\1, \\2 etc
## Capture each word, then return in the opposite order
str_replace_all(treats, "([a-z]+) ([a-z]+)", "\\2 \\1")
[1] "pie apple" "split banana" "tart cherry" "crumble apple"
[5] "bread banana"
## Just return one of the words amongst other text
str_replace_all(treats, "([a-z]+) ([a-z]+)", "I'd like \\2 for dessert")
[1] "I'd like pie for dessert" "I'd like split for dessert"
[3] "I'd like tart for dessert" "I'd like crumble for dessert"
[5] "I'd like bread for dessert"
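Captures are not limited to two groups. As a small sketch, the made-up dates below are rearranged from year-month-day to day/month/year using three captures.

```r
library(tidyverse)

## Hypothetical dates stored as text
dates <- c("2024-03-15", "2023-11-02")

## Capture the year, month and day, then return them in reverse order
reordered <- str_replace(dates, "([0-9]+)-([0-9]+)-([0-9]+)", "\\3/\\2/\\1")
reordered
## [1] "15/03/2024" "02/11/2023"
```

The text between the captures (here the hyphens) is part of the match but not part of any capture, so it is dropped unless written back into the replacement.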
Anchoring Patterns
- A pattern can be anchored to the start of a string using ^
## Match any value containing an `a`
str_detect(treats, "a")
[1] TRUE TRUE TRUE TRUE TRUE
## Ensure the `a` is the first character
str_detect(treats, "^a")
[1] TRUE FALSE FALSE TRUE FALSE
- A pattern can also be anchored to the end of a string using $
## Match any value containing an `e`
str_detect(treats, "e")
[1] TRUE FALSE TRUE TRUE TRUE
## Ensure the `e` is the last character
str_detect(treats, "e$")
[1] TRUE FALSE FALSE TRUE FALSE
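Anchors combine naturally with sets and +. A short sketch, assuming the treats vector from earlier (redefined so the block runs on its own):

```r
library(tidyverse)

treats <- c("apple pie", "banana split", "cherry tart", "apple crumble", "banana bread")

## Extract the last word: lower-case letters running up to the end
last_word <- str_extract(treats, "[a-z]+$")
last_word
## [1] "pie"     "split"   "tart"    "crumble" "bread"

## Keep only the treats starting with "banana"
bananas <- str_subset(treats, "^banana")
bananas
## [1] "banana split" "banana bread"

## Using both anchors forces the pattern to match the entire value
one_word <- str_detect(c("apple", "apple pie"), "^[a-z]+$")
one_word
## [1]  TRUE FALSE
```

The second value in the final example fails because the space is not in the set [a-z].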
Excluding Characters
- To make things confusing, placing ^ inside [] negates the characters inside []
## Remove all vowels
str_remove_all(treats, "[aeiou]")
[1] "ppl p" "bnn splt" "chrry trt" "ppl crmbl" "bnn brd"
## Remove everything that isn't a vowel
str_remove_all(treats, "[^aeiou]")
[1] "aeie" "aaai" "ea" "aeue" "aaaea"
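Negated sets are often the simplest way to strip unwanted characters. In this sketch the messy phone numbers are made up for illustration, and treats is redefined so the block runs on its own.

```r
library(tidyverse)

treats <- c("apple pie", "banana split", "cherry tart", "apple crumble", "banana bread")

## Hypothetical phone numbers typed with inconsistent separators
messy <- c("0499 123 456", "0498-760-432")

## Remove everything that isn't a digit
digits_only <- str_remove_all(messy, "[^0-9]")
digits_only
## [1] "0499123456" "0498760432"

## Extract everything up to (but not including) the first space
first_word <- str_extract(treats, "[^ ]+")
first_word
## [1] "apple"  "banana" "cherry" "apple"  "banana"
```

Describing what to keep, rather than listing every character to remove, usually gives a shorter pattern.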
Escaping Characters
- The use of special characters (e.g. ^, $, ., +, *, [, ], (, )) is very powerful
- What if we need to exactly match one of them?
  - Most commonly we need to match a . within a file name
- We escape the default meaning of a value using \\
  - This is how \\1 returns captures instead of the value 1
  - To match . exactly, we use \\.
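A sketch of escaping in practice, using the example file names from the start of the session. The particular patterns are just one possible approach.

```r
library(tidyverse)

files <- c("passage01_treat.bam", "passage01_control.bam")

## Unescaped, . is a wildcard so every character is replaced
all_replaced <- str_replace_all("a.b.c", ".", "-")
all_replaced
## [1] "-----"

## Escaped, only the literal . is replaced
dots_replaced <- str_replace_all("a.b.c", "\\.", "-")
dots_replaced
## [1] "a-b-c"

## Remove the file extension, anchored to the end of the name
no_ext <- str_remove(files, "\\.bam$")
no_ext
## [1] "passage01_treat"   "passage01_control"

## Capture the passage and the group, returning only the group
groups <- str_replace(files, "(passage[0-9]+)_([a-z]+)\\.bam", "\\2")
groups
## [1] "treat"   "control"
```

The two backslashes are needed because R itself treats a single backslash inside quotes as special.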
Challenges
- Replace the 0 at the beginning of these phone numbers with +61
phones <- c("0499123456", "0498760432")
- Remove the transcript version numbers from the following
ids <- c("ENST00000376207.10", "ENST00000376199.7")
Note
Most challenges can be solved more than one way. The right way is the one that works!
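One possible approach to each challenge is sketched below; try them yourself first. These are not the only solutions, just ones that use the anchors and escapes covered above.

```r
library(tidyverse)

phones <- c("0499123456", "0498760432")
ids <- c("ENST00000376207.10", "ENST00000376199.7")

## Anchor to the start so that only the leading 0 is replaced
intl <- str_replace(phones, "^0", "+61")
intl
## [1] "+61499123456" "+61498760432"

## Remove a literal . followed by one or more digits at the end
no_version <- str_remove(ids, "\\.[0-9]+$")
no_version
## [1] "ENST00000376207" "ENST00000376199"
```

Without the ^ anchor the first pattern would still work here, as str_replace() only changes the first match, but the anchor makes the intent explicit.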
Categorical Variables
Factors
- Categorical Variables are called factors in R
- Well handled by the package forcats
  - Is loaded with library(tidyverse)
- Can look like text strings but are subtly different
- Multiple repeated values \(\implies\) categories
  - e.g. The islands in the penguins dataset
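The "subtly different" part can be sketched with a short made-up vector of island names (a stand-in for the penguins data, which is not loaded here).

```r
## A made-up character vector with repeated values
islands <- c("Dream", "Biscoe", "Torgersen", "Biscoe")

## Convert to a factor using base R
island_fct <- factor(islands)

## The categories (levels) are stored once, in alphabetical order
island_levels <- levels(island_fct)
island_levels
## [1] "Biscoe"    "Dream"     "Torgersen"

## Under the hood each value is an integer pointing to a level
island_codes <- as.integer(island_fct)
island_codes
## [1] 2 1 3 1

## It prints like text, but is no longer a character vector
is.character(island_fct)
## [1] FALSE
```

The order of the levels is what controls things like the order of categories along a ggplot axis.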
Using forcats
- as.factor() and factor() are in the package base
- forcats provides as_factor() and fct()
  - Very similar, but can differ in automatic ordering
- Can easily set levels by frequency: fct_infreq()
  - Or in reverse: fct_rev()
- Low frequency categories can be merged: fct_lump(), fct_lump_n(), fct_lump_prop()
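A minimal sketch of these functions, using a made-up vector of island names rather than the full penguins data:

```r
library(tidyverse)

## Made-up values: Dream appears 3 times, Biscoe twice, Torgersen once
islands <- c("Dream", "Biscoe", "Dream", "Torgersen", "Dream", "Biscoe")

## Base R sorts the levels alphabetically
base_levels <- levels(factor(islands))
base_levels
## [1] "Biscoe"    "Dream"     "Torgersen"

## as_factor() keeps the order in which values first appear
forcats_levels <- levels(as_factor(islands))
forcats_levels
## [1] "Dream"     "Biscoe"    "Torgersen"

## Order the levels from most to least frequent
by_freq <- fct_infreq(factor(islands))
levels(by_freq)
## [1] "Dream"     "Biscoe"    "Torgersen"

## Reverse that order
levels(fct_rev(by_freq))
## [1] "Torgersen" "Biscoe"    "Dream"

## Keep the single most frequent level, merging the rest
lumped <- fct_lump_n(factor(islands), n = 1)
as.character(lumped)
## [1] "Dream" "Other" "Dream" "Other" "Dream" "Other"
```

Here the as_factor() and fct_infreq() orders happen to agree, because the most common island also appears first.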