Welcome & Introduction

RAdelaide 2024

Dr Stevie Pederson

Black Ochre Data Labs
Telethon Kids Institute

July 9, 2024




http://blackochrelabs.au/RAdelaide24

Introduction

Who Am I?

Stephen (Stevie) Pederson (They/Them)

  • Adelaide, Kaurna Country, SA
  • Bioinformatician, Black Ochre Data Labs, Telethon Kids Institute
  • Bioinformatician, Dame Roma Mitchell Cancer Research Laboratories (2020-2022)
  • Co-ordinator, UofA Bioinformatics Hub (2014-2020)
  • PhD (2008-2018): a Bayesian Model for Transcript-Level Analysis
    • MCMC Engine written in C & R (No RStudio. No Rcpp)

  • R User for ~20 years \(\implies\) learnt when R was difficult!
  • Senior Author of 7 Bioconductor Packages
    • ngsReports, extraChIPs, motifTestR, transmogR
    • strandCheckR, sSNAPPY, tadar

Made countless typos, horrible decisions and catastrophic errors

I crash R at least once a week…

Today’s Tutors

  • Dr Jimmy Breen, Dr Liza Kretzschmar & Dr Alastair Ludington (Black Ochre Data Labs)
  • Dr Paul Wang, Dr John Salamon (SAGC)
  • Dr Na (Charlotte) Sai (University of Adelaide)

Housekeeping

  • Toilets are back near the lifts
  • Catering will be downstairs in the foyer

Homepage and Material

  • The workshop homepage is http://blackochrelabs.au/RAdelaide24
    • Data and course material available here
    • Will stay live in perpetuity
  • Links to notes available
    • Slides are directly re-formatted as a simple webpage
    • Slides are visible by clicking the RevealJS link below the TOC
  • Group communication can be done through https://bioinformaticshubsa.slack.com/
    • Join the #radelaide24 channel

Course Aims

  • Provide a deep understanding of how to work with data in R
    • Importing Data
    • Visualising Data
    • Understanding Data
  • Enable use of modern analytic approaches
    \(\implies\) reproducible research
  • Not just how \(\implies\) a deep understanding of underlying structures
  • The more code you type the more you learn

A Brief Introduction to R

Why use R?

  • Heavily used for analysis of biological data (along with Python)
    • Can handle extremely large datasets
    • Packages explicitly designed for complex analysis
    • Huge user base of biological researchers
    • (Can be) very fast
  • Very easy to dynamically interact with large datasets
    • Can also run as static scripts on HPC clusters

Why use R?

  • Reproducible Research!!!
    • Keep records (i.e. scripts) of every step of every analysis
    • Transparent methods
    • Integration with version control such as git
  • Avoids common Excel pitfalls \(\implies\) we (almost) never modify files on disk!

Experience is the best teacher \(\implies\) please practice your skills

What is R?

  • Derivative of S (Chambers 1977)
  • R first appeared in 1993
    • Ross Ihaka and Robert Gentleman (U of Auckland)
    • Disentangled some proprietary S code \(\implies\) open-source
    • S ceased development in early 2000s (Chambers retired in 2005)
    • Now estimated >2 million users
    • Nice history article: Chambers (2020)

What is R?

  • Open source language
    • No corporate ownership \(\implies\) free software
    • Code is managed by the community of users
  • R is formally run by a volunteer committee (R Core)
    • Mostly academics
    • John Chambers is still a member
  • Annual release schedule + patches
    • Most recent is R 4.4.1 (released June 14, 2024)

Extending R, Chambers (2016)

R Packages

  • Packages are the key to R’s flexibility and power
    • Collections (or libraries) of related functions
    • ggplot2 \(\implies\) Generating plots
    • edgeR \(\implies\) Differential Gene Expression (DGE) for RNA-Seq

R Packages

  • \(>\) 16,000 packages are stored on CRAN (https://cran.r-project.org)
    • Not curated for statistical quality or documentation
    • Automated testing for successful installation
  • Bioconductor is a secondary repository (https://www.bioconductor.org)
    • \(>\) 2,200 packages with a more biological/genomics focus
    • Curated for language consistency & documentation

Helpful Resources


R for Data Science: https://r4ds.had.co.nz/

R Graphics Cookbook: https://r-graphics.org/

Using R

The R Console

  • Let’s try using R as a standalone tool \(\implies\) open R, NOT RStudio
    • On Linux: Open a terminal then enter R
    • On OSX: Click the R icon on your dock
    • On Windows: Click the R icon in your Start Menu
  • Do not open RStudio

The R Console

  • This is often referred to as the R Console
  • At its simplest, R is just a calculator (press Enter after each line)
1 + 1
[1] 2
2 * 2
[1] 4
2 ^ 3
[1] 8
  • R has many standard functions
sqrt(2)
[1] 1.414214
log10(1000)
[1] 3
  • We place the value inside the brackets after the function name
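
Functions can also take more than one value inside the brackets. The lines below are a minimal sketch using base R functions only; the numbers are purely illustrative.

```r
# Round a number to two decimal places; the second argument is named 'digits'
round(3.14159, digits = 2)

# Logarithm with an explicitly chosen base
log(8, base = 2)

# Functions can be nested: the inner call is evaluated first
sqrt(abs(-16))
```

Naming an argument (such as digits or base) is optional, but it tends to make code easier to read later.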

The R Console

We can create objects with names

x <- 5
  • We have just created an object called x
  • The <- symbol is like an arrow i.e. “put the value 5 into x”
    • Was a single key on keyboards in the 1970s

An APL Keyboard from the 1970s
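
As a small sketch of how the arrow behaves, the object y below is a hypothetical second object added only for illustration.

```r
x <- 5        # put the value 5 into x
y <- x * 2    # new objects can be built from existing ones
x <- x + 1    # assigning to the same name overwrites the old value

x
y
```

In RStudio, the shortcut Alt + - (Option + - on macOS) types the <- symbol for you.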

The R Console

  • View the contents of the object by entering its name in the Console
x
[1] 5
  • The object x only exists in the R Environment
  • We can pass objects to functions and perform operations on them
x + 1
[1] 6
sqrt(x)
[1] 2.236068
x^2
[1] 25
x > 1
[1] TRUE

The R Console

  • Everything we’ve just done is trivial
  • Real analysis isn’t
  • If we perform a series of steps
    • Should we keep a copy of what we’ve done?
    • If so, how should we do that?
  • A common strategy is to record our code as an R Script
  • RStudio makes that easy & convenient
  • Many people now use RMarkdown to combine analysis, results and figures

RStudio

Introduction to RStudio

R and RStudio are two separate but connected things

  • R is like the engine of your car
  • RStudio is the ‘cabin’ we use to control the engine
    • Comes with extra features unrelated to R that improve our ‘journey’
    • Known as an IDE (Integrated Development Environment)
  • R does all the calculations, manages the data, generates plots
    • i.e. gets us to our destination
  • RStudio helps manage our code, display the plots etc
    • i.e. makes our journey easier to navigate

What is RStudio

  • RStudio is a product of a for-profit company (Posit)
    • RStudio (Desktop) is free
    • RStudio Server has an annual licence fee in the $’000s
  • Posit employs many of the best & brightest package developers
    • e.g. tidyverse, bookdown, reticulate, roxygen2 etc.
    • The CEO (JJ Allaire) is still an active developer
  • Other IDEs also exist (e.g. emacs, VSCode)

Some very helpful features of RStudio

  • We can write scripts and execute code interactively
  • Predictive auto-completion
  • We can see everything we need (directories, plots, code, history etc.)
  • Use R Projects to manage each analysis
  • Integration with other languages
    • markdown, \(\LaTeX\), bash, python, C++, git etc.
  • Numerous add-ons to simplify larger tasks

Important Setup

  1. Create a directory on your computer for today’s material
    • We recommend RAdelaide24 in your home directory
  2. Now open RStudio
    • RStudio will always open in a directory somewhere
    • Look in the Files pane (bottom-right) to see where it’s looking
    • This is also the working directory for R

We want RStudio to be looking in our new directory (RAdelaide24)
\(\implies\) R Projects make this easy
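
The working directory can also be checked from the Console. This is a rough sketch using base R functions; the output will differ on every computer.

```r
# Print the directory R is currently working in
getwd()

# Show just the final folder name, e.g. RAdelaide24 once the project is set up
basename(getwd())

# List the files R can see in that directory
list.files()
```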

Create an R Project

(Not needed for anyone using Posit Cloud)

File > New Project > Existing Directory

  • Browse to your RAdelaide24 directory \(\implies\) Create Project

Create an R Project

  • The R Project name is always the directory name
  • Not essential, but good practice and extremely useful
  • The Project Menu is in the top-right of RStudio

Create An Empty R Script

  1. File > New File > R Script
  2. Save As DataImport.R

RStudio

This is the basic layout we often work with

The Script Window

  • This is just a text editor.
  • We enter our commands here but they are not executed
  • Forms a record of everything we’ve done
    • Can repeat our analysis exactly
  • We’ll return here later \(\implies\) but first a quick tour

The R Console

  • This is the R Console within the RStudio IDE
  • We’ve already explored this briefly
  • In the same grouping we also have Terminal
    • An approximation of a bash terminal (or PowerShell for Windows)
  • Background Jobs shows progress when compiling RMarkdown & Quarto
    • Not super relevant

The R Console

As well as performing simple calculations:

  • R has what we call an Environment (i.e. a Workspace)
  • We can define objects here or import data
    • Similar to a workbook in Excel with multiple worksheets
    • Much more flexible & powerful
    • Objects aren’t forced to be spreadsheets

The R Environment

Like we did earlier, in the R Console type:

x <- 5

Where have we created the object x?

  • Is it on your hard drive somewhere?
  • Is it in a file somewhere?
  • We have placed x in our R Environment
  • Formally known as your Global Environment
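
A short sketch of inspecting the environment from the Console. The object y is a hypothetical extra object, created only so that something can be removed.

```r
x <- 5
y <- 10   # hypothetical second object

# List the names of the objects we have created
ls()

# Remove y: it disappears from the session, and nothing on disk changes
rm(y)

# x is still there, y is gone
exists("x", inherits = FALSE)
exists("y", inherits = FALSE)
```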

The R Environment

  • The R Environment is like your desktop
  • We keep all our relevant objects here
    • Multiple objects are usually created during an analysis
    • Can save all the objects in your environment as a single .RData file
    • R can be set to automatically save your environment on exit
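
Saving and restoring objects might look something like the sketch below. A temporary file with a made-up name is used so that nothing is written into your project; save.image() is the usual shortcut for saving everything in the Global Environment.

```r
x <- 5
y <- "some text"

# Choose where to save (a temporary location, purely for illustration)
rdata_file <- file.path(tempdir(), "example.RData")

# Save every object we currently have into one file
save(list = ls(), file = rdata_file)

# Remove the objects, then restore them from the file
rm(x, y)
load(rdata_file)

x
y
```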

The History Tab

  • Next to the Environment Tab is the History Tab
  • Keeps a record of the last ~200 lines of code
    • Very useful for remembering steps during exploration
    • Best practice is to enter + execute code from the Script Window
  • We can generally ignore the Connections and any other tabs
    • A git tab will also appear for those who use git in their project

Accessing Help

?sqrt
  • This will take you to the Help pane for the sqrt() function
    • Contents may look confusing at this point but will become clearer
  • Many inbuilt functions are organised into a package called base
    • Packages group similar/related functions together
    • base is always installed and loaded with R
  • Click on the underlined word Index at the bottom for a list of functions in the base package
    • Absolutely no need to learn any of these
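
A few other base R ways of finding help from the Console, shown as a rough sketch:

```r
# Open the help page for a function (the same as typing ?sqrt)
help("sqrt")

# Show the arguments a function accepts without leaving the Console
args(round)

# Find functions whose names start with "sqrt"
apropos("^sqrt")
```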

Additional Sources For Help

  • Help pages in R can be hit & miss
    • Some are excellent and informative \(\implies\) some aren’t
  • Bioconductor has a support forum for Bioconductor packages
    • All packages have a vignette (again varying quality)
  • Google is your friend \(\implies\) maybe ChatGPT?

The Plots Pane

  • We’ve already seen the Files pane
  • Plots appear in the Plots pane
plot(cars)
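
As a small extension of the plot above, the sketch below adds axis labels and a title. cars is a dataset that ships with R, holding speed (mph) and stopping distance (ft); the title text is just an example.

```r
# Look at the first few rows of the built-in dataset
head(cars)

# The same scatter plot, with a title and axis labels
plot(
  cars$speed, cars$dist,
  xlab = "Speed (mph)",
  ylab = "Stopping distance (ft)",
  main = "Stopping distance by speed"
)
```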

Other Panes

  • The Packages Pane is a bad idea
    • Can be disabled by popular request (I always do)
    • Temptation to click is strong
    • Very bad for reproducible research!!!
  • Viewer Pane is used when compiling HTML documents from RMarkdown
  • Every tab can be minimised/maximised using the buttons on the top right
  • Window separators can be moved to resize panes manually
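
One reason clicking in the Packages Pane works against reproducibility is that a click leaves no record. A sketch of the scripted alternative is below; packages that ship with R are used as stand-ins, and in a real analysis these would be packages such as ggplot2.

```r
# Load packages with library() at the top of the script, not by clicking,
# so the script itself records what the analysis depends on
library(stats)
library(utils)

# Record the R version and loaded packages alongside the results
info <- sessionInfo()
info$R.version$version.string
info$basePkgs
```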

Cheatsheet and Shortcuts

Help > Cheatsheets > RStudio IDE Cheat Sheet

Page 2 has lots of hints:

  • Ctrl + 1 places focus on the Script Window
  • Ctrl + 2 places focus on the Console
  • Ctrl + 3 places focus on the Help Tab

References

Chambers, John M. 1977. Computational Methods for Data Analysis. New York: Wiley.
———. 2020. “S, R, and Data Science.” Proc. ACM Program. Lang. 4 (HOPL): 1–17.