Welcome & Introduction

RAdelaide 2024

Author
Affiliation

Dr Stevie Pederson

Black Ochre Data Labs
Telethon Kids Institute

Published

July 9, 2024




http://blackochrelabs.au/RAdelaide24

Introduction

Who Am I?

Stephen (Stevie) Pederson (They/Them)

  • Adelaide, Kaurna Country, SA
  • Bioinformatician, Black Ochre Data Labs, Telethon Kids Institute

. . .

  • Bioinformatician, Dame Roma Mitchell Cancer Research Laboratories (2020-2022)
  • Co-ordinator, UofA Bioinformatics Hub (2014-2020)

. . .

  • PhD (2008-2018) was a Bayesian Model for Transcript-Level Analysis
    • MCMC Engine written in C & R (No R Studio. No Rcpp)
  • Best week ever: NAIDOC week, NB Awareness Week + R coding

Who Am I?

Stephen (Stevie) Pederson (They/Them)

  • R User for ~20 years \(\implies\) learnt when R was difficult!
  • Senior Author of 7 Bioconductor Packages
    • ngsReports, extraChIPs, motifTestR, transmogR
    • strandCheckR, sSNAPPY, tadar

. . .

Made countless typos, horrible decisions and catastrophic errors

. . .

I crash R at least once a week…

Today’s Tutors

  • Dr Jimmy Breen, Dr Liza Kretzschmar & Dr Alastair Ludington (Black Ochre Data Labs)
  • Dr Paul Wang, Dr John Salamon (SAGC)
  • Dr Na (Charlotte) Sai (University of Adelaide)

Housekeeping

  • Toilets are back near the lifts
  • Catering will be downstairs in the foyer

Thanks to everyone for sending your information through regarding dietary needs and existing knowledge

Homepage and Material

  • The workshop homepage is http://blackochrelabs.au/RAdelaide24
    • Data and course material available here
    • Will stay live in perpetuity
  • Links to notes available
    • Slides are directly re-formatted as a simple webpage
    • Slides are visible by clicking the RevealJS link below the TOC
  • Group communication can be done through https://bioinformaticshubsa.slack.com/
    • Join the #radelaide24 channel

Course Aims

  • Provide a deep understanding of how to work with data in R
    • Importing Data
    • Visualising Data
    • Understanding Data
  • Enable use of modern analytic approaches
    \(\implies\) reproducible research
  • Not just how \(\implies\) a deep understanding of underlying structures
  • The more code you type the more you learn

A Brief Introduction to R

Why use R?

  • Heavily used for analysis of biological data (along with Python)
    • Can handle extremely large datasets
    • Packages explicitly designed for complex analysis
    • Huge user base of biological researchers
    • (Can be) very fast
  • Very easy to dynamically interact with large datasets
    • Can also run as static scripts on HPC clusters

I regularly work with data containing millions of lines

Why use R?

  • Reproducible Research!!!
    • Keep records (i.e. scripts) of every step of every analysis
    • Transparent methods
    • Integration with version control such as git
  • Avoids common Excel pitfalls (almost) never modify files on disk!

. . .

Experience is the best teacher \(\implies\) please practice your skills

  • Discuss column sorting with Simon although I believe it’s improved

What is R?

What is R?

  • Derivative of S (Chambers 1977)
  • R first appeared in 1993
    • Ross Ihaka and Robert Gentleman (U of Auckland)
    • Disentangled some proprietary S code \(\implies\) open-source
    • S ceased development in early 2000s (Chambers retired in 2005)
    • Now estimated >2 million users
    • Nice history article here (Chambers 2020)
  • Ross Ihaka is of NZ Maori descent
  • Last commercial release of S-Plus in 2007

What is R?

  • Open source language
    • No corporate ownership \(\implies\) free software
    • Code is managed by the community of users
  • R is formally run by a volunteer committee (R Core)
    • Mostly academics
    • John Chambers is still a member
  • Annual release schedule + patches
    • Most recent is R 4.4.1 (Jun 14)

Extending R, Chambers (2016)
  • Being open source creates headaches for University & Business IT departments
  • No guarantees of being virus free. Has inherent security flaws particularly using R data files (which are often part of packages)
  • The community self-regulates
  • Release Names are references to Peanuts cartoons

R Packages

  • Packages are the key to R’s flexibility and power
    • Collections (or libraries) of related functions
    • ggplot2 \(\implies\) Generating plots
    • edgeR \(\implies\) Differential Gene Expression (DGE) for RNA-Seq

R Packages

  • \(>\) 16,000 packages are stored on CRAN (https://cran.r-project.org)
    • Not curated for statistical quality or documentation
    • Automated testing for successful installation

. . .

  • Bioconductor is a secondary repository (https://www.bioconductor.org)
    • \(>\) 2,200 packages with a more biological/genomics focus
    • Curated for language consistency & documentation
  • The gg in ggplot2 stands for “Grammar of Graphics”
  • Crap packages are generally identified by the users and then just not-used
  • Statistical rigour is usually checked during review of the accompanying publication

Where is R used?

  • Google, ATO, ABS etc
  • Very large community of users in finance (Dirk Eddelbuettel - Rcpp)
  • Genomics, Ecological Research, Public Health, Politics…
  • Strong integration with HPC systems like Amazon, Hadoop
  • Growing Machine Learning capacity
  • Even has it’s own peer-reviewed Journal (The R Journal)
  • I was offered a position by the ABS in 2022 specifically for my R skills to use analysing the housing market.
  • BODL offered me a position the exact same day so I stayed in academia

Helpful Resources


https://r4ds.had.co.nz/

https://r-graphics.org/

Much of today is inspired by a two-day developers workshop I attended with Hadley Wickham. Also gave me an opportunity to have some great conversations with Winston Chang

Using R

The R Console

  • Let’s try using R as a standalone tool \(\implies\) open R NOT RStudio
    • On linux: Open a terminal then enter R
    • On OSX: Click on your dock
    • On Windows: Click in your Start Menu
  • Do not open

  • This is the ‘ugly, old-school’ way of using R
  • Still very useful, e.g. testing code on a HPC

The R Console

  • This is often referred to as the R Console
  • At it’s simplest R is just a calculator (Press Enter)
1 + 1
[1] 2
2 * 2
[1] 4
2 ^ 3
[1] 8
  • R has many standard functions
sqrt(2)
[1] 1.414214
log10(1000)
[1] 3
  • We place the value inside the brackets after the function name

I never use a calculator program on my laptop, always R

The R Console

We can create objects with names

x <- 5

. . .

  • We have just created an object called x
  • The <- symbol is like an arrow i.e. “put the value 5 into x
    • Was a single key on keyboards in the 1970s

An APL Keyboard from the 1970s

Object names can be anything but should start with a letter not a number or special character

The R Console

  • View the contents of the object by entering it’s name in the Console
x
[1] 5
  • The object x only exists in the R Environment

. . .

  • We can pass objects to functions and perform operations on them
x + 1
[1] 6
sqrt(x)
[1] 2.236068
x^2
[1] 25
x > 1
[1] TRUE

The R Console

  • Everything we’ve just done is trivial
  • Real analysis isn’t
  • If we perform a series of steps
    • Should we keep a copy of what we’ve done?
    • If so, how should we do that?

. . .

  • A common strategy is to record our code as an R Script
  • R Studio makes that easy & convenient
  • Many people now use RMarkdown to combine analysis, results and figures

R Studio

Introduction to RStudio

R and RStudio are two separate but connected things

  • R is like the engine of your car

. . .

  • RStudio is the ‘cabin’ we use to control the engine
    • Comes with extra features un-related to R that improve our ‘journey’
    • Known as an IDE (Integrated Development Environment)

. . .

  • R does all the calculations, manages the data, generates plots
    • i.e. gets us to our destination

. . .

  • RStudio helps manage our code, display the plots etc
    • i.e. makes our journey easier to navigate

What is RStudio

  • RStudio is product of a for profit company (Posit)
    • RStudio (Desktop) is free
    • RStudio Server has annual licence fee of $’000s
  • Posit employs many of the best & brightest package developers
    • e.g. tidyverse, bookdown, reticulate, roxygen2 etc.
    • The CEO (JJ Allaire) is still an active developer
  • Other IDEs also exist (e.g. emacs, VSCode)
  • I remember being at the launch of RStudio (Coventry, 2011). It was a room full of R programmers thinking “holy crap, this changes everything”
  • RStudio/Posit is a corporation whilst R is an academic-led volunteer community. So far relatively good relationship
  • Heard JJ Allaire present some of his latest work a few weeks ago

Some very helpful features of RStudio

  • We can write scripts and execute code interactively
  • Predictive auto-completion
  • We can see everything we need (directories, plots, code, history etc.)

. . .

  • Use R Projects to manage each analysis
  • Integration with other languages
    • markdown, \(\LaTeX\), bash, python, C++, git etc.
  • Numerous add-ons to simplify larger tasks

Important Setup

  1. Create a directory on your computer for today’s material
    • We recommend RAdelaide24 in your home directory

. . .

  1. Now open RStudio
    • RStudio will always open in a directory somewhere
    • Look in the Files pane (bottom-right) to see where it’s looking
    • This is also the working directory for R

. . .

We want RStudio to be looking in our new directory (RAdelaide24)
\(\implies\)R Projects make this easy

Create an R Project

(Not needed for any using the Posit cloud)

File > New Project > Existing Directory

  • Browse to your RAdelaide24 directory \(\implies\) Create Project

Create an R Project

  • The R Project name is always the directory name
  • Not essential, but good practice and extremely useful
  • The Project Menu is in the top-right of RStudio
  • R Projects are simply a wrapper for keeping an analysis organised
    • Will always open in the R Project directory
    • You can easily navigate to a directory with all scripts and data
    • Makes managing file paths from your code very simple
  • R Projects can be particularly helpful when loading external files
  • Also when saving/exporting lots of files as part of your analysis

Create An Empty R Script

  1. File > New File > R Script
  2. Save As DataImport.R

RStudio

This is the basic layout we often work with

The Script Window

  • This is just a text editor.
  • We enter our commands here but they are not executed
  • Forms a record of everything we’ve done
    • Can repeat our analysis exactly
  • We’ll return here later \(\implies\) but first a quick tour

The R Console

The R Console

  • This is the R Console within the RStudio IDE
  • We’ve already explored this briefly

. . .

  • In the same grouping we also have Terminal
    • An approximation of a bash terminal (or PowerShell for Windows)

. . .

  • Background Jobs shows progress when compiling RMarkdown & Quarto
    • Not super relevant

The R Console

As well as performing simple calculations:

  • R has what we call an Environment (i.e. a Workspace)
  • We can define objects here or import data
    • Similar to a workbook in Excel with multiple worksheets
    • Much more flexible & powerful
    • Objects aren’t forced to be spreadsheets
  • When we create a new sheet in Excel, we’re actually creating an object.
  • Most commonly, it’s named Sheet1 or something similar
  • Has fixed dimensions for memory management

The R Environment

Like we did earlier, in the R Console type:

x <- 5

. . .

Where have we created the object x?

  • Is it on your hard drive somewhere?
  • Is it in a file somewhere?

. . .

  • We have placed x in our R Environment
  • Formally known as your Global Environment

The R Environment

  • The R Environment is like your desktop
  • We keep all our relevant objects here
    • Multiple objects are usually created during an analysis
    • Can save all the objects in your environment as a single .RData object
    • R can be set to automatically save your environment on exit

The History Tab

  • Next to the Environment Tab is the History Tab
  • Keeps a record of the last ~200 lines of code
    • Very useful for remembering steps during exploration
    • Best practice is to enter + execute code from the Script Window

. . .

  • We can generally ignore the Connections and any other tabs
    • A git tab will also appear for those who use git in their project

Accessing Help

?sqrt
  • This will take you to the Help pane for the sqrt() function
    • Contents may look confusing at this point but will become clearer

. . .

  • Many inbuilt functions are organised into a package called base
    • Packages group similar/related functions together
    • base is always installed and loaded with R
  • Click on the underlined word Index at the bottom for a list of functions in the base packages
    • Absolutely no need to learn any of these
  • May be issues with URL '/help/library/base/html/00Index.html' not found
  • The examples in this help page are a bit rubbish…

Additional Sources For Help

  • Help pages in R can be hit & miss
    • Some are excellent and informative \(\implies\) some aren’t
  • Bioconductor has a support forum for Bioconductor packages
    • All packages have a vignette (again varying quality)

. . .

  • Google is your friend \(\implies\) maybe ChatGPT?

As a package author, I’m always reading my own help pages. I simply can’t remember everything I’ve written

The Plots Pane

  • We’ve already seen the Files pane
  • Plots appear in the Plots pane
plot(cars)

Other Panes

  • The Packages Pane is a bad idea
    • Can be disabled by popular request (I always do)
    • Temptation to click is strong
    • Very bad for reproducible research!!!

. . .

  • Viewer Pane is used when compiling HTML documents from RMarkdown

. . .

  • Every tab can be minimised/maximised using the buttons on the top right
  • Window separators can be be moved to resize panes manually

Cheatsheet and Shortcuts

Help > Cheatsheets > RStudio IDE Cheat Sheet

Page 2 has lots of hints:

  • Ctrl + 1 places focus on the Script Window
  • Ctrl + 2 places focus on the Console
  • Ctrl + 3 places focus on the Help Tab

References

Chambers, John M. 1977. Computational Methods for Data Analysis. New York: Wiley.
———. 2020. “S, r, and Data Science.” Proc. ACM Program. Lang. 4 (HOPL): 1–17.