http://blackochrelabs.au/RAdelaide24
Introduction
Who Am I?
Stephen (Stevie) Pederson (They/Them)
- Adelaide, Kaurna Country, SA
- Bioinformatician, Black Ochre Data Labs, Telethon Kids Institute
. . .
- Bioinformatician, Dame Roma Mitchell Cancer Research Laboratories (2020-2022)
- Co-ordinator, UofA Bioinformatics Hub (2014-2020)
. . .
- PhD (2008-2018) was a Bayesian Model for Transcript-Level Analysis
- MCMC Engine written in C & R (No RStudio. No Rcpp)
- Best week ever: NAIDOC week, NB Awareness Week + R coding
Who Am I?
Stephen (Stevie) Pederson (They/Them)
- R User for ~20 years \(\implies\) learnt when R was difficult!
- Senior Author of 7 Bioconductor Packages
- ngsReports, extraChIPs, motifTestR, transmogR, strandCheckR, sSNAPPY, tadar
. . .
Made countless typos, horrible decisions and catastrophic errors
. . .
I crash R at least once a week…
Today’s Tutors
- Dr Jimmy Breen, Dr Liza Kretzschmar & Dr Alastair Ludington (Black Ochre Data Labs)
- Dr Paul Wang, Dr John Salamon (SAGC)
- Dr Na (Charlotte) Sai (University of Adelaide)
Housekeeping
- Toilets are back near the lifts
- Catering will be downstairs in the foyer
Thanks to everyone for sending your information through regarding dietary needs and existing knowledge
Homepage and Material
- The workshop homepage is http://blackochrelabs.au/RAdelaide24
- Data and course material available here
- Will stay live in perpetuity
- Links to notes available
- Slides are directly re-formatted as a simple webpage
- Slides are visible by clicking the RevealJS link below the TOC
- Group communication can be done through https://bioinformaticshubsa.slack.com/
- Join the #radelaide24 channel
Course Aims
- Provide a deep understanding of how to work with data in R
- Importing Data
- Visualising Data
- Understanding Data
- Enable use of modern analytic approaches
\(\implies\) reproducible research
- Not just how \(\implies\) a deep understanding of underlying structures
- The more code you type the more you learn
A Brief Introduction to R
Why use R?
- Heavily used for analysis of biological data (along with Python)
- Can handle extremely large datasets
- Packages explicitly designed for complex analysis
- Huge user base of biological researchers
- (Can be) very fast
- Very easy to dynamically interact with large datasets
- Can also run as static scripts on HPC clusters
I regularly work with data containing millions of lines
Why use R?
- Reproducible Research!!!
- Keep records (i.e. scripts) of every step of every analysis
- Transparent methods
- Integration with version control such as git
- Avoids common Excel pitfalls \(\implies\) we (almost) never modify files on disk!
. . .
Experience is the best teacher \(\implies\) please practice your skills
- Discuss column sorting with Simon although I believe it’s improved
What is R?
- Derivative of S (Chambers 1977)
- R first appeared in 1993
- Ross Ihaka and Robert Gentleman (U of Auckland)
- Disentangled some proprietary S code \(\implies\) open-source
- S ceased development in early 2000s (Chambers retired in 2005)
- Now estimated >2 million users
- Nice history article here (Chambers 2020)
- Ross Ihaka is of NZ Maori descent
- Last commercial release of S-Plus in 2007
What is R?
- Open source language
- No corporate ownership \(\implies\) free software
- Code is managed by the community of users
- R is formally run by a volunteer committee (R Core)
- Mostly academics
- John Chambers is still a member
- Annual release schedule + patches
- Most recent is R 4.4.1 (Jun 14)
- Being open source creates headaches for University & Business IT departments
- No guarantees of being virus free. Has inherent security flaws particularly using R data files (which are often part of packages)
- The community self-regulates
- Release Names are references to Peanuts cartoons
R Packages
- Packages are the key to R’s flexibility and power
- Collections (or libraries) of related functions
- ggplot2 \(\implies\) Generating plots
- edgeR \(\implies\) Differential Gene Expression (DGE) for RNA-Seq
R Packages
- \(>\) 16,000 packages are stored on CRAN (https://cran.r-project.org)
- Not curated for statistical quality or documentation
- Automated testing for successful installation
. . .
- Bioconductor is a secondary repository (https://www.bioconductor.org)
- \(>\) 2,200 packages with a more biological/genomics focus
- Curated for language consistency & documentation
- The gg in ggplot2 stands for “Grammar of Graphics”
- Crap packages are generally identified by the users and then just not used
- Statistical rigour is usually checked during review of the accompanying publication
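As a rough sketch of how packages from the two repositories are installed and then used: the install lines are commented out because installation is a one-off step and needs an internet connection, and the vector `vals` is made up purely for illustration.

```r
# One-off installation steps, shown commented out so nothing is downloaded here
# install.packages("ggplot2")       # from CRAN
# install.packages("BiocManager")   # helper package for Bioconductor
# BiocManager::install("edgeR")     # from Bioconductor

# Check whether a package is installed without attaching it
has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)
has_ggplot2

# Packages shipped with R, such as stats, are always available
library(stats)
vals <- c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up values
stats::sd(vals)                     # pkg::function() names the package explicitly
```

Writing `library()` calls in a script leaves a record of which packages an analysis depends on.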
Where is R used?
- Google, ATO, ABS etc
- Very large community of users in finance (Dirk Eddelbuettel - Rcpp)
- Genomics, Ecological Research, Public Health, Politics…
- Strong integration with HPC systems like Amazon, Hadoop
- Growing Machine Learning capacity
- Even has its own peer-reviewed Journal (The R Journal)
- I was offered a position by the ABS in 2022 specifically to use my R skills for analysing the housing market.
- BODL offered me a position the exact same day so I stayed in academia
Helpful Resources
Much of today is inspired by a two-day developers workshop I attended with Hadley Wickham. It also gave me an opportunity to have some great conversations with Winston Chang
Using R
The R Console
- This is often referred to as the R Console
- At its simplest R is just a calculator (Press Enter)
- R has many standard functions
- We place the value inside the brackets after the function name
I never use a calculator program on my laptop, always R
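A few lines that could be typed into the Console to try this out. The numbers are arbitrary and chosen only for illustration.

```r
# Basic arithmetic: type an expression and press Enter
1 + 1
2 * 8
10 / 4
2^3

# Functions take their input inside the brackets after the function name
sqrt(16)
log10(1000)
abs(-3)
round(3.14159, 2)   # some functions take more than one value
```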
The R Console
We can create objects with names
. . .
- We have just created an object called x
- The <- symbol is like an arrow i.e. “put the value 5 into x”
- Was a single key on keyboards in the 1970s
Object names can be almost anything, but should start with a letter, not a number or special character
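A minimal sketch of creating objects, assuming the value 5 and the name x described above. The other object names are hypothetical examples of the naming rule.

```r
# Put the value 5 into an object called x
x <- 5

# Names starting with a letter are fine
my_value <- 10
sample2 <- "b"

# A name starting with a number is a syntax error, so it is left commented out
# 2sample <- "b"
```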
The R Console
- View the contents of the object by entering its name in the Console
- The object x only exists in the R Environment
. . .
- We can pass objects to functions and perform operations on them
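For example, assuming x holds the value 5 as before, something like the following could be tried. The object y is an extra name introduced here for illustration.

```r
x <- 5

x            # entering the name prints the contents
x + 1        # operations use the stored value
x^2
sqrt(x)      # objects can be passed to functions

y <- x * 2   # results can be stored as new objects
y
x            # x itself is unchanged by any of the above
```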
The R Console
- Everything we’ve just done is trivial
- Real analysis isn’t
- If we perform a series of steps
- Should we keep a copy of what we’ve done?
- If so, how should we do that?
. . .
- A common strategy is to record our code as an R Script
- RStudio makes that easy & convenient
- Many people now use RMarkdown to combine analysis, results and figures
RStudio
Introduction to RStudio
R and RStudio are two separate but connected things
- R is like the engine of your car
. . .
- RStudio is the ‘cabin’ we use to control the engine
- Comes with extra features un-related to R that improve our ‘journey’
- Known as an IDE (Integrated Development Environment)
. . .
- R does all the calculations, manages the data, generates plots
- i.e. gets us to our destination
. . .
- RStudio helps manage our code, display the plots etc
- i.e. makes our journey easier to navigate
What is RStudio
- RStudio is a product of a for-profit company (Posit)
- RStudio (Desktop) is free
- RStudio Server has an annual licence fee of $’000s
- Posit employs many of the best & brightest package developers
- e.g. tidyverse, bookdown, reticulate, roxygen2 etc.
- The CEO (JJ Allaire) is still an active developer
- Other IDEs also exist (e.g. emacs, VSCode)
- I remember being at the launch of RStudio (Coventry, 2011). It was a room full of R programmers thinking “holy crap, this changes everything”
- RStudio/Posit is a corporation whilst R is an academic-led volunteer community. So far relatively good relationship
- Heard JJ Allaire present some of his latest work a few weeks ago
Some very helpful features of RStudio
- We can write scripts and execute code interactively
- Predictive auto-completion
- We can see everything we need (directories, plots, code, history etc.)
. . .
- Use R Projects to manage each analysis
- Integration with other languages
- markdown, \(\LaTeX\), bash, python, C++, git etc.
- Numerous add-ons to simplify larger tasks
Important Setup
- Create a directory on your computer for today’s material
- We recommend RAdelaide24 in your home directory
. . .
- Now open RStudio
- RStudio will always open in a directory somewhere
- Look in the Files pane (bottom-right) to see where it’s looking
- This is also the working directory for R
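The working directory can also be checked from the Console. A short sketch using base R functions only; the output will differ on every machine.

```r
getwd()             # the directory R is currently working in
basename(getwd())   # just the final folder name
list.files()        # files and folders R can see there
```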
. . .
We want RStudio to be looking in our new directory (RAdelaide24) \(\implies\) R Projects make this easy
Create an R Project
(Not needed for anyone using Posit Cloud)
File > New Project > Existing Directory
- Browse to your RAdelaide24 directory \(\implies\) Create Project
Create an R Project
- The R Project name is always the directory name
- Not essential, but good practice and extremely useful
- The Project Menu is in the top-right of RStudio
- R Projects are simply a wrapper for keeping an analysis organised
- Will always open in the R Project directory
- You can easily navigate to a directory with all scripts and data
- Makes managing file paths from your code very simple
- R Projects can be particularly helpful when loading external files
- Also when saving/exporting lots of files as part of your analysis
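A small sketch of why this helps with file paths: inside an R Project, paths can be written relative to the project directory. The folder data and the file my_data.csv are hypothetical names used only for illustration, so `file.exists()` will most likely return FALSE here.

```r
# Hypothetical layout: <project directory>/data/my_data.csv
rel_path <- file.path("data", "my_data.csv")
rel_path

# Check whether the file is really there before trying to load it
file.exists(rel_path)

# The full path R would use, built from the working directory
normalizePath(rel_path, mustWork = FALSE)
```

Because the path is relative, the same script keeps working if the whole project directory is moved or shared.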
Create An Empty R Script
File > New File > R Script
- Save As DataImport.R
RStudio
This is the basic layout we often work with
The Script Window
- This is just a text editor.
- We enter our commands here but they are not executed
- Forms a record of everything we’ve done
- Can repeat our analysis exactly
- We’ll return here later \(\implies\) but first a quick tour
The R Console
- This is the R Console within the RStudio IDE
- We’ve already explored this briefly
. . .
- In the same grouping we also have Terminal
- An approximation of a bash terminal (or PowerShell for Windows)
. . .
- Background Jobs shows progress when compiling RMarkdown & Quarto
The R Console
As well as performing simple calculations:
- R has what we call an Environment (i.e. a Workspace)
- We can define objects here or import data
- Similar to a workbook in Excel with multiple worksheets
- Much more flexible & powerful
- Objects aren’t forced to be spreadsheets
- When we create a new sheet in Excel, we’re actually creating an object.
- Most commonly, it’s named Sheet1 or something similar
- Has fixed dimensions for memory management
The R Environment
Like we did earlier, in the R Console type:
. . .
Where have we created the object x?
- Is it on your hard drive somewhere?
- Is it in a file somewhere?
. . .
- We have placed x in our R Environment
- Formally known as your Global Environment
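A minimal sketch for exploring this, assuming the object created was x with the value 5 as earlier in the session:

```r
x <- 5

ls()                           # names of the objects in the current environment
exists("x")                    # is there an object called x?
environmentName(globalenv())   # the formal name of the Global Environment
```

Nothing has been written to disk at this point. The object only lives in memory for the current R session.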

The R Environment
- The R Environment is like your desktop
- We keep all our relevant objects here
- Multiple objects are usually created during an analysis
- Can save all the objects in your environment as a single .RData file
- R can be set to automatically save your environment on exit
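One way this could look in practice. The sketch saves to a temporary file (an assumption made so nothing in your project is overwritten) and uses `save()` on named objects, whereas `save.image()` would save the whole environment.

```r
x <- 5
y <- c(1, 2, 3)

# A temporary file, so nothing in the project directory is touched
tmp <- tempfile(fileext = ".RData")

save(x, y, file = tmp)   # write the chosen objects to disk
rm(x, y)                 # remove them from the environment

loaded <- load(tmp)      # restore them; load() returns the object names
loaded
x
```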
The History Tab
- Next to the Environment Tab is the History Tab
- Keeps a record of the last ~200 lines of code
- Very useful for remembering steps during exploration
- Best practice is to enter + execute code from the Script Window
. . .
- We can generally ignore the Connections and any other tabs
- A git tab will also appear for those who use git in their project
Accessing Help
- This will take you to the Help pane for the sqrt() function
- Contents may look confusing at this point but will become clearer
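The command that opens this help page is not shown above. Two standard ways of asking for help on a function, which is presumably what was typed, are:

```r
?sqrt          # shorthand for opening a help page
help("sqrt")   # the same request written as a function call
```

In RStudio both open the page in the Help pane. In a plain R session the page is shown as text instead.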
. . .
- Many inbuilt functions are organised into a package called base
- Packages group similar/related functions together
- base is always installed and loaded with R
- Click on the underlined word Index at the bottom for a list of functions in the base package
- Absolutely no need to learn any of these
- May be issues with URL '/help/library/base/html/00Index.html' not found
- The examples in this help page are a bit rubbish…
Additional Sources For Help
- Help pages in R can be hit & miss
- Some are excellent and informative \(\implies\) some aren’t
- Bioconductor has a support forum for Bioconductor packages
- All packages have a vignette (again varying quality)
. . .
- Google is your friend \(\implies\) maybe ChatGPT?
As a package author, I’m always reading my own help pages. I simply can’t remember everything I’ve written
The Plots Pane
- We’ve already seen the Files pane
- Plots appear in the Plots pane
Other Panes
- The Packages Pane is a bad idea
- Can be disabled by popular request (I always do)
- Temptation to click is strong
- Very bad for reproducible research!!!
. . .
- Viewer Pane is used when compiling HTML documents from RMarkdown
. . .
- Every tab can be minimised/maximised using the buttons on the top right
- Window separators can be moved to resize panes manually
Cheatsheet and Shortcuts
Help > Cheatsheets > RStudio IDE Cheat Sheet
Page 2 has lots of hints:
- Ctrl + 1 places focus on the Script Window
- Ctrl + 2 places focus on the Console
- Ctrl + 3 places focus on the Help Tab
References
Chambers, John M. 1977. Computational Methods for Data Analysis. New York: Wiley.
Chambers, John M. 2020. “S, R, and Data Science.” Proc. ACM Program. Lang. 4 (HOPL): 1–17.