The Bioconductor Project

RAdelaide 2024

Dr Stevie Pederson

Black Ochre Data Labs
Telethon Kids Institute

July 11, 2024

Today’s Outline

  • Not just to show how to perform an RNA-Seq analysis
  • Understanding common file types
  • Understanding Bioconductor objects, classes & methods
    • Has relevance beyond Bioconductor (e.g. map visualisation)
  • Key concepts & resources
  • Please ask questions!!! It’s the advantage of being here…

CRAN

  • A Package is a collection of functions
  • Associated with a given task/analysis/data-type
  • The main repository is “The Comprehensive R Archive Network” aka CRAN
  • We install packages using Tools > Install Packages...
    • Or by typing install.packages("pkg_name")
  • By default, this will only install packages from CRAN

CRAN

  • Packages on CRAN cover everything & anything
    • GIS-based spatial data, stock trading, data visualisation etc
  • Some packages are biological in focus, e.g. Seurat for scRNA-Seq
  • Submission involves passing technical checks
    • Tests across macOS, Windows & Linux
    • Correct directory/file structures

Packages

  • Many packages are written by the authors for the authors
  • Decide to make them public in case others find them helpful
  • Both Chris & I use ngsReports nearly every day (still…)

Bioconductor

  • Bioconductor hosts packages focussed on biological research
    • www.bioconductor.org
    • Currently > 2,400 packages
  • Created by Robert Gentleman (one of the “R”s in R) in 2001
  • Packages checked for programming consistency \(\implies\) not methodology
    • Expected to integrate with other Bioconductor packages
  • All packages require a vignette explaining how to use the package
    • Checked manually for clarity/helpfulness during package submission
  • Packages tested nightly on macOS, Windows, Debian Linux + Arch Linux

Bioconductor

  • Essentially represents a community of developers & users
  • Strongly encourages & supports diversity
    • No longer on Twitter \(\implies\) not a safe space
  • Has a Code of Conduct
  • Consciously supporting those from emerging countries
    • Actively trying to translate resources to other languages
  • Also tries to find ways to engage with & support both developers & users
    • support.bioconductor.org & slack.bioconductor.org
    • Members of R Core regularly post on the Bioconductor slack

Bioconductor Structure

  • Core Team
    • All employed by Bioconductor
    • Primarily grant funded (NIH, NSF, CZI)
  • Scientific Advisory Board (SAB)
    • Meet Annually.
    • External and Internal leaders who act as project advisors.
  • Technical Advisory Board (TAB)
    • Meet monthly
    • Consider technical aspects of core infrastructure and scientific direction of the project.
    • 15 members, 3 year term. Annual, open elections to rotate members.

Bioconductor Packages

  • R generally has annual releases (R 4.4.0 April 24th, 2024)
    • Patch-fixes as needed \(\implies\) release 4.4.1 (14th Jun, 2024)
  • CRAN packages continually update

Bioconductor Packages

  • Bioconductor has two releases per year
  • Tested against a set R version (or later)
  • Bug-fixes also released as needed
    • Generally new features added to packages at Bioc releases
  • Latest Bioconductor release is 3.19 (May 1st 2024)
    • Was tested on R 4.4.0
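
A minimal sketch of checking whether the current R session meets the version a Bioconductor release was tested on. The version numbers are those quoted above; the variable names are made up for illustration and only base R is used.

```r
# R version that Bioconductor 3.19 was tested on (taken from the slide above)
tested_r <- package_version("4.4.0")

# Version of the R session running this code
current_r <- getRversion()

# Version objects compare component by component, not as text
meets_requirement <- current_r >= tested_r
message(
  "Running R ", format(current_r),
  "; at or above ", format(tested_r), ": ", meets_requirement
)

# Patch releases sort after the initial release
package_version("4.4.1") > package_version("4.4.0")

# Comparing as versions avoids the trap of comparing as strings
package_version("4.10.0") > package_version("4.9.0")  # TRUE
"4.10.0" > "4.9.0"                                    # FALSE (text comparison)
```

Comparing with `package_version()` rather than plain strings is the safer habit, since text comparison breaks as soon as a component reaches two digits.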

Bioconductor Package Installation

  • BiocManager is a CRAN package
  • Enables installation from CRAN and Bioconductor
    • Also handles package installation from GitHub
BiocManager::install(c("pkg1", "user/pkg2"))

Bioconductor Packages

  • Many packages for specific analyses
    • DESeq2 & edgeR for bulk RNA-Seq Analysis
    • DiffBind & extraChIPs for ChIP-Seq Analysis
    • fgsea for GSEA within R
  • Also multiple packages define object classes & general methods
    • e.g. GenomicRanges for working with GRanges objects
    • Is a foundational class many other packages build on
  • New packages are expected to use existing classes where possible

Bioconductor Packages

Taken from https://carpentries-incubator.github.io/bioc-project/02-introduction-to-bioconductor.html

Bioconductor Packages

  • The output from body(function_name) always has comments removed
  • All Bioconductor packages have the source code available
    • https://code.bioconductor.org
  • Code is always written as a set of R Scripts
  • Inside the R directory
    • Will retain comments & formatting from the authors
    • Sometimes more helpful…
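
A small sketch of why comments go missing when inspecting a function with body(). The function `add_one()` is a made-up example, not taken from any package.

```r
# A toy function with a comment inside its body
add_one <- function(x) {
  # This comment explains the next line
  x + 1
}

# body() returns the parsed expression; deparse() turns it back into text.
# Comments are not part of the parsed expression, so they are not rebuilt.
body_lines <- deparse(body(add_one))
body_lines

# Check whether any comment character survived
any(grepl("#", body_lines, fixed = TRUE))
```

The original source files keep the comments, which is why browsing the R directory of a package can be more informative than inspecting the function in a session.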

BiocViews

  • BiocViews provide an overview of all Bioconductor packages
  • Can be very helpful when looking for a resource
  • Most packages are software (for analysis)
  • Also annotation packages:
    • Genome sequences, gene to GO mappings etc
  • Experimental data for demonstrating workflows
  • Workflow packages are slowly growing

browseVignettes()

Object Classes

R has two common types of objects

  • Built on top of (and including) vectors, lists etc.
  • S3 are very common & old (inherited from the S language, which began in the 1970s)
    • Usually list-type objects e.g. results from lm() or t.test()
  • S4 introduced in ’90s
    • Focus on Object-Oriented Programming (OOP)
  • Bioconductor packages rely heavily on S4 objects
    • Also common in spatial packages on CRAN (e.g. making maps etc)

Objects and Methods

  • Functions can be written to handle different types of input data
  • Figuring out which version of the function to use
    \(\implies\) method dispatch
  • A good example is the function summary()
    • Will return different results for a vector or data.frame
summary(letters)
summary(cars)

How does summary() know what to do for different data structures?

Objects and Methods

  • If we try to look at the code used in summary() it’s a bit odd
body(summary)
UseMethod("summary")
  • summary() uses different methods depending on the object class
  • Sometimes they’re hidden (I don’t know why…)

Objects and Methods

methods(summary)
 [1] summary.aov                         summary.aovlist*                   
 [3] summary.aspell*                     summary.check_packages_in_dir*     
 [5] summary.connection                  summary.data.frame                 
 [7] summary.Date                        summary.default                    
 [9] summary.ecdf*                       summary.factor                     
[11] summary.glm                         summary.infl*                      
[13] summary.lm                          summary.loess*                     
[15] summary.manova                      summary.matrix                     
[17] summary.mlm*                        summary.nls*                       
[19] summary.packageStatus*              summary.POSIXct                    
[21] summary.POSIXlt                     summary.ppr*                       
[23] summary.prcomp*                     summary.princomp*                  
[25] summary.proc_time                   summary.rlang_error*               
[27] summary.rlang_message*              summary.rlang_trace*               
[29] summary.rlang_warning*              summary.rlang:::list_of_conditions*
[31] summary.srcfile                     summary.srcref                     
[33] summary.stepfun                     summary.stl*                       
[35] summary.table                       summary.tukeysmooth*               
[37] summary.warnings                   
see '?methods' for accessing help and source code

Objects and Methods

  • summary.data.frame()
    \(\implies\) used when summary() is called on a data.frame
  • summary.lm()
    \(\implies\) for an object of class lm (produced by lm())
  • summary.prcomp()
    \(\implies\) for an object of class prcomp (produced by prcomp())
  • If no method is written for a class \(\implies\) summary.default()
  • Look inside this using body(summary.default)
    • The last couple of lines were the output from summary(letters)
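
The same mechanism can be sketched with a home-made generic. `describe()` and its methods are hypothetical names invented for this illustration; only `UseMethod()` and the naming convention come from R itself.

```r
# A generic does nothing except hand the call over to UseMethod()
describe <- function(x, ...) UseMethod("describe")

# Fallback used when no class-specific method exists
describe.default <- function(x, ...) {
  paste("An object of class", class(x)[1], "with length", length(x))
}

# Method chosen automatically for data.frame objects
describe.data.frame <- function(x, ...) {
  paste("A data.frame with", nrow(x), "rows and", ncol(x), "columns")
}

describe(letters)  # no describe.character(), so describe.default() is used
describe(cars)     # dispatches to describe.data.frame()
```

Calling `body(describe)` on this toy generic shows the same one-line `UseMethod()` call seen for `summary()`.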

Objects and Methods

  • Can also see what methods exist for a given class
  • Before loading any packages ~56 methods exist for a data.frame
methods(class = "data.frame")
 [1] [             [[            [[<-          [<-           $<-           aggregate    
 [7] anyDuplicated anyNA         as.data.frame as.list       as.matrix     as.vector    
[13] by            cbind         coerce        dim           dimnames      dimnames<-   
[19] droplevels    duplicated    edit          format        formula       head         
[25] initialize    is.na         Math          merge         na.exclude    na.omit      
[31] Ops           plot          print         prompt        rbind         row.names    
[37] row.names<-   rowsum        show          slotsFromS3   sort_by       split        
[43] split<-       stack         str           subset        summary       Summary      
[49] t             tail          transform     type.convert  unique        unstack      
[55] within        xtfrm        
see '?methods' for accessing help and source code

Objects and Methods

  • Loading a new package will often introduce new methods
library(tidyverse)
methods(class = "data.frame")
  • Now we have ~170 methods for a data.frame

Objects and Methods

  • Most classes have a print() method
  • Determines what to print to the screen when calling an object
  • Most common use case for me is print(my_tbl, n = 20)
    • Can use to override the default number of rows printed
    • Calls print.tbl (which is hidden)
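
As a rough sketch of how a print() method with an `n` argument might be written, here is a toy S3 class. The class `readings` and its constructor are invented for this example and are not how tibbles are implemented.

```r
# Constructor for a toy S3 class holding a numeric vector
new_readings <- function(x) structure(list(values = x), class = "readings")

# print() method for the class: shows only the first n values
print.readings <- function(x, n = 5, ...) {
  shown <- head(x$values, n)
  cat("<readings> showing", length(shown), "of", length(x$values), "values\n")
  cat(shown, "\n")
  invisible(x)  # print methods conventionally return their input invisibly
}

r <- new_readings(1:20)
r                 # typing the object name calls print.readings() with n = 5
print(r, n = 10)  # an explicit call lets us override the default
```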

S3 Objects

  • Everything we’ve just seen applies to S3 objects
  • Very common class type (data.frame, list, htest, lm etc)
  • Sometimes classes have an explicit hierarchy
    • Best shown using is() instead of class()
is(band_members)
[1] "tbl_df"     "tbl"        "data.frame" "list"       "oldClass"   "vector"    
  • R looks for print.tbl_df() \(\rightarrow\) print.tbl() \(\rightarrow\) print.data.frame() etc
  • Will use the first one found
  • If none found \(\implies\) print.default()
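
The search order can be sketched with a toy hierarchy. The generic `speak()` and the classes `dog`, `cat` and `animal` are hypothetical; the point is only that R walks the class vector from left to right.

```r
speak <- function(x, ...) UseMethod("speak")

# Used when nothing in the class vector has a method
speak.default <- function(x, ...) "(silence)"

# Method for the parent class
speak.animal <- function(x, ...) "generic animal noise"

# Method for a child class, which also calls the next method in line
speak.dog <- function(x, ...) {
  inherited <- NextMethod()  # moves on to speak.animal()
  paste("Woof, then", inherited)
}

rex <- structure(list(), class = c("dog", "animal"))
tabby <- structure(list(), class = c("cat", "animal"))

speak(rex)    # speak.dog() is found first
speak(tabby)  # no speak.cat(), so speak.animal() is used
speak(1:3)    # no matching class at all, so speak.default() is used
inherits(rex, "animal")
```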

S4 Objects

Many Bioconductor Packages define S4 objects

  • Very strict controls on data structure
  • Can be frustrating at first
  • Use the @ symbol for “slots” as well as $ for list elements
    • Slots are strictly defined components
  • Methods are also strictly defined
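
A minimal sketch of what "strict controls" means in practice, using a made-up class `GeneSet` defined with the methods package that ships with R. No Bioconductor package is needed for this illustration.

```r
library(methods)

# Define a class with two typed slots and a validity check
setClass(
  "GeneSet",
  slots = c(name = "character", genes = "character"),
  validity = function(object) {
    if (length(object@name) != 1) return("'name' must be a single string")
    TRUE
  }
)

# A valid object
gs <- new("GeneSet", name = "toy", genes = c("A", "B"))
gs@genes

# A slot of the wrong type is rejected when the object is created
wrong_type <- tryCatch(
  new("GeneSet", name = "toy", genes = 1:3),
  error = function(e) conditionMessage(e)
)

# So is an object that fails the validity function
invalid_name <- tryCatch(
  new("GeneSet", name = c("a", "b"), genes = "A"),
  error = function(e) conditionMessage(e)
)
wrong_type
invalid_name
```

Both failures happen at construction time, which is the frustrating part at first and the reassuring part later.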

S4 Objects

  • Can be a little more challenging to interact with the tidyverse
  • Bioconductor pre-dates the tidyverse by > 10 years
  • tidyomics is an active area of Bioconductor development
    • Led by Stefano Mangiola from SAiGENCI

S4 Objects

  • Some packages use S4 implementations of S3 objects
    • data.frame (S3) Vs DataFrame (S4)
    • list (S3) Vs List (S4)
    • vector (S3) Vs Vector (S4)
    • rle (S3) Vs Rle (S4)
  • Many are written for memory efficiency
  • Look and behave similarly, but can occasionally trip you up
    • Object may require a DataFrame and you have a data.frame
    • Coercion is usually relatively simple between base-level classes

Rle Vectors

  • These are Run-Length Encoded vectors
library(S4Vectors)
test <- c(rep("X", 10), rep("Y", 5))
test
 [1] "X" "X" "X" "X" "X" "X" "X" "X" "X" "X" "Y" "Y" "Y" "Y" "Y"
Rle(test)
character-Rle of length 15 with 2 runs
  Lengths:  10   5
  Values : "X" "Y"
  • Can encode the chromosome names for millions of ranges with minimal memory
    • Sorting can help keep memory usage down
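
The idea can be tried without any Bioconductor packages using the base R function `rle()`, the S3 counterpart of `Rle()`. This is only a sketch of the encoding itself; the chromosome names are made up.

```r
test <- c(rep("X", 10), rep("Y", 5))

# Base R run-length encoding: one length and one value per run
runs <- rle(test)
runs$lengths
runs$values

# The encoding is lossless
identical(inverse.rle(runs), test)

# Sorting groups identical values into fewer, longer runs
alternating <- rep(c("chr1", "chr2"), times = 1000)
length(rle(alternating)$lengths)        # one run per element
length(rle(sort(alternating))$lengths)  # one run per chromosome
```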

Variations on data.frame Objects

data.frame Objects

  • data.frame
    1. Can set rownames
    2. Dumps all data to your screen
    3. Column names with spaces are altered by default (check.names = TRUE)
  • tibble aka tbl_df
    1. rownames are always 1:nrow(df)
    2. Prints a summary with rownames hidden
    3. Column names with spaces permitted
    4. My preferred data.frame type
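
A short base R sketch of how data.frame() treats column names containing spaces. The column name `gene name` is made up for the example.

```r
# By default, data.frame() repairs names so they are syntactically valid
df_default <- data.frame(`gene name` = c("A", "B"))
names(df_default)

# Switching the check off keeps the space
df_spaces <- data.frame(`gene name` = c("A", "B"), check.names = FALSE)
names(df_spaces)

# Columns with spaces then need backticks or [[ ]] to access
df_spaces$`gene name`
df_spaces[["gene name"]]

# Row names can be set on a data.frame
rownames(df_spaces) <- c("g1", "g2")
df_spaces["g2", "gene name"]
```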

DataFrame objects

?DataFrame
  • An S4 version
    • Doesn’t work with the tidyverse (dplyr, ggplot2 etc)
    • Still missing from tidyomics
  • Until 2021 couldn’t coerce to a tibble directly
  • I hated that so wrote as_tibble() for DataFrame objects
    • In the package extraChIPs
    • Enables passing S4 objects to ggplot()
    • Please test & find any bugs I haven’t found yet

DataFrame objects

  • S3 Methods from dplyr will not work on DataFrame objects
  • Some equivalents exist (most pre-date the tidyverse)
    • subset() pre-dates dplyr::filter()
    • rbind() and combineRows() \(\implies\) bind_rows()
    • cbind(), combineCols() and merge() \(\implies\) joins
    • sort() \(\implies\) arrange()
    • unique() \(\implies\) distinct()
  • No simple equivalent for mutate(), summarise(), across(), pivot_*()

DataFrame objects

  • Can have columns of lists (so can tbl_df objects)
    • e.g. a CharacterList() from IRanges
    • S4 lists can be typed \(\implies\) memory efficiency
    • List objects can exist in a compressed form \(\implies\) memory efficiency
  • DataFrame objects can have S4 objects as columns
    • S3 data frames (including tibbles) cannot

DataFrame objects

library(IRanges)
genes <- c("A", "B")
transcripts <- CharacterList(
  c("A1", "A2", "A3"), c("B1", "B2")
)
transcripts
CharacterList of length 2
[[1]] A1 A2 A3
[[2]] B1 B2
DF <- DataFrame(Gene = genes, Transcripts = transcripts)
DF
DataFrame with 2 rows and 2 columns
         Gene     Transcripts
  <character> <CharacterList>
1           A        A1,A2,A3
2           B           B1,B2
library(extraChIPs)
as_tibble(DF)
# A tibble: 2 × 2
  Gene  Transcripts
  <chr> <list>     
1 A     <chr [3]>  
2 B     <chr [2]>  

DataFrame objects

  • Object-level metadata can be added to DataFrame objects
  • Must be a list
metadata(DF) <- list(details = "Created for RAdelaide 2024")
glimpse(DF)
Formal class 'DFrame' [package "S4Vectors"] with 6 slots
  ..@ rownames       : NULL
  ..@ nrows          : int 2
  ..@ elementType    : chr "ANY"
  ..@ elementMetadata: NULL
  ..@ metadata       :List of 1
  .. ..$ details: chr "Created for RAdelaide 2024"
  ..@ listData       :List of 2
  .. ..$ Gene       : chr [1:2] "A" "B"
  .. ..$ Transcripts:Formal class 'CompressedCharacterList' [package "IRanges"] with 5 slots
  • Notice where rownames are and how the “columns” are stored

DataFrame objects

  • Also enable the addition of column-specific metadata \(\implies\) mcols()
    • Used heavily in Bioconductor
mcols(DF) <- DataFrame(meta =  c("Made-up genes", "Made-up transcripts"))
mcols(DF)
DataFrame with 2 rows and 1 column
                           meta
                    <character>
Gene              Made-up genes
Transcripts Made-up transcripts
glimpse(DF) # This is in the @elementMetadata slot
Formal class 'DFrame' [package "S4Vectors"] with 6 slots
  ..@ rownames       : NULL
  ..@ nrows          : int 2
  ..@ elementType    : chr "ANY"
  ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
  ..@ metadata       :List of 1
  .. ..$ details: chr "Created for RAdelaide 2024"
  ..@ listData       :List of 2
  .. ..$ Gene       : chr [1:2] "A" "B"
  .. ..$ Transcripts:Formal class 'CompressedCharacterList' [package "IRanges"] with 5 slots

S4 Object Structure

  • S4 objects have slots denoted with @
    • Subtly different to list elements
  • These are fixed and mandatory for every S4 class
    • Can still be empty (NULL) objects
    • Can be S3 or S4 objects
  • Enforces a strict structure with checks as objects formed
    • Saves time performing checks within functions
    • Makes structure strict, rigid & hard to break

S4 Object Structure

  • We can’t lapply our way through these objects
  • Can access each slot using the shortcut object@slotName
    • More formally using slot(object, "slotName")
DF@listData
$Gene
[1] "A" "B"

$Transcripts
CharacterList of length 2
[[1]] A1 A2 A3
[[2]] B1 B2
slot(DF, "listData")
$Gene
[1] "A" "B"

$Transcripts
CharacterList of length 2
[[1]] A1 A2 A3
[[2]] B1 B2
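
The same slot access can be tried on a toy class without loading S4Vectors. `CountSample` is a hypothetical class written for this sketch; it also shows that replacing a slot is type-checked.

```r
library(methods)

# A toy S4 class with two typed slots
setClass("CountSample", slots = c(id = "character", counts = "numeric"))
s <- new("CountSample", id = "S1", counts = c(10, 0, 5))

# Two equivalent ways of reading a slot
s@counts
slot(s, "counts")

# Two equivalent ways of replacing a slot
s@id <- "S2"
slot(s, "counts") <- c(1, 2, 3)

# Replacement values must match the declared slot type
type_error <- tryCatch(
  {
    s@counts <- "many"
    "no error"
  },
  error = function(e) conditionMessage(e)
)
type_error

# An S4 object is not a list, even though it holds several components
isS4(s)
is.list(s)
```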

S4 Object Structure

  • Slot names can be found using slotNames(object)
slotNames(DF)
[1] "rownames"        "nrows"           "elementType"     "elementMetadata" "metadata"       
[6] "listData"       
  • The full type description can also be found
getSlots("DFrame")
           rownames               nrows         elementType     elementMetadata            metadata 
"character_OR_NULL"           "integer"         "character" "DataFrame_OR_NULL"              "list" 
           listData 
             "list" 

S4 Methods

  • S3 method dispatch uses the generic.class naming convention (e.g. summary.lm)
  • S4 is very different but has some similarities
  • S4 objects almost always have hierarchical classes
    • Increasingly common for S3 objects
    • Each level extends the lower level class
  • Methods are strictly defined by package authors
    • A Generic function must be defined for each method/class
    • The hierarchy is traversed until a method is found
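
A minimal sketch of these points using two made-up classes and a made-up generic, `describeShape()`. It assumes only the methods package and is not taken from any Bioconductor package.

```r
library(methods)

# A parent class and a child class that extends it
setClass("Shape", slots = c(label = "character"))
setClass("Circle", contains = "Shape", slots = c(radius = "numeric"))

# The generic must exist before methods can be attached to it
setGeneric("describeShape", function(x, ...) standardGeneric("describeShape"))

# A method defined only for the parent class
setMethod("describeShape", "Shape", function(x, ...) paste("Shape:", x@label))

circ <- new("Circle", label = "c1", radius = 2)
is(circ)           # the class hierarchy, most specific first
is(circ, "Shape")

# No Circle method yet, so the hierarchy is traversed up to Shape
first_result <- describeShape(circ)
first_result

# Adding a more specific method changes which one is found first
setMethod("describeShape", "Circle", function(x, ...) {
  paste(callNextMethod(), "with radius", x@radius)
})
describeShape(circ)
```

`callNextMethod()` plays a similar role to `NextMethod()` in S3, passing the call on to the method for the parent class.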

S4 Methods

  • To check the class hierarchy of an object use is()
is(DF)
 [1] "DFrame"            "DataFrame"         "SimpleList"        "RectangularData"  
 [5] "List"              "DataFrame_OR_NULL" "Vector"            "list_OR_List"     
 [9] "Annotated"         "vector_OR_Vector" 
  • Logical tests can be applied
is(DF, "DataFrame")
[1] TRUE
is(DF, "data.frame")
[1] FALSE
methods(class = "DataFrame")

S4 Methods

  • The function body() will return standardGeneric()
    • Slightly different to UseMethod()
  • To show code within a function
getMethod(f = "nrow", signature = "DataFrame")
Method Definition:

function (x) 
x@nrows
<bytecode: 0x55649b071c30>
<environment: namespace:S4Vectors>

Signatures:
        x          
target  "DataFrame"
defined "DataFrame"
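
A small sketch of inspecting S4 generics and methods on toy classes. `BaseItems`, `DerivedItems` and `itemCount()` are hypothetical names; the inspection functions are from the methods package.

```r
library(methods)

setClass("BaseItems", slots = c(n = "integer"))
setClass("DerivedItems", contains = "BaseItems")

setGeneric("itemCount", function(x) standardGeneric("itemCount"))
setMethod("itemCount", "BaseItems", function(x) x@n)

# The generic itself only contains the dispatch call
body(itemCount)

# existsMethod() only looks for a method defined directly for the class
existsMethod("itemCount", "DerivedItems")

# hasMethod() also accepts inherited methods
hasMethod("itemCount", "DerivedItems")

# selectMethod() returns the method dispatch would use, inherited or not
inherited_method <- selectMethod("itemCount", "DerivedItems")
body(inherited_method)

d <- new("DerivedItems", n = 3L)
itemCount(d)
```

When getMethod() reports that no method is defined for a class, selectMethod() is usually the next thing to try, since the method may simply be inherited from a parent class.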

Recap

  • The Bioconductor Project extends back to the early days of R
    • Also the early days of bioinformatics
    • Has a genuine community aspect . . .
  • S4 object classes are common
    • Foundational classes enabling package inter-operability
    • Much less common in CRAN packages (mainly spatial/GIS)
  • Method dispatch is handled differently
  • Can play badly with the tidyverse
    • An area of active development