The Bioconductor Project

RAdelaide 2024

Author
Affiliation

Dr Stevie Pederson

Black Ochre Data Labs
Telethon Kids Institute

Published

July 11, 2024

The Bioconductor Project

Today’s Outline

  • Not just to show how to perform an RNA-Seq analysis
  • Understanding common file types
  • Understanding Bioconductor objects, classes & methods
    • Has relevance beyond Bioconductor (e.g. map visualisation)
  • Key concepts & resources

. . .

  • Please ask questions!!! It’s the advantage of being here…

CRAN

  • A Package is a collection of functions
  • Associated with a given task/analysis/data-type
  • The main repository is “The Comprehensive R Archive Network” aka CRAN

. . .

  • We install packages using Tools > Install Packages...
    • Or by typing install.packages("pkg_name")
  • Will only install packages from CRAN

CRAN

  • Packages on CRAN cover everything & anything
    • GIS-based spatial data, stock trading, data visualisation etc
  • Some packages are biological in focus, e.g. Seurat for scRNA

. . .

  • Submission involves passing technical checks
    • Tests across OSX, Windows & Linux
    • Correct directory/file structures

Packages

  • Many packages are written by the authors for the authors
  • Decide to make them public in case others find them helpful

. . .

  • Both Chris & I use ngsReports nearly every day (still…)

Bioconductor

  • Bioconductor hosts packages focussed on biological research
    • www.bioconductor.org
    • Currently > 2,400 packages
  • Created by Robert Gentleman (one of the “R”s in R) in 2001

. . .

  • Packages checked for programming consistency \(\implies\) not methodology
    • Expected to integrate with other Bioconductor packages
  • All packages require a vignette explaining how to use the package
    • Checked manually for clarity/helpfulness during package submission
  • Packages tested nightly on OSX, Windows, Debian Linux + Arch Linux

Bioconductor

  • Essentially represents a community of developers & users
  • Strongly encourages & supports diversity
    • No longer on Twitter \(\implies\) not a safe space
  • Has a Code of Conduct
  • Consciously supporting those from emerging countries
    • Actively trying to translate resources to other languages
  • Also tries to find ways to engage with & support both developers & users
    • support.bioconductor.org & slack.bioconductor.org
    • Members of R Core regularly post on the Bioconductor slack

Bioconductor Structure

  • Core Team
    • All employed by Bioconductor
    • Primarily grant funded (NIH, NSF, CZI)
  • Scientific Advisory Board (SAB)
    • Meet Annually.
    • External and Internal leaders who act as project advisors.
  • Technical Advisory Board (TAB)
    • Meet monthly
    • Consider technical aspects of core infrastructure and scientific direction of the project.
    • 15 members, 3 year term. Annual, open elections to rotate members.

Bioconductor Packages

  • R generally has one major release per year (R 4.4.0 April 24th, 2024)
    • Patch-fixes as needed \(\implies\) release 4.4.1 (14th Jun, 2024)
  • CRAN packages continually update

Bioconductor Packages

  • Bioconductor has two releases per year
  • Tested to a set R version (or later)
  • Bug-fixes also released as needed
    • Generally new features added to packages at Bioc releases
  • Latest Bioconductor release is 3.19 (May 1st 2024)
    • Was tested on R 4.4.0

Bioconductor Package Installation

  • BiocManager is a CRAN package
  • Enables installation from CRAN and Bioconductor
    • Also handles package installation from github
BiocManager::install(c("pkg1", "user/pkg2"))
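Scripts often check what is missing before installing anything. A minimal sketch of that pattern, assuming a made-up helper called missing_pkgs(); the install calls are left commented out so the block runs offline:

```r
# Hypothetical helper: report which packages are not yet installed
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Both of these ship with R, so nothing should be reported as missing
to_install <- missing_pkgs(c("stats", "utils"))
to_install

# Install BiocManager from CRAN once, then let it handle everything else
# if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
# BiocManager::install(to_install)
```

The helper only suits CRAN or Bioconductor package names; a GitHub "user/pkg" string would need the package name extracted first.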

Bioconductor Packages

  • Many packages for specific analyses
    • DESeq2 & edgeR for bulk RNA-Seq Analysis
    • DiffBind & extraChIPs for ChIP-Seq Analysis
    • fgsea for GSEA within R

. . .

  • Also multiple packages define object classes & general methods
    • e.g. GenomicRanges for working with GRanges objects
    • Is a foundational class many other packages build on
  • New packages are expected to use existing classes where possible

Bioconductor Packages

[Figure: Bioconductor packages across the stages of a sequencing workflow: Sequencing, Alignment, Reduction, Analysis, Integration & Visualization]

Taken from https://carpentries-incubator.github.io/bioc-project/02-introduction-to-bioconductor.html

Bioconductor Packages

  • The output from body(function_name) always has comments removed
  • All Bioconductor packages have the source code available
    • https://code.bioconductor.org
  • Code is always written as a set of R Scripts
  • Inside the R directory
    • Will retain comments & formatting from the authors
    • Sometimes more helpful…

BiocViews

  • BiocViews provide an overview of all Bioconductor packages
  • Can be very helpful when looking for a resource
  • Most packages are software (for analysis)
  • Also annotation packages:
    • Genome sequences, gene to GO mappings etc
  • Experimental data for demonstrating workflows
  • Workflow packages are slowly growing

. . .

browseVignettes()

Object Classes

Object Classes

R has two common types of objects

  • Built on top of (and including) vectors, lists etc.
  • S3 are very common & old (the S language dates from the 1970s)
    • Usually list-type objects e.g. results from lm() or t.test()
  • S4 introduced in ’90s
    • Focus on Object-Oriented Programming (OOP)
  • Bioconductor packages rely heavily on S4 objects
    • Also common in spatial packages on CRAN (i.e. making maps etc)

Objects and Methods

  • Functions can be written to handle different types of input data
  • Figuring out which version of the function to use
    \(\implies\) method dispatch

. . .

  • A good example is the function summary()
    • Will return different results for a vector or data.frame
summary(letters)
summary(cars)

. . .

How does summary() know what to do for different data structures?

Objects and Methods

  • If we try to look at the code used in summary() it’s a bit odd
body(summary)
UseMethod("summary")

. . .

  • summary() uses different methods depending on the object class
  • Sometimes they’re hidden (I don’t know why…)
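Writing a method for a new class shows the dispatch in action. A minimal sketch using a made-up S3 class called "reading"; the class, constructor and values are all invented for illustration:

```r
# Constructor for a made-up S3 class called "reading"
new_reading <- function(values, unit) {
  structure(list(values = values, unit = unit), class = "reading")
}

# R finds this method by its name: generic "summary" + class "reading"
summary.reading <- function(object, ...) {
  paste0(
    "n = ", length(object$values),
    ", mean = ", mean(object$values), " ", object$unit
  )
}

temps <- new_reading(c(20, 22, 24), unit = "C")
summary(temps)           # dispatches to summary.reading()
summary(unclass(temps))  # now a plain list, so summary.default() is used
```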

Objects and Methods

methods(summary)
 [1] summary.aov                         summary.aovlist*                   
 [3] summary.aspell*                     summary.check_packages_in_dir*     
 [5] summary.connection                  summary.data.frame                 
 [7] summary.Date                        summary.default                    
 [9] summary.ecdf*                       summary.factor                     
[11] summary.glm                         summary.infl*                      
[13] summary.lm                          summary.loess*                     
[15] summary.manova                      summary.matrix                     
[17] summary.mlm*                        summary.nls*                       
[19] summary.packageStatus*              summary.POSIXct                    
[21] summary.POSIXlt                     summary.ppr*                       
[23] summary.prcomp*                     summary.princomp*                  
[25] summary.proc_time                   summary.rlang_error*               
[27] summary.rlang_message*              summary.rlang_trace*               
[29] summary.rlang_warning*              summary.rlang:::list_of_conditions*
[31] summary.srcfile                     summary.srcref                     
[33] summary.stepfun                     summary.stl*                       
[35] summary.table                       summary.tukeysmooth*               
[37] summary.warnings                   
see '?methods' for accessing help and source code

The class is given after the dot. Those marked with an asterisk are hidden.
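Hidden methods can still be inspected. A short sketch using summary.ecdf, one of the asterisked methods in the listing, with getS3method() and getAnywhere() from utils:

```r
# Retrieve a hidden method from the S3 registry
hidden <- getS3method("summary", "ecdf")
is.function(hidden)

# getAnywhere() also reports where the method was found
getAnywhere("summary.ecdf")$where

# The triple-colon operator is another route: stats:::summary.ecdf
```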

Objects and Methods

  • summary.data.frame()
    \(\implies\) used when summary() is called on a data.frame
  • summary.lm()
    \(\implies\) for an object of class lm (produced by lm())
  • summary.prcomp()
    \(\implies\) for an object of class prcomp (produced by prcomp())

. . .

  • If no method is written for a class \(\implies\) summary.default()
  • Look inside this using body(summary.default)
    • The last couple of lines were the output from summary(letters)

Objects and Methods

  • Can also see what methods exist for a given class
  • Before loading any packages ~56 methods exist for a data.frame
methods(class = "data.frame")
 [1] [             [[            [[<-          [<-           $<-           aggregate    
 [7] anyDuplicated anyNA         as.data.frame as.list       as.matrix     as.vector    
[13] by            cbind         coerce        dim           dimnames      dimnames<-   
[19] droplevels    duplicated    edit          format        formula       head         
[25] initialize    is.na         Math          merge         na.exclude    na.omit      
[31] Ops           plot          print         prompt        rbind         row.names    
[37] row.names<-   rowsum        show          slotsFromS3   sort_by       split        
[43] split<-       stack         str           subset        summary       Summary      
[49] t             tail          transform     type.convert  unique        unstack      
[55] within        xtfrm        
see '?methods' for accessing help and source code

Objects and Methods

  • Loading a new package will often introduce new methods
library(tidyverse)
methods(class = "data.frame")

. . .

  • Now we have ~170 methods for a data.frame

Objects and Methods

  • Most classes have a print() method
  • Determines what to print to the screen when calling an object
  • Most common use case for me is print(my_tbl, n = 20)
    • Can use to override the default number of rows printed
    • Calls print.tbl (which is hidden)

S3 Objects

  • Everything we’ve just seen applies to S3 objects
  • Very common class type (data.frame, list, htest, lm etc)

. . .

  • Sometimes classes have an explicit hierarchy
    • Best shown using is() instead of class()
is(band_members)
[1] "tbl_df"     "tbl"        "data.frame" "list"       "oldClass"   "vector"    

. . .

  • R looks for print.tbl_df() \(\rightarrow\) print.tbl() \(\rightarrow\) print.data.frame() etc
  • Will use the first one found
  • If none found \(\implies\) print.default()
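The search order can be seen with a toy generic. A sketch only: describe() and the classes "child" and "parent" are invented for illustration:

```r
# Made-up generic and classes, purely for illustration
describe <- function(x, ...) UseMethod("describe")
describe.default <- function(x, ...) "something"
describe.parent <- function(x, ...) "a parent"
describe.child <- function(x, ...) {
  # NextMethod() hands over to the next class in the class vector
  paste("a child, which is also", NextMethod())
}

obj <- structure(list(), class = c("child", "parent"))
describe(obj)                                  # tries describe.child() first
describe(structure(list(), class = "parent"))  # goes straight to describe.parent()
describe(1:3)                                  # no match, so describe.default()
```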

S4 Objects

Many Bioconductor Packages define S4 objects

  • Very strict controls on data structure
  • Can be frustrating at first
  • Use the @ symbol for “slots” as well as $ for list elements
    • Slots are strictly defined components
  • Methods are also strictly defined
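The strictness is easiest to see by defining a small class. A minimal sketch with a made-up "Gene" class; the slot names and values are invented:

```r
library(methods)

# A made-up S4 class with two strictly typed slots
setClass("Gene", slots = c(symbol = "character", width = "numeric"))

g <- new("Gene", symbol = "A", width = 1000)
g@symbol   # slots are accessed with @

# Supplying the wrong type for a slot fails when the object is created
bad <- tryCatch(
  new("Gene", symbol = "A", width = "long"),
  error = function(e) conditionMessage(e)
)
bad
```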

S4 Objects

  • Can be a little more challenging to interact with the tidyverse
  • Bioconductor pre-dates the tidyverse by > 10 years
  • tidyomics is an active area of Bioconductor development
    • Led by Stefano Mangiola from SAiGENCI

S4 Objects

  • Some packages use S4 implementations of S3 objects
    • data.frame (S3) Vs DataFrame (S4)
    • list (S3) Vs List (S4)
    • vector (S3) Vs Vector (S4)
    • rle (S3) Vs Rle (S4)
  • Many are written for memory efficiency
  • Look and behave similarly, but can occasionally trip you up
    • Object may require a DataFrame and you have a data.frame
    • Coercion is usually relatively simple between base-level classes

Many S4 objects & methods were developed in the days when compute resources were limited

Rle Vectors

  • These are Run-Length Encoded vectors
library(S4Vectors)
test <- c(rep("X", 10), rep("Y", 5))
test
 [1] "X" "X" "X" "X" "X" "X" "X" "X" "X" "X" "Y" "Y" "Y" "Y" "Y"
Rle(test)
character-Rle of length 15 with 2 runs
  Lengths:  10   5
  Values : "X" "Y"
  • Can encode millions of chromosome labels with minimal memory
    • Sorting can help keep memory usage down
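The effect of sorting can be sketched without Bioconductor. This uses base R's rle() as a stand-in for Rle(), on invented chromosome labels, so it shows the idea rather than the S4Vectors implementation:

```r
set.seed(1)

# Invented, sorted chromosome labels
chr <- c(rep("chr1", 1e5), rep("chr2", 5e4))
encoded <- rle(chr)
encoded$lengths   # one entry per run, not per element
encoded$values

# The encoding is lossless: inverse.rle() restores the original vector
identical(inverse.rle(encoded), chr)

# Shuffling creates many short runs, so the encoding saves far less
shuffled <- rle(sample(chr))
length(shuffled$lengths) > length(encoded$lengths)
```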

Variations on data.frame Objects

data.frame Objects

  • data.frame
    1. Can set rownames
    2. Dumps all data to your screen
    3. Column names with spaces are altered by default (check.names = TRUE)

. . .

  • tibble aka tbl_df
    1. rownames are always 1:nrow(df)
    2. Prints a summary with rownames hidden
    3. Column names with spaces permitted
    4. My preferred data.frame type
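How base R treats row names and column names with spaces can be checked directly. A small sketch on invented data, using only base R:

```r
# By default data.frame() converts names to syntactically valid ones
df <- data.frame("gene id" = c("A", "B"), n = 1:2)
names(df)

# Setting check.names = FALSE keeps the original name
df2 <- data.frame("gene id" = c("A", "B"), n = 1:2, check.names = FALSE)
names(df2)
df2[["gene id"]]

# Row names can be set and then used for subsetting
rownames(df2) <- c("first", "second")
df2["second", ]
```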

DataFrame objects

?DataFrame
  • An S4 version
    • Doesn’t work with the tidyverse (dplyr, ggplot2 etc)
    • Still missing from tidyomics
  • Until 2021 couldn’t coerce to a tibble directly
  • I hated that so wrote as_tibble() for DataFrame objects
    • In the package extraChIPs
    • Enables passing S4 objects to ggplot()
    • Please test & find any bugs I haven’t found yet

DataFrame objects

  • S3 Methods from dplyr will not work on DataFrame objects
  • Some equivalents exist (most pre-date the tidyverse)
    • subset() pre-dates dplyr::filter()
    • rbind() and combineRows() \(\implies\) bind_rows()
    • cbind(), combineCols() and merge() \(\implies\) joins
    • sort() \(\implies\) arrange()
    • unique() \(\implies\) distinct()
  • No simple equivalent for mutate(), summarise(), across(), pivot_*()
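The base-R verbs listed above can be tried on a plain data.frame. This is a sketch on invented data, and it assumes the same calls carry over to a DataFrame once S4Vectors is loaded:

```r
# Shown on a plain data.frame so the block runs without Bioconductor
genes <- data.frame(gene = c("A", "B", "B"), count = c(10, 0, 0))
gene_info <- data.frame(gene = c("A", "B"), width = c(1200, 800))

subset(genes, count > 0)              # like dplyr::filter()
unique(genes)                         # like dplyr::distinct()
rbind(genes, genes[1, ])              # like dplyr::bind_rows()
merge(genes, gene_info, by = "gene")  # like a join
```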

DataFrame objects

  • Can have columns of lists (so can tbl_df objects)
    • e.g. a CharacterList() from IRanges
    • S4 lists can be typed \(\implies\) memory efficiency
    • List objects can exist in a compressed form \(\implies\) memory efficiency
  • DataFrame objects can have S4 objects as columns
    • S3 data frames (including tibbles) cannot

By typing a list we only need to record the type once, instead of once for each element. Can make a big difference with large objects

DataFrame objects

library(IRanges)
genes <- c("A", "B")
transcripts <- CharacterList(
  c("A1", "A2", "A3"), c("B1", "B2")
)
transcripts
CharacterList of length 2
[[1]] A1 A2 A3
[[2]] B1 B2

. . .

DF <- DataFrame(Gene = genes, Transcripts = transcripts)
DF
DataFrame with 2 rows and 2 columns
         Gene     Transcripts
  <character> <CharacterList>
1           A        A1,A2,A3
2           B           B1,B2

. . .

library(extraChIPs)
as_tibble(DF)
# A tibble: 2 × 2
  Gene  Transcripts
  <chr> <list>     
1 A     <chr [3]>  
2 B     <chr [2]>  

DataFrame objects

  • Object-level metadata can be added to DataFrame objects
  • Must be a list
metadata(DF) <- list(details = "Created for RAdelaide 2024")
glimpse(DF)
Formal class 'DFrame' [package "S4Vectors"] with 6 slots
  ..@ rownames       : NULL
  ..@ nrows          : int 2
  ..@ elementType    : chr "ANY"
  ..@ elementMetadata: NULL
  ..@ metadata       :List of 1
  .. ..$ details: chr "Created for RAdelaide 2024"
  ..@ listData       :List of 2
  .. ..$ Gene       : chr [1:2] "A" "B"
  .. ..$ Transcripts:Formal class 'CompressedCharacterList' [package "IRanges"] with 5 slots

. . .

  • Notice where rownames are and how the “columns” are stored

Point out the @ structure

DataFrame objects

  • Also enable the addition of column-specific metadata \(\implies\) mcols()
    • Used heavily in Bioconductor
mcols(DF) <- DataFrame(meta =  c("Made-up genes", "Made-up transcripts"))
mcols(DF)
DataFrame with 2 rows and 1 column
                           meta
                    <character>
Gene              Made-up genes
Transcripts Made-up transcripts

. . .

glimpse(DF) # This is in the @elementMetadata slot
Formal class 'DFrame' [package "S4Vectors"] with 6 slots
  ..@ rownames       : NULL
  ..@ nrows          : int 2
  ..@ elementType    : chr "ANY"
  ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
  ..@ metadata       :List of 1
  .. ..$ details: chr "Created for RAdelaide 2024"
  ..@ listData       :List of 2
  .. ..$ Gene       : chr [1:2] "A" "B"
  .. ..$ Transcripts:Formal class 'CompressedCharacterList' [package "IRanges"] with 5 slots

S4 Object Structure

  • S4 objects have slots denoted with @
    • Subtly different to list elements
  • These are fixed and mandatory for every S4 class
    • Can still be empty (NULL) objects
    • Can be S3 or S4 objects
  • Enforces a strict structure with checks as objects are formed
    • Saves time performing checks within functions
    • Makes structure strict, rigid & hard to break

S3 objects are easy to break. Just change the class attribute…
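A quick sketch of how little protection S3 offers. The object below is invented and is clearly not a fitted model, yet R accepts the class label without complaint:

```r
# Any object can be given any S3 class, with no checks at all
fake <- structure(list(note = "not a model"), class = "lm")
inherits(fake, "lm")

# Methods for that class then fail, or return nonsense, when called
result <- tryCatch(
  summary(fake),
  error = function(e) "summary.lm() failed"
)
result
```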

S4 Object Structure

  • We can’t lapply our way through these objects
  • Can access each slot using the shortcut object@slotName
    • More formally using slot(object, "slotName")
DF@listData
$Gene
[1] "A" "B"

$Transcripts
CharacterList of length 2
[[1]] A1 A2 A3
[[2]] B1 B2
slot(DF, "listData")
$Gene
[1] "A" "B"

$Transcripts
CharacterList of length 2
[[1]] A1 A2 A3
[[2]] B1 B2

S4 Object Structure

  • Slot names can be found using slotNames(object)
slotNames(DF)
[1] "rownames"        "nrows"           "elementType"     "elementMetadata" "metadata"       
[6] "listData"       

. . .

  • The full type description can also be found
getSlots("DFrame")
           rownames               nrows         elementType     elementMetadata            metadata 
"character_OR_NULL"           "integer"         "character" "DataFrame_OR_NULL"              "list" 
           listData 
             "list" 

S4 Methods

  • S3 method dispatch uses the generic.class naming syntax (e.g. summary.lm)
  • S4 is very different but has some similarities

. . .

  • S4 objects almost always have hierarchical classes
    • Increasingly common for S3 objects
    • Each level extends the lower level class

. . .

  • Methods are strictly defined by package authors
    • A Generic function must be defined for each method/class
    • The hierarchy is traversed until a method is found
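These pieces fit together in a few lines. A minimal sketch with made-up classes "Feature" and "Transcript" and a made-up generic label():

```r
library(methods)

# Made-up classes: "Transcript" extends "Feature"
setClass("Feature", slots = c(id = "character"))
setClass("Transcript", contains = "Feature", slots = c(exons = "integer"))

# The generic must exist before methods can be attached to it
setGeneric("label", function(x, ...) standardGeneric("label"))
setMethod("label", "Feature", function(x, ...) paste("Feature", x@id))

tx <- new("Transcript", id = "A1", exons = 3L)
is(tx)      # the class itself, then the class it extends
label(tx)   # no Transcript method exists, so the Feature method is used
```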

S4 Methods

  • To check the class hierarchy of an object use is()
is(DF)
 [1] "DFrame"            "DataFrame"         "SimpleList"        "RectangularData"  
 [5] "List"              "DataFrame_OR_NULL" "Vector"            "list_OR_List"     
 [9] "Annotated"         "vector_OR_Vector" 

. . .

  • Logical tests can be applied
is(DF, "DataFrame")
[1] TRUE
is(DF, "data.frame")
[1] FALSE

. . .

methods(class = "DataFrame")

S4 Methods

  • The function body() will return standardGeneric()
    • Slightly different to UseMethod()

. . .

  • To show code within a function
getMethod(f = "nrow", signature = "DataFrame")
Method Definition:

function (x) 
x@nrows
<bytecode: 0x56387f2964f0>
<environment: namespace:S4Vectors>

Signatures:
        x          
target  "DataFrame"
defined "DataFrame"
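Methods inherited from a parent class are not always found by an exact lookup. A sketch with invented classes and an invented generic size(), contrasting the exact and inheritance-aware helpers in the methods package:

```r
library(methods)

# Made-up classes and generic, purely for illustration
setClass("ParentClass", slots = c(n = "integer"))
setClass("ChildClass", contains = "ParentClass")
setGeneric("size", function(x) standardGeneric("size"))
setMethod("size", "ParentClass", function(x) x@n)

existsMethod("size", "ChildClass")  # FALSE: nothing defined for ChildClass itself
hasMethod("size", "ChildClass")     # TRUE: a method is inherited from ParentClass

# selectMethod() follows inheritance and returns the method actually used
selectMethod("size", "ChildClass")
```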

Recap

  • The Bioconductor Project extends back to the early days of R
    • Also the early days of bioinformatics
    • Has a genuine community aspect
  • S4 object classes are common
    • Foundational classes enabling package inter-operability
    • Much less common in CRAN packages (mostly spatial/GIS)
  • Method dispatch is handled differently

. . .

  • Can play badly with the tidyverse
    • An area of active development