Differential Gene Expression

RAdelaide 2024

Author

Affiliation

Dr Stevie Pederson

Black Ochre Data Labs
Telethon Kids Institute

Published

July 11, 2024

Differential Gene Expression

The most common question:
Which genes change expression levels in response to a treatment?

. . .

Need to perform a statistical test on each gene
- Select genes-of-interest using some criteria

. . .

Dealing with count data \(\implies\) can’t be Normally distributed
Generally assumed to have a Negative Binomial distribution
- Essentially a Poisson distribution with additional variability

Today’s Data

Data obtained from Gene Expression Omnibus GSE171742
Stem cell derived cardiomyocytes with 4 treatment groups (Liu et al. 2023)
All transfected using lentiviral vectors

. . .

Control
Nsp6 over-expression
Nsp8 over-expression
M over-expression

Today’s Data

Today’s Data

Well documented metadata! 🤓
We know exact versions of all tools and annotations

Loading Packages

library(tidyverse)
library(magrittr)
library(edgeR)
library(AnnotationHub)
library(rtracklayer)
library(plyranges)
library(patchwork)
library(scales)
library(glue)
library(ggrepel)
library(pheatmap)
library(parallel)
theme_set(theme_bw())

Getting Annotations

Which one shall we choose?

ah <- AnnotationHub()
ah %>% subset(dataprovider == "Gencode" & genome == "GRCh38") %>% query("v31")

AnnotationHub with 9 records
# snapshotDate(): 2024-04-30
# $dataprovider: Gencode
# $species: Homo sapiens
# $rdataclass: GRanges
# additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer,
#   rdatadateadded, preparerclass, tags, rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH75118"]]' 

            title                                                    
  AH75118 | gencode.v31.2wayconspseudos.gff3.gz                      
  AH75119 | gencode.v31.annotation.gff3.gz                           
  AH75120 | gencode.v31.basic.annotation.gff3.gz                     
  AH75121 | gencode.v31.chr_patch_hapl_scaff.annotation.gff3.gz      
  AH75122 | gencode.v31.chr_patch_hapl_scaff.basic.annotation.gff3.gz
  AH75123 | gencode.v31.long_noncoding_RNAs.gff3.gz                  
  AH75124 | gencode.v31.polyAs.gff3.gz                               
  AH75125 | gencode.v31.primary_assembly.annotation.gff3.gz          
  AH75126 | gencode.v31.tRNAs.gff3.gz

Getting Annotations

gtf <- ah[["AH75121"]] # This will take several minutes
genes <- gtf %>% 
  filter(type == "gene") %>% 
  select(starts_with("gene"))
genes

GRanges object with 66738 ranges and 3 metadata columns:
            seqnames        ranges strand |           gene_id              gene_type   gene_name
               <Rle>     <IRanges>  <Rle> |       <character>            <character> <character>
      [1]       chr1   11869-14409      + | ENSG00000223972.5 transcribed_unproces..     DDX11L1
      [2]       chr1   14404-29570      - | ENSG00000227232.5 unprocessed_pseudogene      WASH7P
      [3]       chr1   17369-17436      - | ENSG00000278267.1                  miRNA   MIR6859-1
      [4]       chr1   29554-31109      + | ENSG00000243485.5                 lncRNA MIR1302-2HG
      [5]       chr1   30366-30503      + | ENSG00000284332.1                  miRNA   MIR1302-2
      ...        ...           ...    ... .               ...                    ...         ...
  [66734] KI270734.1   72411-74814      + | ENSG00000276017.1         protein_coding  AC007325.1
  [66735] KI270734.1 131494-137392      + | ENSG00000278817.1         protein_coding  AC007325.4
  [66736] KI270734.1 138082-161852      - | ENSG00000277196.4         protein_coding  AC007325.2
  [66737] KI270744.1   51009-51114      - | ENSG00000278625.1                  snRNA     RF00026
  [66738] KI270750.1 148668-148843      + | ENSG00000277374.1                  snRNA     RF00003
  -------
  seqinfo: 407 sequences from an unspecified genome; no seqlengths

Getting Gene Lengths

The simplest way is to add the width of all exonic regions

gene_lengths <- gtf %>% 
  filter(type == "exon") %>% 
  split(.$gene_id) %>% 
  reduce() %>% 
  width() %>% 
  mclapply(sum, mc.cores = 4) %>% 
  unlist()
genes$length <- gene_lengths[genes$gene_id]

Loading Counts

Data is exactly as produced by featureCounts (Liao, Smyth, and Shi 2014)
- Tab-delimited but with some weird columns
Let’s have a sneak preview

. . .

read_tsv("data/GSE171742_counts.out.gz", n_max = 6)

# A tibble: 6 × 18
  Geneid      Chr   Start End   Strand Length `1_control` `6_control` `11_control` `2_Nsp6` `7_Nsp6`
  <chr>       <chr> <chr> <chr> <chr>   <dbl>       <dbl>       <dbl>        <dbl>    <dbl>    <dbl>
1 ENSG000000… chrX… 1006… 1006… -;-;-…   4535        5211        6192         4542     4328     5283
2 ENSG000000… chrX… 1005… 1005… +;+;+…   1476          65         178           96       50       86
3 ENSG000000… chr2… 5093… 5093… -;-;-…   1207        1454        1889         1329     1472     1684
4 ENSG000000… chr1… 1698… 1698… -;-;-…   6883         628         548          522      560      636
5 ENSG000000… chr1… 1696… 1696… +;+;+…   5970         191         241          150      146      254
6 ENSG000000… chr1… 2761… 2761… -;-;-…   3382           0           0            0        0        0
# ℹ 7 more variables: `12_Nsp6` <dbl>, `3_Nsp8` <dbl>, `8_Nsp8` <dbl>, `13_Nsp8` <dbl>,
#   `4_M` <dbl>, `9_M` <dbl>, `14_M` <dbl>

. . .

Which columns do we need? How would you parse them?

Loading Counts

The best form for counts is as a matrix

counts <- read_tsv("data/GSE171742_counts.out.gz") %>% 
  dplyr::select(Geneid, contains("_")) %>% 
  as.data.frame() %>% 
  column_to_rownames("Geneid") %>% 
  as.matrix()
glimpse(counts)

 num [1:66738, 1:12] 5211 65 1454 628 191 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:66738] "ENSG00000000003.14" "ENSG00000000005.6" "ENSG00000000419.12" "ENSG00000000457.14" ...
  ..$ : chr [1:12] "1_control" "6_control" "11_control" "2_Nsp6" ...

Checking Annotations

Compare these with the annotations we have

setdiff(rownames(counts), genes$gene_id)

 [1] "ENSG00000002586.20_PAR_Y" "ENSG00000124333.16_PAR_Y" "ENSG00000124334.17_PAR_Y"
 [4] "ENSG00000167393.17_PAR_Y" "ENSG00000168939.11_PAR_Y" "ENSG00000169084.14_PAR_Y"
 [7] "ENSG00000169093.16_PAR_Y" "ENSG00000169100.14_PAR_Y" "ENSG00000178605.13_PAR_Y"
[10] "ENSG00000182162.11_PAR_Y" "ENSG00000182378.14_PAR_Y" "ENSG00000182484.15_PAR_Y"
[13] "ENSG00000185203.12_PAR_Y" "ENSG00000185291.11_PAR_Y" "ENSG00000185960.14_PAR_Y"
[16] "ENSG00000196433.13_PAR_Y" "ENSG00000197976.12_PAR_Y" "ENSG00000198223.16_PAR_Y"
[19] "ENSG00000205755.11_PAR_Y" "ENSG00000214717.12_PAR_Y" "ENSG00000223274.6_PAR_Y" 
[22] "ENSG00000223484.7_PAR_Y"  "ENSG00000223511.7_PAR_Y"  "ENSG00000223571.6_PAR_Y" 
[25] "ENSG00000223773.7_PAR_Y"  "ENSG00000225661.7_PAR_Y"  "ENSG00000226179.6_PAR_Y" 
[28] "ENSG00000227159.8_PAR_Y"  "ENSG00000228410.6_PAR_Y"  "ENSG00000228572.7_PAR_Y" 
[31] "ENSG00000229232.6_PAR_Y"  "ENSG00000230542.6_PAR_Y"  "ENSG00000234622.6_PAR_Y" 
[34] "ENSG00000234958.6_PAR_Y"  "ENSG00000236017.8_PAR_Y"  "ENSG00000236871.7_PAR_Y" 
[37] "ENSG00000237040.6_PAR_Y"  "ENSG00000237531.6_PAR_Y"  "ENSG00000237801.6_PAR_Y" 
[40] "ENSG00000265658.6_PAR_Y"  "ENSG00000270726.6_PAR_Y"  "ENSG00000275287.5_PAR_Y" 
[43] "ENSG00000277120.5_PAR_Y"  "ENSG00000280767.3_PAR_Y"  "ENSG00000281849.3_PAR_Y"

If we’re doing everything, we don’t need to check these
Only for public data. Annotations can be difficult to untangle

Should we remove the suffixes?

Checking Annotations

genes %>% subset(gene_id == "ENSG00000002586.20")

GRanges object with 2 ranges and 4 metadata columns:
      seqnames          ranges strand |            gene_id      gene_type   gene_name    length
         <Rle>       <IRanges>  <Rle> |        <character>    <character> <character> <integer>
  [1]     chrX 2691187-2741309      + | ENSG00000002586.20 protein_coding        CD99      9716
  [2]     chrY 2691187-2741309      + | ENSG00000002586.20 protein_coding        CD99      9716
  -------
  seqinfo: 407 sequences from an unspecified genome; no seqlengths

How can we resolve this?

Checking Annotations

genes <- genes %>% 
  mutate(
    gene_id = case_when(
      duplicated(gene_id) ~ paste0(gene_id, "_PAR_Y"),
      TRUE ~ gene_id
    )
  ) 
subset(genes, duplicated(gene_id))

GRanges object with 0 ranges and 4 metadata columns:
   seqnames    ranges strand |     gene_id   gene_type   gene_name    length
      <Rle> <IRanges>  <Rle> | <character> <character> <character> <integer>
  -------
  seqinfo: 407 sequences from an unspecified genome; no seqlengths

DGE Analysis

Basic Workflow

Remove low-expressed & undetectable genes
Normalise the data
Estimate dispersions
Perform Statistical Tests
Enrichment Testing

. . .

Careful checks at every step to inform decisions

DGEList Objects

We’ll use edgeR for DGE analysis (Robinson, McCarthy, and Smyth 2010)
Counts, sample metadata and gene annotations stored in a single S4 object
- DGEList: Digital Gene Expression List
- Gene & metadata elements need to be data.frame objects
- Will be coerced if able . . .
DESeq2 is a common alternative (Love, Huber, and Anders 2014)
Uses an extension of SummarizedExperiment objects

Both methods are gold-standard but neither perfect for every dataset
Both Wolfgang Huber & Gordon Smyth are exceptional statisticians
Will try get to DESeq2 later in the session

DGEList Objects

How can we get sample metadata?

samples <- tibble(id = colnames(counts)) %>% 
  separate(id, into = c("replicate", "treatment"), remove = FALSE) %>% 
  mutate(
    treatment = as.factor(treatment),
    group = as.integer(treatment)
  )
samples

# A tibble: 12 × 4
   id         replicate treatment group
   <chr>      <chr>     <fct>     <int>
 1 1_control  1         control       1
 2 6_control  6         control       1
 3 11_control 11        control       1
 4 2_Nsp6     2         Nsp6          3
 5 7_Nsp6     7         Nsp6          3
 6 12_Nsp6    12        Nsp6          3
 7 3_Nsp8     3         Nsp8          4
 8 8_Nsp8     8         Nsp8          4
 9 13_Nsp8    13        Nsp8          4
10 4_M        4         M             2
11 9_M        9         M             2
12 14_M       14        M             2

Will be in the same order as counts

DGEList Objects

Annotations appear to match the counts
- Counts for 66,738 genes
- Annotations for 66,738 genes
But will be in a different order
- Counts in order of name
- Annotations in genomic order

DGEList Objects

genes %>% setNames(.$gene_id) %>% .[rownames(counts)]

GRanges object with 66738 ranges and 4 metadata columns:
                     seqnames              ranges strand |            gene_id      gene_type
                        <Rle>           <IRanges>  <Rle> |        <character>    <character>
  ENSG00000000003.14     chrX 100627109-100639991      - | ENSG00000000003.14 protein_coding
   ENSG00000000005.6     chrX 100584936-100599885      + |  ENSG00000000005.6 protein_coding
  ENSG00000000419.12    chr20   50934867-50958555      - | ENSG00000000419.12 protein_coding
  ENSG00000000457.14     chr1 169849631-169894267      - | ENSG00000000457.14 protein_coding
  ENSG00000000460.17     chr1 169662007-169854080      + | ENSG00000000460.17 protein_coding
                 ...      ...                 ...    ... .                ...            ...
   ENSG00000288107.1    chr10 133565797-133629507      + |  ENSG00000288107.1         lncRNA
   ENSG00000288108.1     chr5     1931058-1933985      - |  ENSG00000288108.1         lncRNA
   ENSG00000288109.1    chr17   67658015-67659465      - |  ENSG00000288109.1         lncRNA
   ENSG00000288110.1     chr8     4496392-4503392      + |  ENSG00000288110.1         lncRNA
   ENSG00000288111.1     chr3 130181346-130188814      + |  ENSG00000288111.1         lncRNA
                       gene_name    length
                     <character> <integer>
  ENSG00000000003.14      TSPAN6      4535
   ENSG00000000005.6        TNMD      1476
  ENSG00000000419.12        DPM1      1207
  ENSG00000000457.14       SCYL3      6883
  ENSG00000000460.17    C1orf112      5970
                 ...         ...       ...
   ENSG00000288107.1  AL731769.2      5662
   ENSG00000288108.1  AC126768.3       509
   ENSG00000288109.1  AC079331.3       748
   ENSG00000288110.1  AC010941.1      2417
   ENSG00000288111.1  AC130888.1       454
  -------
  seqinfo: 407 sequences from an unspecified genome; no seqlengths

DGEList Objects

dge <- DGEList(
  counts = counts,
  samples = tibble(id = colnames(counts)) %>% left_join(samples),
  genes = genes %>% 
    setNames(.$gene_id) %>% 
    .[rownames(counts)] %>% 
    as.data.frame(row.names = names(.))
)
dge

An object of class "DGEList"
$counts
                   1_control 6_control 11_control 2_Nsp6 7_Nsp6 12_Nsp6 3_Nsp8 8_Nsp8 13_Nsp8  4_M
ENSG00000000003.14      5211      6192       4542   4328   5283    4336   4330   5208    4635 4743
ENSG00000000005.6         65       178         96     50     86      83     67     92      80   51
ENSG00000000419.12      1454      1889       1329   1472   1684    1368   1666   2042    1707 1822
ENSG00000000457.14       628       548        522    560    636     492    530    500     498  622
ENSG00000000460.17       191       241        150    146    254     164    161    179     183  173
                    9_M 14_M
ENSG00000000003.14 5498 5172
ENSG00000000005.6    70   64
ENSG00000000419.12 2293 1987
ENSG00000000457.14  513  560
ENSG00000000460.17  256  228
66733 more rows ...

$samples
           group lib.size norm.factors         id replicate treatment
1_control      1 38633196            1  1_control         1   control
6_control      1 48184412            1  6_control         6   control
11_control     1 34566631            1 11_control        11   control
2_Nsp6         3 33088151            1     2_Nsp6         2      Nsp6
7_Nsp6         3 41621703            1     7_Nsp6         7      Nsp6
7 more rows ...

$genes
                   seqnames     start       end  width strand            gene_id      gene_type
ENSG00000000003.14     chrX 100627109 100639991  12883      - ENSG00000000003.14 protein_coding
ENSG00000000005.6      chrX 100584936 100599885  14950      +  ENSG00000000005.6 protein_coding
ENSG00000000419.12    chr20  50934867  50958555  23689      - ENSG00000000419.12 protein_coding
ENSG00000000457.14     chr1 169849631 169894267  44637      - ENSG00000000457.14 protein_coding
ENSG00000000460.17     chr1 169662007 169854080 192074      + ENSG00000000460.17 protein_coding
                   gene_name length
ENSG00000000003.14    TSPAN6   4535
ENSG00000000005.6       TNMD   1476
ENSG00000000419.12      DPM1   1207
ENSG00000000457.14     SCYL3   6883
ENSG00000000460.17  C1orf112   5970
66733 more rows ...

DGEList Objects

Key elements have different but related dimensions

dim(dge)

[1] 66738    12

dim(dge$counts)

[1] 66738    12

dim(dge$samples)

[1] 12  6

dim(dge$genes)

[1] 66738     9

samples is column metadata \(\implies\) genes is row metadata

Library Sizes

The very first step with counts \(\implies\) library sizes
This is the total number of reads (i.e. RNA fragments) assigned to genes
Ideally >20m reads per sample
Those below 10m reads are generally considered as problematic
- Context dependent

Library Sizes

dge$samples %>% 
  ggplot(aes(replicate, lib.size / 1e6, fill = treatment)) +
  geom_col() +
  facet_wrap(~treatment, nrow = 1, scales = "free_x") +
  labs(x = "Sample", y = "Library Size (millions)", fill = "Treatment") +
  scale_y_continuous(expand = expansion(c(0, 0.05))) +
  scale_fill_brewer(palette = "Set1")

Click to show figure

Library Sizes

We can get fancy and add the mean value

dge$samples %>% 
  ggplot(aes(replicate, lib.size / 1e6, fill = treatment)) +
  geom_col() +
  geom_hline(
    aes(yintercept = lib.size),
    data = . %>% summarise(lib.size = mean(lib.size / 1e6)),
    linetype = 2, colour = "grey30"
  ) +
  facet_wrap(~treatment, nrow = 1, scales = "free_x") +
  labs(x = "Sample", y = "Library Size (millions)", fill = "Treatment") +
  scale_y_continuous(expand = expansion(c(0, 0.05))) +
  scale_fill_brewer(palette = "Set1")

Click to show figure

These look like really good libraries so far

Detected Vs Undetected Genes

By adding counts across samples and seeing how many are zero
\(\implies\) non-detectable genes

dge$counts %>% 
  rowSums() %>% 
  equals(0) %>% 
  sum()

[1] 27211

. . .

Genes with low counts can also be removed
- These end up being highly variable
- Uninformative statistically

Detected Vs Undetected Genes

No set method for considering a gene as too low to bother with
Can use raw counts below some value
Counts within a group above some value \(\implies\) detected within a treatment

. . .

Counts per million reads (CPM) is a common measure used in abundance analysis
Take the library sizes / 1e6 & divide counts by this value \(\implies\) CPM
Can also use logCPM
- Adds a non-zero value to every gene to avoid zeroes

Detected Vs Undetected Genes

edgeR provides a function filterByExpr()
Flags genes for removal or discarding based on counts
Set minimum CPM as min.count \(\times 1e6\) / median(lib.size)
Set minimum total counts

. . .

Genes are retained if CPM > minCPM in \(\geq n\) samples
- \(n\) is the smallest treatment group
Default settings may be a little generous for this dataset

Detected Vs Undetected Genes

If we require > 1CPM in at least 3 samples

min_count <- dge$samples$lib.size %>% 
  median() %>% 
  divide_by(1e6) %>% 
  round(0)
min_count

[1] 38

min_total <- 150
keep <- filterByExpr(dge, min.count = min_count, min.total.count = 150)

Detected Vs Undetected Genes

A <- dge$counts %>% 
  cpm(log = TRUE) %>% 
  as_tibble(rownames = "gene_id") %>% 
  pivot_longer(-all_of("gene_id"), names_to = "id", values_to = "logCPM") %>% 
  left_join(dge$samples) %>% 
  ggplot(aes(logCPM, after_stat(density), group = id)) +
  geom_density(aes(colour = treatment)) +
  ggtitle(glue("Before Filtering: {comma(length(keep))} genes")) +
  labs(y = "Density", colour = "Treatment") +
  scale_colour_brewer(palette = "Set1")
B <- dge$counts[keep,] %>% 
  cpm(log = TRUE) %>% 
  as_tibble(rownames = "gene_id") %>% 
  pivot_longer(-all_of("gene_id"), names_to = "id", values_to = "logCPM") %>% 
  left_join(dge$samples) %>% 
  ggplot(aes(logCPM, after_stat(density), group = id)) +
  geom_density(aes(colour = treatment)) +
  ggtitle(glue("After Filtering: {comma(sum(keep))} genes")) +
  labs(y = "Density", colour = "Treatment") +
  scale_colour_brewer(palette = "Set1")
A + B + plot_layout(guides = "collect", axes = "collect")

Click to show figure

Detected Vs Undetected Genes

Create new object once satisfied with the filtering strategy
Passing the logical vector will subset counts & genes correctly
- The original library sizes are retained

dge_filter <- dge[keep,]
dge_filter

An object of class "DGEList"
$counts
                   1_control 6_control 11_control 2_Nsp6 7_Nsp6 12_Nsp6 3_Nsp8 8_Nsp8 13_Nsp8  4_M
ENSG00000000003.14      5211      6192       4542   4328   5283    4336   4330   5208    4635 4743
ENSG00000000005.6         65       178         96     50     86      83     67     92      80   51
ENSG00000000419.12      1454      1889       1329   1472   1684    1368   1666   2042    1707 1822
ENSG00000000457.14       628       548        522    560    636     492    530    500     498  622
ENSG00000000460.17       191       241        150    146    254     164    161    179     183  173
                    9_M 14_M
ENSG00000000003.14 5498 5172
ENSG00000000005.6    70   64
ENSG00000000419.12 2293 1987
ENSG00000000457.14  513  560
ENSG00000000460.17  256  228
13891 more rows ...

$samples
           group lib.size norm.factors         id replicate treatment
1_control      1 38633196            1  1_control         1   control
6_control      1 48184412            1  6_control         6   control
11_control     1 34566631            1 11_control        11   control
2_Nsp6         3 33088151            1     2_Nsp6         2      Nsp6
7_Nsp6         3 41621703            1     7_Nsp6         7      Nsp6
7 more rows ...

$genes
                   seqnames     start       end  width strand            gene_id      gene_type
ENSG00000000003.14     chrX 100627109 100639991  12883      - ENSG00000000003.14 protein_coding
ENSG00000000005.6      chrX 100584936 100599885  14950      +  ENSG00000000005.6 protein_coding
ENSG00000000419.12    chr20  50934867  50958555  23689      - ENSG00000000419.12 protein_coding
ENSG00000000457.14     chr1 169849631 169894267  44637      - ENSG00000000457.14 protein_coding
ENSG00000000460.17     chr1 169662007 169854080 192074      + ENSG00000000460.17 protein_coding
                   gene_name length
ENSG00000000003.14    TSPAN6   4535
ENSG00000000005.6       TNMD   1476
ENSG00000000419.12      DPM1   1207
ENSG00000000457.14     SCYL3   6883
ENSG00000000460.17  C1orf112   5970
13891 more rows ...

Normalisation

Normalisation is a very common step in transcriptomic analysis
Helps to account for differences in library composition
- e.g. are some libraries dominated by a handful of highly abundant genes
The default in edgeR is TMM (Robinson and Oshlack 2010)
DESeq2 uses RLE normalisation
- Very similar in principle

Statistical Analysis of Counts

Poisson models number of events per unit
- e.g. telephone calls per minute
The rate parameter (\(\lambda\)) models this
mean occurrence = average rate = \(\lambda\)
Variance is fixed at \(\lambda\)
- If variance \(\neq \lambda \implies\) not Poisson data

. . .

The basic model for edgeR is a Negative Binomial Model
- NB: mean = \(\lambda\); variance = \(\lambda + r\)
Multiple strategies for obtaining estimates

Statistical Analysis of Counts

Counts / gene (i.e. \(y_{gi}\)) are a function of library size (\(L_i\))
Gene length is fixed across samples \(\implies\) library size is not
Dividing by library size is a way of normalising counts

\[ \log (y_{gi} / L_i) = \beta_0 + \beta_1 + ... \]

Normalisation

Calculation of normalisation factors (\(Y_i\)) per sample
- The idealised normalisation factor is \(Y_i = 1\)
Can use scaled library sizes \(L^*_i = Y_i \times L_i\)

\[ \log (y_{gi} / L^*_i) = \beta_0 + \beta_1 + ... \]

TMM Normalisation

dge_filter <- calcNormFactors(dge_filter)
dge_filter$samples

           group lib.size norm.factors         id replicate treatment
1_control      1 38633196    0.9943180  1_control         1   control
6_control      1 48184412    0.9770121  6_control         6   control
11_control     1 34566631    0.9883140 11_control        11   control
2_Nsp6         3 33088151    1.0157571     2_Nsp6         2      Nsp6
7_Nsp6         3 41621703    0.9963827     7_Nsp6         7      Nsp6
12_Nsp6        3 32724713    1.0060007    12_Nsp6        12      Nsp6
3_Nsp8         4 33502022    1.0076816     3_Nsp8         3      Nsp8
8_Nsp8         4 41364663    0.9977052     8_Nsp8         8      Nsp8
13_Nsp8        4 36288023    1.0027832    13_Nsp8        13      Nsp8
4_M            2 36851725    1.0081808        4_M         4         M
9_M            2 41546877    0.9987653        9_M         9         M
14_M           2 43587502    1.0076991       14_M        14         M

The clustering around 1 suggests very similar libraries

PCA Analysis

edgeR offers plotMDS() for quick exploration

plotMDS(
  dge_filter, 
  col = as.integer(dge_filter$samples$treatment)
)

Takes top 500 most variable genes
Uses base graphics

PCA Analysis

logCPM values are very useful for PCA
- Will now be calculated using the normalisation factors

pca <- dge_filter %>% 
  cpm(log = TRUE) %>% 
  t() %>% 
  prcomp(center = TRUE, scale. = TRUE)
summary(pca)

Importance of components:
                           PC1     PC2     PC3     PC4     PC5      PC6      PC7      PC8      PC9
Standard deviation     66.6493 50.4749 45.2310 31.7401 29.3998 25.88752 24.74707 21.51801 20.94233
Proportion of Variance  0.3197  0.1833  0.1472  0.0725  0.0622  0.04823  0.04407  0.03332  0.03156
Cumulative Proportion   0.3197  0.5030  0.6502  0.7227  0.7849  0.83316  0.87723  0.91056  0.94212
                           PC10     PC11      PC12
Standard deviation     20.27916 19.82675 2.753e-13
Proportion of Variance  0.02959  0.02829 0.000e+00
Cumulative Proportion   0.97171  1.00000 1.000e+00

prop_var <- summary(pca)$importance["Proportion of Variance",]

PCA Analysis

PC1 vs PC2 shown below \(\implies\) easily modified to other components

pca %>% 
  broom::tidy() %>% 
  pivot_wider(names_from = "PC", names_prefix = "PC", values_from = "value") %>% 
  dplyr::rename(id = row) %>% 
  left_join(dge_filter$samples) %>% 
  ggplot(aes(PC1, PC2, colour = treatment)) +
  geom_text(aes(label = id), show.legend = FALSE) +
  scale_colour_brewer(palette = "Set1")

Click to show figure

pca %>% 
  broom::tidy() %>% 
  pivot_wider(names_from = "PC", names_prefix = "PC", values_from = "value") %>% 
  dplyr::rename(id = row) %>% 
  left_join(dge_filter$samples) %>% 
  ggplot(aes(PC1, PC2, colour = treatment)) +
  geom_point() +
  geom_text_repel(aes(label = id), show.legend = FALSE) +
  labs(
    x = glue("PC1 ({percent(prop_var['PC1'])} Variance)"),
    y = glue("PC2 ({percent(prop_var['PC2'])} Variance)"),
    colour = "Treatment"
  ) +
  scale_colour_brewer(palette = "Set1")

Click to show figure

PCA Analysis

What do we think the above is telling us?
Does it look like treatment is contributing to the variance?
Are there any other potential sources of variability?

Looks to me like samples 2-4 had transfection issues
Need to discuss with wet lab folks

The Design Matrix

Before proceeding need to define a design matrix

X <- model.matrix(~treatment, dge_filter$samples)
X

           (Intercept) treatmentM treatmentNsp6 treatmentNsp8
1_control            1          0             0             0
6_control            1          0             0             0
11_control           1          0             0             0
2_Nsp6               1          0             1             0
7_Nsp6               1          0             1             0
12_Nsp6              1          0             1             0
3_Nsp8               1          0             0             1
8_Nsp8               1          0             0             1
13_Nsp8              1          0             0             1
4_M                  1          1             0             0
9_M                  1          1             0             0
14_M                 1          1             0             0
attr(,"assign")
[1] 0 1 1 1
attr(,"contrasts")
attr(,"contrasts")$treatment
[1] "contr.treatment"

The Design Matrix

The first column will estimate average expression in control samples
Remaining columns estimate the difference due to treatments
(Almost) always modelled on the log2 scale \(\implies\) logFC

Estimating Dispersions

We can estimate the dispersion parameter across the entire dataset
edgeR models consider technical noise as Poisson with biological noise as over-dispersion
This adds multiple elements to the DGEList
- Essential for edgeR models
Initial estimates are moderated using a global trend during model fitting

dge_filter <- estimateDisp(dge_filter, design = X)

Estimating Dispersions

# Output uses base graphics no ggplot
plotBCV(dge_filter, pch = 2)

Model Fitting

The recommended approach for edgeR is to use Quasi-Likelihood fits (Lund et al. 2012)
- Allows flexibility in over-dispersion parameters
First we fit the model \(\implies\) then perform testing
Using our design matrix, we will fit the model for all coefficients in X
- Perform tests within each co-efficient
- We’ll use the ‘M’ treatment

fit <- glmQLFit(dge_filter, design = X)

Model Testing

Statistical tests use quasi-likelihood F-tests
Use moderate dispersions (Empirical Bayes)
By default, the top10 genes are returned

fit %>% glmQLFTest(coef = "treatmentM") %>% topTags()

Coefficient:  treatmentM 
                   seqnames     start       end  width strand            gene_id      gene_type
ENSG00000065609.14     chr6  83552880  83709691 156812      - ENSG00000065609.14 protein_coding
ENSG00000133878.9      chr8  33591330  33600023   8694      -  ENSG00000133878.9 protein_coding
ENSG00000112837.17     chr6  84687351  84764598  77248      - ENSG00000112837.17 protein_coding
ENSG00000125730.16    chr19   6677704   6730562  52859      - ENSG00000125730.16 protein_coding
ENSG00000144810.16     chr3  99638475  99799226 160752      + ENSG00000144810.16 protein_coding
ENSG00000112562.18     chr6 168441151 168673445 232295      + ENSG00000112562.18 protein_coding
ENSG00000118137.9     chr11 116835751 116837950   2200      -  ENSG00000118137.9 protein_coding
ENSG00000101335.10    chr20  36541497  36551447   9951      + ENSG00000101335.10 protein_coding
ENSG00000156113.23    chr10  76869601  77638369 768769      - ENSG00000156113.23 protein_coding
ENSG00000165118.15     chr9  83938311  83956986  18676      - ENSG00000165118.15 protein_coding
                   gene_name length     logFC    logCPM         F       PValue          FDR
ENSG00000065609.14    SNAP91   6651 -4.298482 5.7855158 1244.7901 4.457146e-13 6.193650e-09
ENSG00000133878.9     DUSP26   1971 -6.406627 0.9996001  320.3749 8.859485e-11 6.155570e-07
ENSG00000112837.17     TBX18   8654  4.614608 1.3852329  463.1117 1.329742e-10 6.159367e-07
ENSG00000125730.16        C3   8328  7.404154 1.9905031  222.5631 7.853570e-10 2.728330e-06
ENSG00000144810.16    COL8A1   9408  1.897573 4.2552562  221.7729 7.221342e-09 2.006955e-05
ENSG00000112562.18     SMOC2   4032  2.975471 2.9960180  214.3021 8.720084e-09 2.019572e-05
ENSG00000118137.9      APOA1   1424 -1.909907 6.8931539  201.6088 1.216484e-08 2.414894e-05
ENSG00000101335.10      MYL9   2808  2.208139 9.0902053  192.7708 1.553354e-08 2.553771e-05
ENSG00000156113.23    KCNMA1  35644  2.906831 4.0355987  190.5689 1.653997e-08 2.553771e-05
ENSG00000165118.15   C9orf64   2754  5.728040 1.8379059  176.1002 2.011799e-08 2.556070e-05

Model Testing

Can easily collect all genes into a manageable object

res_M <- fit %>% 
  glmQLFTest(coef = "treatmentM") %>%
  topTags(n = Inf) %>% 
  as.data.frame() %>% 
  as_tibble()
res_M

# A tibble: 13,896 × 14
   seqnames     start       end  width strand gene_id  gene_type gene_name length logFC logCPM     F
   <fct>        <int>     <int>  <int> <fct>  <chr>    <chr>     <chr>      <int> <dbl>  <dbl> <dbl>
 1 chr6      83552880  83709691 156812 -      ENSG000… protein_… SNAP91      6651 -4.30   5.79 1245.
 2 chr8      33591330  33600023   8694 -      ENSG000… protein_… DUSP26      1971 -6.41   1.00  320.
 3 chr6      84687351  84764598  77248 -      ENSG000… protein_… TBX18       8654  4.61   1.39  463.
 4 chr19      6677704   6730562  52859 -      ENSG000… protein_… C3          8328  7.40   1.99  223.
 5 chr3      99638475  99799226 160752 +      ENSG000… protein_… COL8A1      9408  1.90   4.26  222.
 6 chr6     168441151 168673445 232295 +      ENSG000… protein_… SMOC2       4032  2.98   3.00  214.
 7 chr11    116835751 116837950   2200 -      ENSG000… protein_… APOA1       1424 -1.91   6.89  202.
 8 chr20     36541497  36551447   9951 +      ENSG000… protein_… MYL9        2808  2.21   9.09  193.
 9 chr10     76869601  77638369 768769 -      ENSG000… protein_… KCNMA1     35644  2.91   4.04  191.
10 chr9      83938311  83956986  18676 -      ENSG000… protein_… C9orf64     2754  5.73   1.84  176.
# ℹ 13,886 more rows
# ℹ 2 more variables: PValue <dbl>, FDR <dbl>

Multiple Testing

Under H₀: ~1 in 20 random p-values will be < 0.05
With ~14,000 genes \(\implies\) about 700 expected
These would be Type 1 errors \(\implies\) false positives
- We say something is happening, when it’s just noise

. . .

The False Discovery Rate (FDR) controls Type 1 errors at a given rate
- Using \(\alpha = 0.05 \implies\) happy with 5% of results being errors
- Always just an estimate

sum(res_M$FDR < 0.05)

[1] 1409

Inspection of P-values

Under H₀ \(\implies\) \(p \sim \mathcal{U}(0, 1)\)
Under H_A no distribution \(\implies\) peak near zero

res_M %>% 
  ggplot(aes(PValue)) +
  geom_histogram(binwidth = 0.01, fill = "grey70", colour = "black") +
  scale_y_continuous(expand = expansion(c(0, 0.05)))

Click to show figure

Looks as expected

Volcano Plots

Plot developed during microarray analysis
Shows logFC against significance (\(p\)-values)

res_M %>% 
  ggplot(aes(logFC, -log10(PValue), colour = FDR < 0.05)) +
  geom_point() +
  geom_label_repel(
    aes(label = gene_name),
    data = . %>% 
      arrange(PValue) %>% 
      dplyr::slice(1:10),
    size = 3, show.legend = FALSE
  ) +
  scale_colour_manual(values = c("black", "red"))

Click to show figure

MA Plots

Two-colour microarray data introduced MA plots
Signal from two-colour arrays layered treat vs control in a single spot
- A for intensity averages
- M for mean intensity differences
Still commonly used and the language remains
Sometimes called MD (mean-difference) plots
- edgeR provides plotMD()
Show average expression on the x-axis and logFC on y-axis
- How highly expressed are any DE genes?

MA Plots

Often start with an exploration then customise for our data

res_M %>% 
  ggplot(aes(logCPM, logFC)) +
  geom_point(aes(colour = FDR < 0.05), alpha = 0.7) +
  scale_colour_manual(values = c("black", "red"))

Click to show figure

res_M %>% 
  arrange(desc(PValue)) %>% # Why do this?
  ggplot(aes(logCPM, logFC, colour = FDR < 0.05)) +
  geom_point(alpha = 0.7) +
  geom_smooth(se = FALSE, colour = 'royalblue3', method = "lm") +
  geom_label_repel(
    aes(label = gene_name),
    data = . %>% 
      dplyr::filter(FDR < 0.05) %>% 
      arrange(desc(abs(logFC))) %>% 
      dplyr::slice(1:10),
    size = 3, show.legend = FALSE
  ) +
  scale_colour_manual(values = c("black", "red"))

Click to show figure

Filtering of DE Genes

There is a tendency of the most extreme logFC to be in low expressed genes
The ratio of any number over a small number will \(\rightarrow \infty\)
If there are “too many” DE genes \(\implies\) should we filter results?

. . .

Early approaches were to filter on extreme logFC
Biased results towards low-expressed genes

Range-Based Hypothesis Testing

Traditional H₀ is \(\mu = 0\) against H_A \(\mu \neq 0\)
An alternative is to use a range-based H₀ (McCarthy and Smyth 2009)
Implemented in glmTreat()
Default range is FC > 20% in either direction

res_treatM <- fit %>% 
  glmTreat(coef = "treatmentM", lfc = log2(1.2)) %>% 
  topTags(n = Inf) %>% 
  as.data.frame() %>% 
  as_tibble()
sum(res_treatM$FDR < 0.05)

[1] 596

Range-Based Hypothesis Testing

res_treatM %>% 
  arrange(desc(PValue)) %>% # Why do this?
  ggplot(aes(logCPM, logFC, colour = FDR < 0.05)) +
  geom_point(alpha = 0.7) +
  geom_smooth(se = FALSE, colour = 'royalblue3', method = "lm") +
  geom_label_repel(
    aes(label = gene_name),
    data = . %>% 
      dplyr::filter(FDR < 0.05) %>% 
      arrange(desc(abs(logFC))) %>% 
      dplyr::slice(1:10),
    size = 3, show.legend = FALSE
  ) +
  scale_colour_manual(values = c("black", "red"))

Click to show figure

Clearly some high logFC genes had weak statistical support

Checking Individual Genes

Checking individual genes for changes in expression can be helpful
Choosing a small number helps with interpretability

cpm(dge_filter, log = TRUE) %>% 
  .[res_treatM$gene_id[1:9],] %>% 
  as_tibble(rownames = "gene_id") %>% 
  pivot_longer(-all_of("gene_id"), names_to = "id", values_to = "logCPM") %>% 
  left_join(dge$genes) %>% 
  left_join(dge$samples) %>%
  mutate(gene_name = fct_inorder(gene_name)) %>% 
  ggplot(aes(treatment, logCPM, fill = treatment)) +
  geom_boxplot() +
  facet_wrap(~gene_name, scales = "free_y") +
  scale_fill_brewer(palette = "Set1")

Click to show figure

Heatmaps Across Multiple Genes

Showing larger groups of genes can also be informative
Enables very basic clustering of genes

. . .

The package pheatmap is a common resource
Pre-dates the tidyverse \(\implies\) matrix and data.frame objects

Heatmaps Across Multiple Genes

Form a matrix from the top genes
Subtract the mean expression value
- This retains variability but shows direction of change
Set rownames to be the gene-names

mat <- cpm(dge_filter, log = TRUE)[res_treatM$gene_id[1:25],] 
mat <- mat - rowMeans(mat)
rownames(mat) <- setNames(genes$gene_name, genes$gene_id)[rownames(mat)]

. . .

Choosing too many genes can confuse rather than illustrate the results!!!

Heatmaps Across Multiple Genes

Groups can be set using column annotations (annotation_col)
- Requires a data.frame with rownames
Group colours can also be passed as a list

pheatmap(
  mat, 
  annotation_col = samples %>%
    dplyr::select(id, treatment) %>% 
    as.data.frame() %>% 
    column_to_rownames("id"),
  annotation_colors = list(
    treatment = setNames(
      RColorBrewer::brewer.pal(4, "Set1"), levels(samples$treatment)
    )
  ),
  cutree_rows = 4
)

Click to show figure

Every column in the annotation df will be added as an annotation

Using DESeq2

DESeq2

Another common package for RNASeq analysis is DESeq2 (Love, Huber, and Anders 2014)
Uses S4 object more interconnected to other Bioconductor classes
Different approach to dispersions
Different statistical testing
Feels a bit more like a “black-box” to me
- More default settings for a similar workflow \(\implies\) fewer manual steps
Is a very high-quality approach (as is edgeR)

DESeq2

library(DESeq2)
library(extraChIPs)
dds <- DESeqDataSetFromMatrix(
  dge_filter$counts, 
  colData = samples,
  rowData = dge_filter$genes %>% 
    makeGRangesFromDataFrame(keep.extra.columns = TRUE),
  design = ~ treatment
)
is(dds)

[1] "DESeqDataSet"               "RangedSummarizedExperiment" "SummarizedExperiment"      
[4] "RectangularData"            "Vector"                     "Annotated"                 
[7] "vector_OR_Vector"

Summarized Experiment Objects

SummarizedExperiment objects are a core Bioconductor class

getSlots("SummarizedExperiment")

            colData              assays               NAMES     elementMetadata            metadata 
        "DataFrame"    "Assays_OR_NULL" "character_OR_NULL"         "DataFrame"              "list"

. . .

Any number of assays can be added
- Usually matrices with identical dimensions
colData holds the sample-level metadata

Summarized Experiment Objects

RangedSummarizedExperiment objects are a subclass of SummarizedExperiment

getSlots("RangedSummarizedExperiment")

                     rowRanges                        colData                         assays 
"GenomicRanges_OR_GRangesList"                    "DataFrame"               "Assays_OR_NULL" 
                         NAMES                elementMetadata                       metadata 
           "character_OR_NULL"                    "DataFrame"                         "list"

. . .

All slots from a SummarizedExperiment
rowRanges allow metadata for each row of counts
- Usually a GRanges with optional mcols

DESeqDataSet Objects

Extends RangedSummarizedExperiment objects

getSlots("DESeqDataSet")

                        design             dispersionFunction                      rowRanges 
                         "ANY"                     "function" "GenomicRanges_OR_GRangesList" 
                       colData                         assays                          NAMES 
                   "DataFrame"               "Assays_OR_NULL"            "character_OR_NULL" 
               elementMetadata                       metadata 
                   "DataFrame"                         "list"

. . .

Includes a design matrix & dispersionFunction

DESeqDataSet Objects

Multiple data access method exist
- colData(), rowData(), rowRanges()
- assay(), counts()

colData(dds)
rowRanges(dds)
rowData(dds)
assay(dds, "counts") %>% head()
counts(dds) %>% head()

DESeqDataSet Objects

All functions & methods written for parent classes work with a DESeqDataSet
tidyomics specifically developed for SummarizedExperiment objects
Numerous functions & methods for SummarizedExperiment objects

. . .

Let’s look at a few from extraChIPs
First add a logCPM assay

assay(dds, "logCPM") <- cpm(dds, log = TRUE)
assays(dds)

List of length 2
names(2): counts logCPM

I know the methods in extraChIPs best given I wrote them. Sorry…

DESeqDataSet Objects

plotAssayPCA(
  dds, assay = "logCPM", 
  colour = "treatment", label = "id"
) +
  labs(colour = "Treatment") +
  scale_colour_brewer(palette = "Set1")

DESeqDataSet Objects

plotAssayPCA(
  dds, assay = "logCPM", 
  colour = "treatment", label = "id",
  pc_x = 2, pc_y = 3 # Change the PCs
) +
  labs(colour = "Treatment") +
  scale_colour_brewer(palette = "Set1")

DESeqDataSet Objects

plotAssayDensities(
  dds, "logCPM", colour = "treatment"
  ) +
  labs(colour = "Treatment") +
  scale_colour_brewer(palette = "Set1")

DESeqDataSet Objects

plotAssayDensities(
  dds, colour = "treatment", trans = "log1p"
  ) +
  labs(colour = "Treatment") +
  scale_colour_brewer(palette = "Set1")

DESeqDataSet Objects

plotAssayRle(
  dds, "logCPM", 
  fill = "treatment", by_x = "replicate"
) +
  facet_wrap(
    ~treatment, nrow = 1, scales = "free_x"
  ) +
  labs(fill = "Treatment") +
  scale_fill_brewer(palette = "Set1")

## RLE shows if any sample shows consistently
## higher or lower counts

This makes another excellent QC check
Usually fine, but a good tool to know

DGE Analysis

The function DESeq() wraps multiple steps
Normalisation factors \(\equiv\) sizeFactors added to colData
- Uses the RLE method (very similar to TMM)
Additional assays will be added along with rowData columns
Testing is performed using the Wald Test

dds <- DESeq(dds)
dds

class: DESeqDataSet 
dim: 13896 12 
metadata(1): version
assays(5): counts logCPM mu H cooks
rownames(13896): ENSG00000000003.14 ENSG00000000005.6 ... ENSG00000288022.1
  ENSG00000288066.1
rowData names(34): gene_id gene_type ... deviance maxCooks
colnames(12): 1_control 6_control ... 9_M 14_M
colData names(5): id replicate treatment group sizeFactor

Dispersions

# Dispersions are also moderated by `DESeq2`
plotDispEsts(dds)

DGE Results

resultsNames(dds)

[1] "Intercept"                 "treatment_M_vs_control"    "treatment_Nsp6_vs_control"
[4] "treatment_Nsp8_vs_control"

res_deseq2_M <- results(dds, name = "treatment_M_vs_control")
res_deseq2_M

log2 fold change (MLE): treatment M vs control 
Wald test p-value: treatment M vs control 
DataFrame with 13896 rows and 6 columns
                    baseMean log2FoldChange     lfcSE       stat      pvalue        padj
                   <numeric>      <numeric> <numeric>  <numeric>   <numeric>   <numeric>
ENSG00000000003.14 4918.2962     -0.0734152 0.0548639  -1.338132 1.80853e-01 0.444613885
ENSG00000000005.6    80.3602     -0.8692770 0.3071633  -2.830016 4.65457e-03 0.036358343
ENSG00000000419.12 1710.4219      0.3642342 0.0843509   4.318081 1.57392e-05 0.000353798
ENSG00000000457.14  551.4245     -0.0386804 0.1648748  -0.234605 8.14516e-01 0.923290340
ENSG00000000460.17  190.9980      0.1536008 0.1744186   0.880645 3.78510e-01 0.650754388
...                      ...            ...       ...        ...         ...         ...
ENSG00000287966.1    43.9669     -0.5843052  0.448175 -1.3037444  0.19232074   0.4593943
ENSG00000287978.1    75.1925      0.0260297  0.412190  0.0631498  0.94964723   0.9831753
ENSG00000288012.1    32.8771     -0.2789995  0.462572 -0.6031481  0.54641019   0.7777428
ENSG00000288022.1    36.4753     -0.8772781  0.328395 -2.6714106  0.00755332   0.0523458
ENSG00000288066.1   438.5152      0.0771334  0.210274  0.3668230  0.71375102   0.8722577

DGE Results

This analysis gave 1,972 DE genes
To sort by significance

res_deseq2_M %>% 
  as_tibble(rownames = "gene_id") %>% 
  left_join(as_tibble(genes)) %>% 
  arrange(pvalue) %>% 
  dplyr::select(range, starts_with("gene"), everything())

# A tibble: 13,896 × 11
   range gene_id gene_type gene_name baseMean log2FoldChange  lfcSE  stat    pvalue      padj length
   <chr> <chr>   <chr>     <chr>        <dbl>          <dbl>  <dbl> <dbl>     <dbl>     <dbl>  <int>
 1 chr6… ENSG00… protein_… SNAP91      2097.           -4.29 0.123  -34.9 2.49e-267 3.45e-263   6651
 2 chr3… ENSG00… protein_… COL8A1       727.            1.91 0.136   14.0 7.91e- 45 5.49e- 41   9408
 3 chr6… ENSG00… protein_… SMOC2        302.            2.98 0.215   13.9 1.21e- 43 5.61e- 40   4032
 4 chr1… ENSG00… protein_… KCNMA1       624.            2.91 0.212   13.8 3.59e- 43 1.25e- 39  35644
 5 chr2… ENSG00… protein_… MYL9       20813.            2.22 0.162   13.7 1.13e- 42 3.14e- 39   2808
 6 chr6… ENSG00… protein_… TBX18         97.4           4.65 0.340   13.7 1.41e- 42 3.26e- 39   8654
 7 chr1… ENSG00… protein_… APOA1       4526.           -1.90 0.143  -13.3 3.88e- 40 7.70e- 37   1424
 8 chr9… ENSG00… protein_… BNC2         257.            3.50 0.265   13.2 5.34e- 40 9.28e- 37  15323
 9 chr2… ENSG00… protein_… ADA2         734.           -1.20 0.0960 -12.5 7.89e- 36 1.22e- 32   8704
10 chr6… ENSG00… protein_… COL21A1      343.           -2.37 0.198  -12.0 5.38e- 33 7.48e- 30   6912
# ℹ 13,886 more rows

This takes advantage of as_tibble() twice. Once to convert a DataFrame, with the second to convert a GRanges

Visualising Results

plotMA(res_deseq2_M)

Filtering Results

The approach taken by DESeq2 is different from edgeR’s range-based H₀
Shrinks the estimates of logFC (Zhu, Ibrahim, and Love 2019)
Returns s-values (false sign or small)

res_lfcshrink_M <- lfcShrink(
  dds, "treatment_M_vs_control", lfcThreshold = log2(1.2)
)
res_lfcshrink_M

log2 fold change (MAP): treatment M vs control 
 
DataFrame with 13896 rows and 4 columns
                    baseMean log2FoldChange     lfcSE    svalue
                   <numeric>      <numeric> <numeric> <numeric>
ENSG00000000003.14 4918.2962     -0.0675130 0.0529894 0.7792084
ENSG00000000005.6    80.3602     -0.6065355 0.3664508 0.0413765
ENSG00000000419.12 1710.4219      0.3378204 0.0851769 0.0458250
ENSG00000000457.14  551.4245     -0.0192876 0.1236106 0.6995773
ENSG00000000460.17  190.9980      0.0854692 0.1358403 0.5657148
...                      ...            ...       ...       ...
ENSG00000287966.1    43.9669    -0.09359097  0.200388 0.4204071
ENSG00000287978.1    75.1925     0.00454662  0.168958 0.6258674
ENSG00000288012.1    32.8771    -0.03896311  0.176866 0.5545518
ENSG00000288022.1    36.4753    -0.56074752  0.410878 0.0593301
ENSG00000288066.1   438.5152     0.03309121  0.140818 0.6493741

Filtering Results

This has reduced the original 1,972 DE genes to 1,270

res_lfcshrink_M %>% 
  as_tibble(rownames = "gene_id") %>% 
  left_join(as_tibble(genes)) %>% 
  arrange(svalue) %>% 
  dplyr::filter(svalue < 0.05) %>% 
  dplyr::select(range, starts_with("gene"), everything())

# A tibble: 1,270 × 9
   range                  gene_id gene_type gene_name baseMean log2FoldChange  lfcSE   svalue length
   <chr>                  <chr>   <chr>     <chr>        <dbl>          <dbl>  <dbl>    <dbl>  <int>
 1 chr6:83552880-8370969… ENSG00… protein_… SNAP91      2097.          -4.28  0.123  0          6651
 2 chr22:17178790-172582… ENSG00… protein_… ADA2         734.          -1.18  0.0966 0          8704
 3 chr11:116835751-11683… ENSG00… protein_… APOA1       4526.          -1.88  0.144  0          1424
 4 chr6:56056590-5639409… ENSG00… protein_… COL21A1      343.          -2.33  0.199  0          6912
 5 chr8:33591330-3360002… ENSG00… protein_… DUSP26        73.3         -6.34  0.597  0          1971
 6 chr9:128255829-128275… ENSG00… protein_… GOLGA2      1615.          -0.942 0.0817 0          6961
 7 chr6:84687351-8476459… ENSG00… protein_… TBX18         97.4          4.60  0.338  9.79e-39   8654
 8 chr6:168441151-168673… ENSG00… protein_… SMOC2        302.           2.95  0.216  1.28e-36   4032
 9 chr10:76869601-776383… ENSG00… protein_… KCNMA1       624.           2.88  0.212  5.28e-36  35644
10 chr9:16409503-1687084… ENSG00… protein_… BNC2         257.           3.46  0.266  1.29e-34  15323
# ℹ 1,260 more rows

Comparison of Approaches

Using the most conservative analyses
Genes along zero for DESeq2 were a result of shrinkage

Export Results

Save the DGEList using the following

write_rds(dge_filter, here::here("data/dge_filter.rds"))

Save the edgeR results using the following

write_rds(res_treatM, here::here("data/res_treatM.rds"))

Conclusion

DESeq2 data structures integrate nicely with other Bioconductor packages
edgeR can also handle SummarizedExperiment objects (mostly)
No right or wrong method or data classes to use
- Results were still broadly consistent
DESeq2 also has many options for parameter/model choice not covered here

. . .

SummarizedExperiment objects share much with SingleCellExperiment objects
Have helped with multiple analyses & seem preferable to Seurat (IMHO)
- Seurat is everywhere but I spend a lot of time debugging for people

References

Liao, Yang, Gordon K Smyth, and Wei Shi. 2014. “featureCounts: An Efficient General Purpose Program for Assigning Sequence Reads to Genomic Features.” Bioinformatics 30 (7): 923–30.

Liu, Juli, Shiyong Wu, Yucheng Zhang, Cheng Wang, Sheng Liu, Jun Wan, and Lei Yang. 2023. “SARS-CoV-2 Viral Genes Nsp6, Nsp8, and M Compromise Cellular ATP Levels to Impair Survival and Function of Human Pluripotent Stem Cell-Derived Cardiomyocytes.” Stem Cell Res. Ther. 14 (1): 249.

Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome Biology 15: 550. https://doi.org/10.1186/s13059-014-0550-8.

Lund, Steven P, Dan Nettleton, Davis J McCarthy, and Gordon K Smyth. 2012. “Detecting Differential Expression in RNA-sequence Data Using Quasi-Likelihood with Shrunken Dispersion Estimates.” Stat. Appl. Genet. Mol. Biol. 11 (5).

McCarthy, Davis J, and Gordon K Smyth. 2009. “Testing Significance Relative to a Fold-Change Threshold Is a TREAT.” Bioinformatics 25 (6): 765–71.

Robinson, Mark D, Davis J McCarthy, and Gordon K Smyth. 2010. “edgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40. https://doi.org/10.1093/bioinformatics/btp616.

Robinson, Mark D, and Alicia Oshlack. 2010. “A Scaling Normalization Method for Differential Expression Analysis of RNA-seq Data.” Genome Biol. 11 (3): R25.

Zhu, Anqi, Joseph G Ibrahim, and Michael I Love. 2019. “Heavy-Tailed Prior Distributions for Sequence Count Data: Removing the Noise and Preserving Large Differences.” Bioinformatics 35 (12): 2084–92.