References and Ranges

RAdelaide 2024

Author
Affiliation

Dr Stevie Pederson

Black Ochre Data Labs
Telethon Kids Institute

Published

July 11, 2024

Reference Sequences

Fasta Files

  • Used for entire genomes or small, single sequences
  • Sequence name + optional “free-form” metadata begins with >
  • Followed by sequence with no prefix
    • Single sequences commonly wrapped on multiple lines
  • Sequence can be DNA, RNA or Protein
>ERCC-00002
TCCAGATTACTTCCATTTCCGCCCAAGCTGCTCACAGTATACGGGCGTCG
GCATCCAGACCGTCGGCTGATCGTGGTTTTACTAGGCTAGACTAGCGTAC
GAGCACTATGGTCAGTAATTCCTGGAGGAATAGGTACCAAGAAAAAAACG
AACCTTTGGGTTCCAGAGCTGTACGGTCGCACTGAACTCGGATAGGTCTC
AGAAAAACGAAATATAGGCTTACGGTAGGTCCGAATGGCACAAAGCTTGT
TCCGTTAGCTGGCATAAGATTCCATGCCTAGATGTGATACACGTTTCTGG
AAACTGCCTCGTCATGCGACTGTTCCCCGGGGTCAGGGCCGCTGGTATTT
GCTGTAAAGAGGGGCGTTGAGTCCGTCCGACTTCACTGCCCCCTTTCAGC
CTTTTGGGTCCTGTATCCCAATTCTCAGAGGTCCCGCCGTACGCTGAGGA

Sometimes use a standardised set of IUPAC codes

Biostrings

  • The Bioconductor package for handling these is Biostrings
  • Depending on the sequence structure
    \(\implies\) DNAString, DNAStringSet or DNAStringSetList
library(Biostrings)
DNAString("AACGT")
5-letter DNAString object
seq: AACGT
DNAStringSet("AACGT")
DNAStringSet object of length 1:
    width seq
[1]     5 AACGT
DNAStringSetList("AACGT")
DNAStringSetList of length 1
[[1]] AACGT

By have a new class (not just characters) many operations become simplified

Biostrings

  • General class for RNA or Proteins \(\implies\) XString/Set/List
    • RNAString* or AAString*
  • XStringSet objects are lists of XString objects
    • XStringSetList objects are lists of XStringSet objects
DNAStringSet("AACGT") |> unlist()
5-letter DNAString object
seq: AACGT
DNAStringSetList("AACGT") |> unlist()
DNAStringSet object of length 1:
    width seq
[1]     5 AACGT

Biostrings

  • If working with complete transcripts \(\implies\) XStringSet
  • When needing exon structure within transcripts \(\implies\) XStringSetList

Biostrings

  • Using some example ERCC spike-in data
    • Short, known sequences sometimes used for RNA-Seq QC + Normalisation
ercc <- readDNAStringSet("data/ERCC92.fa.gz")
ercc
DNAStringSet object of length 92:
     width seq                                                                  names               
 [1]  1061 TCCAGATTACTTCCATTTCCGCCCAAGCTGCTC...TTACCCTTAAAAAAAAAAAAAAAAAAAAAAAA ERCC-00002
 [2]  1023 CAGCAGCGATTAAGGCAGAGGCGTTTGTATCTG...CACAAGATAAAAAAAAAAAAAAAAAAAAAAAA ERCC-00003
 [3]   523 TCTTGCTTCAACAATAACGTCTCTTTCAGAAGG...AGGCACATAAAAAAAAAAAAAAAAAAAAAAAA ERCC-00004
 [4]   984 CAATGATAGGCTAGTCTCGCGCAGTACATGGTA...ACCGATATAAAAAAAAAAAAAAAAAAAAAAAA ERCC-00009
 [5]   994 CGAGAGATGTTTGTAGGTGCGGAATGTGTGCGG...CTATGGACAAAAAAAAAAAAAAAAAAAAAAAA ERCC-00012
 ...   ... ...
[88]  1022 AGGAGCTCCAGTAGTTTTCCCCTCAAAAATTCC...TTGGTGGTTAAAAAAAAAAAAAAAAAAAAAAA ERCC-00164
[89]   872 GATATGCGTTACGTGAGTCTGATAGCAGTTCAC...CGTATGTGAAAAAAAAAAAAAAAAAAAAAAAA ERCC-00165
[90]  1024 CCAATGAACTCAGCTATTCTTCTTAACAAATAA...ATGAGGTAAAAAAAAAAAAAAAAAAAAAAAAA ERCC-00168
[91]  1023 TATTGGTGGAGGGGCACAAGTTGCTGAAGTTGC...CTTAGGTTAAAAAAAAAAAAAAAAAAAAAAAA ERCC-00170
[92]   505 CTGGAGATTGTCTCGTACGGTTAAGAGCCTCCG...AGAACGAGAAAAAAAAAAAAAAAAAAAAAAAA ERCC-00171

Biostrings

length(ercc)
[1] 92
width(ercc) |> head()
[1] 1061 1023  523  984  994  808
names(ercc) |> head()
[1] "ERCC-00002" "ERCC-00003" "ERCC-00004" "ERCC-00009" "ERCC-00012" "ERCC-00013"
letterFrequency(ercc, "GC", as.prob = TRUE) |> head()
           G|C
[1,] 0.5136664
[2,] 0.3264907
[3,] 0.3441683
[4,] 0.4725610
[5,] 0.5090543
[6,] 0.4282178

Biostrings

reverseComplement(ercc)
DNAStringSet object of length 92:
     width seq                                                                  names               
 [1]  1061 TTTTTTTTTTTTTTTTTTTTTTTTAAGGGTAAA...AGCAGCTTGGGCGGAAATGGAAGTAATCTGGA ERCC-00002
 [2]  1023 TTTTTTTTTTTTTTTTTTTTTTTTATCTTGTGG...AGATACAAACGCCTCTGCCTTAATCGCTGCTG ERCC-00003
 [3]   523 TTTTTTTTTTTTTTTTTTTTTTTTATGTGCCTG...CTTCTGAAAGAGACGTTATTGTTGAAGCAAGA ERCC-00004
 [4]   984 TTTTTTTTTTTTTTTTTTTTTTTTATATCGGTA...ACCATGTACTGCGCGAGACTAGCCTATCATTG ERCC-00009
 [5]   994 TTTTTTTTTTTTTTTTTTTTTTTTGTCCATAGT...CGCACACATTCCGCACCTACAAACATCTCTCG ERCC-00012
 ...   ... ...
[88]  1022 TTTTTTTTTTTTTTTTTTTTTTTAACCACCAAA...GAATTTTTGAGGGGAAAACTACTGGAGCTCCT ERCC-00164
[89]   872 TTTTTTTTTTTTTTTTTTTTTTTTCACATACGC...TGAACTGCTATCAGACTCACGTAACGCATATC ERCC-00165
[90]  1024 TTTTTTTTTTTTTTTTTTTTTTTTTACCTCATT...TATTTGTTAAGAAGAATAGCTGAGTTCATTGG ERCC-00168
[91]  1023 TTTTTTTTTTTTTTTTTTTTTTTTAACCTAAGA...CAACTTCAGCAACTTGTGCCCCTCCACCAATA ERCC-00170
[92]   505 TTTTTTTTTTTTTTTTTTTTTTTTCTCGTTCTT...GGAGGCTCTTAACCGTACGAGACAATCTCCAG ERCC-00171

Biostrings

  • XStringSet objects are easy to subset
subseq(ercc, 1, width = 10)
DNAStringSet object of length 92:
     width seq                                                                  names               
 [1]    10 TCCAGATTAC                                                           ERCC-00002
 [2]    10 CAGCAGCGAT                                                           ERCC-00003
 [3]    10 TCTTGCTTCA                                                           ERCC-00004
 [4]    10 CAATGATAGG                                                           ERCC-00009
 [5]    10 CGAGAGATGT                                                           ERCC-00012
 ...   ... ...
[88]    10 AGGAGCTCCA                                                           ERCC-00164
[89]    10 GATATGCGTT                                                           ERCC-00165
[90]    10 CCAATGAACT                                                           ERCC-00168
[91]    10 TATTGGTGGA                                                           ERCC-00170
[92]    10 CTGGAGATTG                                                           ERCC-00171

Biostrings

  • Sliding windows are simple & fast on XString objects
Views(ercc[[1]], start = 1:3, width = 10) 
Views on a 1061-letter DNAString subject
subject: TCCAGATTACTTCCATTTCCGCCCAAGCTGCTCACAGTATACGG...GGTGCTTGCGTTTTACCCTTAAAAAAAAAAAAAAAAAAAAAAAA
views:
      start end width
  [1]     1  10    10 [TCCAGATTAC]
  [2]     2  11    10 [CCAGATTACT]
  [3]     3  12    10 [CAGATTACTT]
Views(ercc[[1]], start = 1:3, width = 10) |> letterFrequency("GC", as.prob = TRUE)
     G|C
[1,] 0.4
[2,] 0.4
[3,] 0.3

BSgenome Objects

  • Biostrings are used to provide entire genomes \(\implies\) BSgenome objects
library(BSgenome.Hsapiens.UCSC.hg38)
hg38 <- BSgenome.Hsapiens.UCSC.hg38
hg38
| BSgenome object for Human
| - organism: Homo sapiens
| - provider: UCSC
| - genome: hg38
| - release date: 2023/01/31
| - 711 sequence(s):
|     chr1                    chr2                    chr3                   
|     chr4                    chr5                    chr6                   
|     chr7                    chr8                    chr9                   
|     chr10                   chr11                   chr12                  
|     chr13                   chr14                   chr15                  
|     ...                     ...                     ...                    
|     chr19_KV575256v1_alt    chr19_KV575257v1_alt    chr19_KV575258v1_alt   
|     chr19_KV575259v1_alt    chr19_KV575260v1_alt    chr19_MU273387v1_alt   
|     chr22_KN196485v1_alt    chr22_KN196486v1_alt    chr22_KQ458387v1_alt   
|     chr22_KQ458388v1_alt    chr22_KQ759761v1_alt    chrX_KV766199v1_alt    
|     chrX_MU273395v1_alt     chrX_MU273396v1_alt     chrX_MU273397v1_alt    
| 
| Tips: call 'seqnames()' on the object to get all the sequence names, call 'seqinfo()' to get the
| full sequence info, use the '$' or '[[' operator to access a given sequence, see '?BSgenome' for
| more information.

BSgenome Objects

  • Genomes are ‘promises’ cached on your HDD
    • Only the metadata is actually loaded
  • Can be directly loaded into memory if needed
chrM <- getSeq(hg38, "chrM")
chrM
16569-letter DNAString object
seq: GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCAT...AAGCCTAAATAGCCCACACGTTCCCCTTAAATAAGACATCACGATG

BSgenome

  • seqinfo() objects provide sequence lengths & build information
sq <- seqinfo(hg38)
sq
Seqinfo object with 711 sequences (1 circular) from hg38 genome:
  seqnames             seqlengths isCircular genome
  chr1                  248956422      FALSE   hg38
  chr2                  242193529      FALSE   hg38
  chr3                  198295559      FALSE   hg38
  chr4                  190214555      FALSE   hg38
  chr5                  181538259      FALSE   hg38
  ...                         ...        ...    ...
  chr22_KQ759761v1_alt     145162      FALSE   hg38
  chrX_KV766199v1_alt      188004      FALSE   hg38
  chrX_MU273395v1_alt      619716      FALSE   hg38
  chrX_MU273396v1_alt      294119      FALSE   hg38
  chrX_MU273397v1_alt      330493      FALSE   hg38

Genomic Ranges

Genomic Ranges

  • We can work with specific regions with a genome using GRanges objects
    • The Bioconductor equivalent of a bed file
GRanges("chr1:10001-11000")
GRanges object with 1 range and 0 metadata columns:
      seqnames      ranges strand
         <Rle>   <IRanges>  <Rle>
  [1]     chr1 10001-11000      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Genomic Ranges

  • Best practice is to define which genome the range has been drawn from
GRanges("chr1:10001-11000", seqinfo = sq)
GRanges object with 1 range and 0 metadata columns:
      seqnames      ranges strand
         <Rle>   <IRanges>  <Rle>
  [1]     chr1 10001-11000      *
  -------
  seqinfo: 711 sequences (1 circular) from hg38 genome

Bed Files

  • bed files are a common genomics format
  • Primarily contain ranges but can include some additional fields
chrY    10000   2781479 .   0   .
chrY    56887902    57217415    .   0   .

. . .

  • The columns are defined as
    chrom, chromStart, chromEnd, name, score & strand
  • Empty fields (name & strand) are provided as .
  • Additional columns for plotting parameters can also be passed

Bed Files

  • These are simple tsv files (sometimes space-separated)
    • We could use read_tsv() if we really want
    • Would need to use:
      read_tsv("path/to/bed", col_names = c("chrom", "chromStart", "chromEnd", "name", "score", "strand"))

. . .

  • The package rtracklayer is an all-purpose -omics file import package
    • Handles 0/1-based co-ordinates easily
    • Will parse files as GRanges

Importing Bed Files

library(rtracklayer)
parY <- import.bed("data/parY.bed", seqinfo = sq)
parY
GRanges object with 2 ranges and 2 metadata columns:
      seqnames            ranges strand |        name     score
         <Rle>         <IRanges>  <Rle> | <character> <numeric>
  [1]     chrY     10001-2781479      * |        <NA>         0
  [2]     chrY 56887903-57217415      * |        <NA>         0
  -------
  seqinfo: 711 sequences (1 circular) from hg38 genome

. . .

  • Any unspecified strand information \(\implies\) *

Genomic Ranges

  • Note that we have an mcols element
    • DataFrame with metadata about each range
mcols(parY)
DataFrame with 2 rows and 2 columns
         name     score
  <character> <numeric>
1          NA         0
2          NA         0

. . .

  • Can also separate the ranges
granges(parY)
GRanges object with 2 ranges and 0 metadata columns:
      seqnames            ranges strand
         <Rle>         <IRanges>  <Rle>
  [1]     chrY     10001-2781479      *
  [2]     chrY 56887903-57217415      *
  -------
  seqinfo: 711 sequences (1 circular) from hg38 genome

The plyranges package

  • plyranges provides dplyr-like operations on GRanges objects
library(plyranges)
parY %>% mutate(name = c("PAR1", "PAR2"))
GRanges object with 2 ranges and 2 metadata columns:
      seqnames            ranges strand |        name     score
         <Rle>         <IRanges>  <Rle> | <character> <numeric>
  [1]     chrY     10001-2781479      * |        PAR1         0
  [2]     chrY 56887903-57217415      * |        PAR2         0
  -------
  seqinfo: 711 sequences (1 circular) from hg38 genome
parY %>% select(score)
GRanges object with 2 ranges and 1 metadata column:
      seqnames            ranges strand |     score
         <Rle>         <IRanges>  <Rle> | <numeric>
  [1]     chrY     10001-2781479      * |         0
  [2]     chrY 56887903-57217415      * |         0
  -------
  seqinfo: 711 sequences (1 circular) from hg38 genome
  • Only for GRanges \(\implies\) not DataFrame

Genomic Ranges

  • This makes extracting the genomic sequence trivial
getSeq(hg38, parY)
DNAStringSet object of length 2:
      width seq
[1] 2771479 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTCTGAA...TCCTCAGGACTCAGACCTTAGTTATAGATTAAAAGAAGTTAA
[2]  329513 ATGTGTTTTTTGGCTGCATAAATGTCTTCTTTTGAGAAGTGTC...GGGTGTGGGTGTGGGTGTGTGGGTGTGGTGTGTGGGTGTGGT

Gene And Transcript Annotations

  • Gene and transcript annotations use a co-ordinate-based system
  • External files usually in gtf format
    • Columns 1-8: chrom, source, type, start, end, score, strand, phase
    • Column 9 is ‘;’ separated and flexible

. . .

  • Passed to R as a GRanges object with mcols
  • rtracklayer provides import.gff()
    • gtf files are based on v2 specifications of the gff format

Gene And Transcript Annotations

gtf <- import.gff("data/gencode.v41.chrY.gtf.gz", genome = "hg38")
gtf

. . .

  • plyranges can be very useful for removing redundant columns
  • Individual columns can be extracted using $
gtf$type %>% levels()
[1] "gene"        "transcript"  "exon"        "CDS"         "start_codon" "stop_codon" 
[7] "UTR"        
  • These are the types of features in a standard gtf
    • gene and transcript annotations cover the entire range
    • Individual exons within each transcripts are exons

Notice the types of features provided in a standard gtf

Genomic Ranges

This structure highlights a few unexpected features of GRanges

  1. There are no rownames or colnames
    • Ranges can have names()
    • mcols() have colnames
  2. nrow and ncol only work on mcols NOT the GRanges object
  3. The number of ranges is obtained using length()
length(gtf)
[1] 9955

Gene And Transcript Annotations

  • If we want to find the sequence of a gene or transcript
    \(\implies\) can’t use these features
subset(gtf, type == "gene")
subset(gtf, type == "transcript")

. . .

  • Instead use exons
    • The same ranges may be annotated as multiple exons
gtf %>% 
  filter(type == "exon") %>% 
  filter_by_overlaps(GRanges("chrY:284167-284314"))
  • This is a plyranges approach to the question
    • Still, note these are identical exons annotated for 10 transcripts

Gene And Transcript Annotations

  • To find every exon associated with a gene:
  1. Subset our data to exons only
  2. Split into a GRangesList using the gene_id
  3. Merge overlapping exons
gtf %>% 
  filter(type == "exon") %>% 
  splitAsList(.$gene_id)

This is the first time we’ve seen the . inactions using %>%

GRangesList Objects

  • GRangesList objects strictly have a GRanges object as every element
  • Often obtained by splitting a GRanges object based on a single mcol
  • Must have identical seqinfo elements
    • i.e. from the same genome
  • Must have mcols with common names and types
  • Very efficient way to perform complex operations

GRangesList Objects

  • The function reduce merges overlapping ranges \(\implies\) mcols are lost
  • Will handle all elements of a GRangesList
gtf %>% 
  filter(type == "exon") %>% 
  splitAsList(.$gene_id) %>% 
  GenomicRanges::reduce()

. . .

  • Notice that the original 107 exons for the first gene are now 15

Maintaining mcols is enabled using reduceMC from extraChIPs but it’s needed less often than we might think

Gene And Transcript Annotations

  • The above strategy could be used to return:
    • the overall coding length for a gene
    • the overall GC content
    • a “consensus sequence” for a gene
      \(\implies\) not guaranteed to represent what’s expressed

. . .

  • To obtain transcript sequences we split by transcript_id
    • No need to reduce overlapping exons

Gene And Transcript Annotations

trans_grl <- gtf %>% 
  filter(type == "exon") %>% 
  splitAsList(.$transcript_id) 
trans_grl

. . .

getSeq(hg38, trans_grl)
DNAStringSetList of length 1018
[["ENST00000155093.8"]] AGTGCTCCAGAGAAAGGCCGTCCTGCAGCACCCGCCGCTGTCGCCGACCGCCCGCACATCCGTCGGGTGAGTC...
[["ENST00000215473.7"]] CTGGTGGTCCAGTACCTCCAAAGATATGGAATACACTCCTGAAATATCCTGAAACCTTTTTTTTTTCAGAATC...
[["ENST00000244174.11_PAR_Y"]] GCTGTGCACCCAGAGATAGTTGGGTGACAAATCACCTCCAGGTTGGGGATGCCTCAGACTTGTGAT...
[["ENST00000250776.5"]] TGTCTGTCAGAGCTGTCAGCCTGCTTAAGCAGAGTAAAATGGTACAGGCAGTGCAGCCTGGTAGCGAGAAAAA...
[["ENST00000250784.13"]] CTCTTCCGTCGCAGAGTTTCGCCATG ...
[["ENST00000250805.5"]] TGTCTGTCAGAGCTGTCAGCCTGCTTAAGCAGAGTAAAATGGTACAGGCAGTGCAGCCTGGTAGCGAGAAAAA...
[["ENST00000250823.5"]] ATACACAGGGAGGCCAGGCAGCCTGGAGTTAGTCGACCGTTGCGAGACGTTGAGCTGCGGCAGATGAGTCCAA...
[["ENST00000250825.5"]] AGTCGACCGTTGCGAGACGTTGAGCTGCGGCAGATGAGTCCAAAGCCGAGAGCCTCGGGACCTCCGGCCAAGG...
[["ENST00000250831.6"]] ATGATTGGCCGCTGGAGGTAGGCGGGATTTCCGGGCACGGCTTCCGGCGTCCTTCCCTCTCAGGGTAGCTCCA...
[["ENST00000250838.6"]] GTAACAGGCAGGAAGAAAGCTTTCTGTACTACACCAGAGGGTTGGGGCTGTGGATTTAGCTACTCTCACCTGA...
...
<1008 more elements>

Gene And Transcript Annotations

  • Notice that each transcript retains the exon structure
getSeq(hg38, trans_grl)[[1]]
DNAStringSet object of length 8:
    width seq
[1]   389 AGTGCTCCAGAGAAAGGCCGTCCTGCAGCACCCGCCGCTGTCGC...GGGCACGGCACGGGGGCGAGAAGGCGAAGGCTGCAGGCGTGAG
[2]    89 GAGCTGTGACTAATGAGAATTAAAGGCCATGGATGAAGATGAATTTGAATTGCAGCCACAAGAGCCAAACTCATTTTTTGATGGAATAG
[3]   573 GAGCTGATGCTACACACATGGATGGTGATCAGATTGTTGTGGAA...GATAATGACAAAGCCAGCTGTGAGGACTACCTAATGATTTCGT
[4]   150 TGGATGATGCTGGCAAAATAGAACATGATGGTTCCACTGGAGTG...AAGGTGTACATTTTTAAAGCTGACCCTGGAGAAGATGACTTAG
[5]   144 GTGGAACTGTAGACATTGTGGAGAGTGAACCTGAAAATGATCAT...TATATGACTGTCAATGACTCTCAACAAGAAGATGAAGATTTAA
[6]   153 ATGTTGCTGAAATTGCTGATGAAGTTTATATGGAAGTGATCGTA...GAAATGAAAACCTTCGTACCAATTGCATGGGCAGCAGCTTATG
[7]   141 GTAATAATTCTGATGGAATTGAAAACCGGAATGGCACTGCAAGT...CCAAAGAAAAAGAGAAGACCTGATTCCAGGCAGTACCAAACAG
[8]  3697 CAATAATTATTGGCCCTGATGGTCATCCTTTGACTGTCTATCCT...AAAAATAATATTTTACTGTTAATAAATGTTTACTTGTATATGA

Gene And Transcript Annotations

  • A cleaner approach is provided by GenomicFeatures
library(GenomicFeatures)
extractTranscriptSeqs(hg38, trans_grl)
DNAStringSet object of length 1018:
       width seq                                                                names               
   [1]  5336 AGTGCTCCAGAGAAAGGCCGTCCTGCAGCACC...TTACTGTTAATAAATGTTTACTTGTATATGA ENST00000155093.8
   [2]  4595 CTGGTGGTCCAGTACCTCCAAAGATATGGAAT...AAGGTATTTTGAGCCCTTCAGAAGACATTCT ENST00000215473.7
   [3]  2038 GCTGTGCACCCAGAGATAGTTGGGTGACAAAT...TCTAGAGATTAAAAGGGGCTTGATGGCTGTT ENST00000244174.1...
   [4]   910 TGTCTGTCAGAGCTGTCAGCCTGCTTAAGCAG...AAAGACAATTAAACACTGGTATATTTCTGTT ENST00000250776.5
   [5]  1189 CTCTTCCGTCGCAGAGTTTCGCCATGGCCCGG...GGTAGAAATTAAATCCTGTGGCTGTAGGAAA ENST00000250784.13
   ...   ... ...
[1014]   534 GGCTTTGGGCAGAGGCAGCCAGGGCGCCGGAA...AAGGAGCAAATAAATATGATATGGGAGTGTG ENST00000700889.1
[1015]   600 GGCTTTGGGCAGAGGCAGCCAGGGCGCCGGAA...TAATGTTAATAAAGCTAAATGCAATTGCATA ENST00000700898.1
[1016]   676 GGCTTTGGGCAGAGGCAGCCAGGGCGCCGGAA...ATATTAAAATTAGTTTCTTAAATTTGTAAAA ENST00000700901.1
[1017]  1048 ACTCGCGGACGATCTCTGCTCGCCTTGGGCGC...ATAACTCGGAAAACAAAAACAAAAAAAAAAA ENST00000701084.1...
[1018]  1591 GGAGGTGGCTCGGGCCGCGCCGAGGGGACCAC...ATTCCAGAAAATAAAATATTTTTTGAAACCA ENST00000701359.1...

. . .

  • We could export this as a fasta file using writeXStringSet()
  • Might become a reference transcriptome (instead of genome)

AnnotationHub

Annotation Hub

  • Much processing is done on an HPC when running a sequencing workflow
  • Knowing standalone files & how to import into R is vital
  • Many annotations are also prepared as packages
  • The database of these packages is AnnotationHub
    • Doesn’t download packages, just provides a list

Annotation Hub

library(AnnotationHub)
## This will run for a minute or two the first time you run it
ah <- AnnotationHub()
ah
AnnotationHub with 71634 records
# snapshotDate(): 2024-04-30
# $dataprovider: Ensembl, BroadInstitute, UCSC, ftp://ftp.ncbi.nlm.nih.gov/g...
# $species: Homo sapiens, Mus musculus, Drosophila melanogaster, Rattus norv...
# $rdataclass: GRanges, TwoBitFile, BigWigFile, EnsDb, Rle, OrgDb, SQLiteFil...
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH5012"]]' 

             title                                             
  AH5012   | Chromosome Band                                   
  AH5013   | STS Markers                                       
  AH5014   | FISH Clones                                       
  AH5015   | Recomb Rate                                       
  AH5016   | ENCODE Pilot                                      
  ...        ...                                               
  AH117051 | Ensembl 112 EnsDb for Xiphophorus maculatus       
  AH117052 | Ensembl 112 EnsDb for Xenopus tropicalis          
  AH117053 | Ensembl 112 EnsDb for Zonotrichia albicollis      
  AH117054 | Ensembl 112 EnsDb for Zalophus californianus      
  AH117055 | Ensembl 112 EnsDb for Zosterops lateralis melanops
  • Data is available from multiple sources and in multiple formats
  • GRanges

Annotation Hub

  • The full metadata can be seen by accessing the mcols
mcols(ah)
DataFrame with 71634 rows and 15 columns
                          title dataprovider                species taxonomyid
                    <character>  <character>            <character>  <integer>
AH5012          Chromosome Band         UCSC           Homo sapiens       9606
AH5013              STS Markers         UCSC           Homo sapiens       9606
AH5014              FISH Clones         UCSC           Homo sapiens       9606
AH5015              Recomb Rate         UCSC           Homo sapiens       9606
AH5016             ENCODE Pilot         UCSC           Homo sapiens       9606
...                         ...          ...                    ...        ...
AH117051 Ensembl 112 EnsDb fo..      Ensembl  Xiphophorus maculatus       8083
AH117052 Ensembl 112 EnsDb fo..      Ensembl     Xenopus tropicalis       8364
AH117053 Ensembl 112 EnsDb fo..      Ensembl Zonotrichia albicollis      44394
AH117054 Ensembl 112 EnsDb fo..      Ensembl Zalophus californianus       9704
AH117055 Ensembl 112 EnsDb fo..      Ensembl Zosterops lateralis ..    1220523
                         genome            description coordinate_1_based             maintainer
                    <character>            <character>          <integer>            <character>
AH5012                     hg19 GRanges object from ..                  1 Marc Carlson <mcarls..
AH5013                     hg19 GRanges object from ..                  1 Marc Carlson <mcarls..
AH5014                     hg19 GRanges object from ..                  1 Marc Carlson <mcarls..
AH5015                     hg19 GRanges object from ..                  1 Marc Carlson <mcarls..
AH5016                     hg19 GRanges object from ..                  1 Marc Carlson <mcarls..
...                         ...                    ...                ...                    ...
AH117051   X_maculatus-5.0-male Gene and protein ann..                  1 Johannes Rainer <joh..
AH117052          UCB_Xtro_10.0 Gene and protein ann..                  1 Johannes Rainer <joh..
AH117053 Zonotrichia_albicoll.. Gene and protein ann..                  1 Johannes Rainer <joh..
AH117054           mZalCal1.pri Gene and protein ann..                  1 Johannes Rainer <joh..
AH117055            ASM128173v1 Gene and protein ann..                  1 Johannes Rainer <joh..
         rdatadateadded          preparerclass                                     tags  rdataclass
            <character>            <character>                                   <AsIs> <character>
AH5012       2013-03-26 UCSCFullTrackImportP..                  cytoBand,UCSC,track,...     GRanges
AH5013       2013-03-26 UCSCFullTrackImportP..                    stsMap,UCSC,track,...     GRanges
AH5014       2013-03-26 UCSCFullTrackImportP..                fishClones,UCSC,track,...     GRanges
AH5015       2013-03-26 UCSCFullTrackImportP..                recombRate,UCSC,track,...     GRanges
AH5016       2013-03-26 UCSCFullTrackImportP..             encodeRegions,UCSC,track,...     GRanges
...                 ...                    ...                                      ...         ...
AH117051     2024-04-30               AHEnsDbs 112,Annotation,AnnotationHubSoftware,...       EnsDb
AH117052     2024-04-30               AHEnsDbs 112,Annotation,AnnotationHubSoftware,...       EnsDb
AH117053     2024-04-30               AHEnsDbs 112,Annotation,AnnotationHubSoftware,...       EnsDb
AH117054     2024-04-30               AHEnsDbs 112,Annotation,AnnotationHubSoftware,...       EnsDb
AH117055     2024-04-30               AHEnsDbs 112,Annotation,AnnotationHubSoftware,...       EnsDb
                      rdatapath              sourceurl  sourcetype
                    <character>            <character> <character>
AH5012   goldenpath/hg19/data.. rtracklayer://hgdown..  UCSC track
AH5013   goldenpath/hg19/data.. rtracklayer://hgdown..  UCSC track
AH5014   goldenpath/hg19/data.. rtracklayer://hgdown..  UCSC track
AH5015   goldenpath/hg19/data.. rtracklayer://hgdown..  UCSC track
AH5016   goldenpath/hg19/data.. rtracklayer://hgdown..  UCSC track
...                         ...                    ...         ...
AH117051 AHEnsDbs/v112/EnsDb... http://www.ensembl.org     ensembl
AH117052 AHEnsDbs/v112/EnsDb... http://www.ensembl.org     ensembl
AH117053 AHEnsDbs/v112/EnsDb... http://www.ensembl.org     ensembl
AH117054 AHEnsDbs/v112/EnsDb... http://www.ensembl.org     ensembl
AH117055 AHEnsDbs/v112/EnsDb... http://www.ensembl.org     ensembl

Annotation Hub

  • Searching is easiest using subset() or query()
  • The following returns all EnsDb objects built on GRCh38 (hg38)
ah %>% 
  subset(genome == "GRCh38") %>%
  subset(rdataclass == "EnsDb") 
AnnotationHub with 27 records
# snapshotDate(): 2024-04-30
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH53211"]]' 

             title                             
  AH53211  | Ensembl 87 EnsDb for Homo Sapiens 
  AH53715  | Ensembl 88 EnsDb for Homo Sapiens 
  AH56681  | Ensembl 89 EnsDb for Homo Sapiens 
  AH57757  | Ensembl 90 EnsDb for Homo Sapiens 
  AH60773  | Ensembl 91 EnsDb for Homo Sapiens 
  ...        ...                               
  AH109336 | Ensembl 108 EnsDb for Homo sapiens
  AH109606 | Ensembl 109 EnsDb for Homo sapiens
  AH113665 | Ensembl 110 EnsDb for Homo sapiens
  AH116291 | Ensembl 111 EnsDb for Homo sapiens
  AH116860 | Ensembl 112 EnsDb for Homo sapiens

Annotation Hub

  • This behaves like a list \(\implies\) extract elements using [[]]
    • Will download an entire annotation set
ensdb <- ah[["AH116860"]]
ensdb
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.10
|Creation time: Mon Jun 17 16:14:46 2024
|ensembl_version: 112
|ensembl_host: localhost
|Organism: Homo sapiens
|taxonomy_id: 9606
|genome_build: GRCh38
|DBSCHEMAVERSION: 2.2
|common_name: human
|species: homo_sapiens
| No. of genes: 71935.
| No. of transcripts: 279860.
|Protein data available.

EnsDb Ojects

  • The first Bioconductor packages of this type were TxDb packages
    • Used UCSC annotations & genome builds
  • Contain all information about a given annotation build
  • EnsDb packages use Ensembl resources & the package ensembldb
  • Enormous data-mapping and information content
genes(ensdb)
transcripts(ensdb)
exonsBy(ensdb, by = "tx")

EnsDb Ojects

  • Can be highly beneficial for ensuring self-contained, reproducible workflows
    • No external files required

. . .

  • Notice difference in genome build to UCSC (hg38)
  • Slight difference in column names to Gencode
  • No TxDb or EnsDb packages available for Gencode gene annotations

Gencode Annotations

  • Are available as raw gff files
ah %>% subset(dataprovider == "Gencode")
AnnotationHub with 31 records
# snapshotDate(): 2024-04-30
# $dataprovider: Gencode
# $species: Homo sapiens, Mus musculus
# $rdataclass: GRanges
# additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer,
#   rdatadateadded, preparerclass, tags, rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH49545"]]' 

            title                                                    
  AH49545 | gencode.vM6.2wayconspseudos.gff3.gz                      
  AH49546 | gencode.vM6.annotation.gff3.gz                           
  AH49547 | gencode.vM6.basic.annotation.gff3.gz                     
  AH49548 | gencode.vM6.chr_patch_hapl_scaff.annotation.gff3.gz      
  AH49549 | gencode.vM6.chr_patch_hapl_scaff.basic.annotation.gff3.gz
  ...       ...                                                      
  AH75126 | gencode.v31.tRNAs.gff3.gz                                
  AH75127 | gencode.v31lift37.annotation.gff3.gz                     
  AH75128 | gencode.v31lift37.basic.annotation.gff3.gz               
  AH75129 | gencode.v31lift37.long_noncoding_RNAs.gff3.gz            
  AH75130 | gencode.v31lift37.unmapped.gff3.gz                       

Closing Comments

Standard (gene-level) RNA-Seq workflows:

. . .

  1. align to a reference genome
    • Produces a bam file for each sample
  2. count aligned reads which overlap an exon
    • Requires a GTF within exon-coordinates

. . .

  • Alignment requires preparing an index for your aligner (e.g. STAR, hisat2)
  • Indexes always require a reference genome in fasta format
  • Can usually obtain from UCSC, Ensembl, Gencode etc
  • Non-standard approaches can utilise the above

Closing Comments

If aligning to a reference transcriptome

  • Can download a transcriptome from Gencode etc
  • Prepare a tailored transcriptome using the above strategy
  • Index the transcriptome using salmon, kallisto etc

. . .

  • GTF not needed during counting \(\implies\) index preparation only
  • No bam file produced \(\implies\) only transcript-level counts

. . .

  • Above allows incorporation of variants into a reference transcriptomes
    • transmogR developed at BODL

Closing Comments

  • The above was overkill for a generic analysis
  • Gave introductions to fundamental data structures

. . .

  • Biostrings for DNA/RNA sequences
    • XString/Set/List objects
  • GRanges for working with co-ordinates
    • Often obtained by importing bed or gtf files
  • GRangesList for splitting GRanges by some feature