>ERCC-00002
TCCAGATTACTTCCATTTCCGCCCAAGCTGCTCACAGTATACGGGCGTCG
GCATCCAGACCGTCGGCTGATCGTGGTTTTACTAGGCTAGACTAGCGTAC
GAGCACTATGGTCAGTAATTCCTGGAGGAATAGGTACCAAGAAAAAAACG
AACCTTTGGGTTCCAGAGCTGTACGGTCGCACTGAACTCGGATAGGTCTC
AGAAAAACGAAATATAGGCTTACGGTAGGTCCGAATGGCACAAAGCTTGT
TCCGTTAGCTGGCATAAGATTCCATGCCTAGATGTGATACACGTTTCTGG
AAACTGCCTCGTCATGCGACTGTTCCCCGGGGTCAGGGCCGCTGGTATTT
GCTGTAAAGAGGGGCGTTGAGTCCGTCCGACTTCACTGCCCCCTTTCAGC
CTTTTGGGTCCTGTATCCCAATTCTCAGAGGTCCCGCCGTACGCTGAGGA
References and Ranges
RAdelaide 2024
Reference Sequences
Fasta Files
- Used for entire genomes or small, single sequences
- Sequence name + optional “free-form” metadata begins with
>
- Followed by sequence with no prefix
- Single sequences commonly wrapped on multiple lines
- Sequence can be DNA, RNA or Protein
Sometimes use a standardised set of IUPAC codes
Biostrings
- The Bioconductor package for handling these is
Biostrings
- Depending on the sequence structure
\(\implies\)DNAString
,DNAStringSet
orDNAStringSetList
library(Biostrings)
DNAString("AACGT")
5-letter DNAString object
seq: AACGT
DNAStringSet("AACGT")
DNAStringSet object of length 1:
width seq
[1] 5 AACGT
DNAStringSetList("AACGT")
DNAStringSetList of length 1
[[1]] AACGT
By have a new class (not just characters) many operations become simplified
Biostrings
- General class for RNA or Proteins \(\implies\)
XString/Set/List
RNAString*
orAAString*
XStringSet
objects are lists ofXString
objectsXStringSetList
objects are lists ofXStringSet
objects
DNAStringSet("AACGT") |> unlist()
5-letter DNAString object
seq: AACGT
DNAStringSetList("AACGT") |> unlist()
DNAStringSet object of length 1:
width seq
[1] 5 AACGT
Biostrings
- If working with complete transcripts \(\implies\)
XStringSet
- When needing exon structure within transcripts \(\implies\)
XStringSetList
Biostrings
- Using some example ERCC spike-in data
- Short, known sequences sometimes used for RNA-Seq QC + Normalisation
<- readDNAStringSet("data/ERCC92.fa.gz")
ercc ercc
DNAStringSet object of length 92:
width seq names
[1] 1061 TCCAGATTACTTCCATTTCCGCCCAAGCTGCTC...TTACCCTTAAAAAAAAAAAAAAAAAAAAAAAA ERCC-00002
[2] 1023 CAGCAGCGATTAAGGCAGAGGCGTTTGTATCTG...CACAAGATAAAAAAAAAAAAAAAAAAAAAAAA ERCC-00003
[3] 523 TCTTGCTTCAACAATAACGTCTCTTTCAGAAGG...AGGCACATAAAAAAAAAAAAAAAAAAAAAAAA ERCC-00004
[4] 984 CAATGATAGGCTAGTCTCGCGCAGTACATGGTA...ACCGATATAAAAAAAAAAAAAAAAAAAAAAAA ERCC-00009
[5] 994 CGAGAGATGTTTGTAGGTGCGGAATGTGTGCGG...CTATGGACAAAAAAAAAAAAAAAAAAAAAAAA ERCC-00012
... ... ...
[88] 1022 AGGAGCTCCAGTAGTTTTCCCCTCAAAAATTCC...TTGGTGGTTAAAAAAAAAAAAAAAAAAAAAAA ERCC-00164
[89] 872 GATATGCGTTACGTGAGTCTGATAGCAGTTCAC...CGTATGTGAAAAAAAAAAAAAAAAAAAAAAAA ERCC-00165
[90] 1024 CCAATGAACTCAGCTATTCTTCTTAACAAATAA...ATGAGGTAAAAAAAAAAAAAAAAAAAAAAAAA ERCC-00168
[91] 1023 TATTGGTGGAGGGGCACAAGTTGCTGAAGTTGC...CTTAGGTTAAAAAAAAAAAAAAAAAAAAAAAA ERCC-00170
[92] 505 CTGGAGATTGTCTCGTACGGTTAAGAGCCTCCG...AGAACGAGAAAAAAAAAAAAAAAAAAAAAAAA ERCC-00171
Biostrings
length(ercc)
[1] 92
width(ercc) |> head()
[1] 1061 1023 523 984 994 808
names(ercc) |> head()
[1] "ERCC-00002" "ERCC-00003" "ERCC-00004" "ERCC-00009" "ERCC-00012" "ERCC-00013"
letterFrequency(ercc, "GC", as.prob = TRUE) |> head()
G|C
[1,] 0.5136664
[2,] 0.3264907
[3,] 0.3441683
[4,] 0.4725610
[5,] 0.5090543
[6,] 0.4282178
Biostrings
reverseComplement(ercc)
DNAStringSet object of length 92:
width seq names
[1] 1061 TTTTTTTTTTTTTTTTTTTTTTTTAAGGGTAAA...AGCAGCTTGGGCGGAAATGGAAGTAATCTGGA ERCC-00002
[2] 1023 TTTTTTTTTTTTTTTTTTTTTTTTATCTTGTGG...AGATACAAACGCCTCTGCCTTAATCGCTGCTG ERCC-00003
[3] 523 TTTTTTTTTTTTTTTTTTTTTTTTATGTGCCTG...CTTCTGAAAGAGACGTTATTGTTGAAGCAAGA ERCC-00004
[4] 984 TTTTTTTTTTTTTTTTTTTTTTTTATATCGGTA...ACCATGTACTGCGCGAGACTAGCCTATCATTG ERCC-00009
[5] 994 TTTTTTTTTTTTTTTTTTTTTTTTGTCCATAGT...CGCACACATTCCGCACCTACAAACATCTCTCG ERCC-00012
... ... ...
[88] 1022 TTTTTTTTTTTTTTTTTTTTTTTAACCACCAAA...GAATTTTTGAGGGGAAAACTACTGGAGCTCCT ERCC-00164
[89] 872 TTTTTTTTTTTTTTTTTTTTTTTTCACATACGC...TGAACTGCTATCAGACTCACGTAACGCATATC ERCC-00165
[90] 1024 TTTTTTTTTTTTTTTTTTTTTTTTTACCTCATT...TATTTGTTAAGAAGAATAGCTGAGTTCATTGG ERCC-00168
[91] 1023 TTTTTTTTTTTTTTTTTTTTTTTTAACCTAAGA...CAACTTCAGCAACTTGTGCCCCTCCACCAATA ERCC-00170
[92] 505 TTTTTTTTTTTTTTTTTTTTTTTTCTCGTTCTT...GGAGGCTCTTAACCGTACGAGACAATCTCCAG ERCC-00171
Biostrings
XStringSet
objects are easy to subset
subseq(ercc, 1, width = 10)
DNAStringSet object of length 92:
width seq names
[1] 10 TCCAGATTAC ERCC-00002
[2] 10 CAGCAGCGAT ERCC-00003
[3] 10 TCTTGCTTCA ERCC-00004
[4] 10 CAATGATAGG ERCC-00009
[5] 10 CGAGAGATGT ERCC-00012
... ... ...
[88] 10 AGGAGCTCCA ERCC-00164
[89] 10 GATATGCGTT ERCC-00165
[90] 10 CCAATGAACT ERCC-00168
[91] 10 TATTGGTGGA ERCC-00170
[92] 10 CTGGAGATTG ERCC-00171
Biostrings
- Sliding windows are simple & fast on
XString
objects
Views(ercc[[1]], start = 1:3, width = 10)
Views on a 1061-letter DNAString subject
subject: TCCAGATTACTTCCATTTCCGCCCAAGCTGCTCACAGTATACGG...GGTGCTTGCGTTTTACCCTTAAAAAAAAAAAAAAAAAAAAAAAA
views:
start end width
[1] 1 10 10 [TCCAGATTAC]
[2] 2 11 10 [CCAGATTACT]
[3] 3 12 10 [CAGATTACTT]
Views(ercc[[1]], start = 1:3, width = 10) |> letterFrequency("GC", as.prob = TRUE)
G|C
[1,] 0.4
[2,] 0.4
[3,] 0.3
BSgenome Objects
- Biostrings are used to provide entire genomes \(\implies\)
BSgenome
objects
library(BSgenome.Hsapiens.UCSC.hg38)
<- BSgenome.Hsapiens.UCSC.hg38
hg38 hg38
| BSgenome object for Human
| - organism: Homo sapiens
| - provider: UCSC
| - genome: hg38
| - release date: 2023/01/31
| - 711 sequence(s):
| chr1 chr2 chr3
| chr4 chr5 chr6
| chr7 chr8 chr9
| chr10 chr11 chr12
| chr13 chr14 chr15
| ... ... ...
| chr19_KV575256v1_alt chr19_KV575257v1_alt chr19_KV575258v1_alt
| chr19_KV575259v1_alt chr19_KV575260v1_alt chr19_MU273387v1_alt
| chr22_KN196485v1_alt chr22_KN196486v1_alt chr22_KQ458387v1_alt
| chr22_KQ458388v1_alt chr22_KQ759761v1_alt chrX_KV766199v1_alt
| chrX_MU273395v1_alt chrX_MU273396v1_alt chrX_MU273397v1_alt
|
| Tips: call 'seqnames()' on the object to get all the sequence names, call 'seqinfo()' to get the
| full sequence info, use the '$' or '[[' operator to access a given sequence, see '?BSgenome' for
| more information.
BSgenome Objects
- Genomes are ‘promises’ cached on your HDD
- Only the metadata is actually loaded
- Can be directly loaded into memory if needed
<- getSeq(hg38, "chrM")
chrM chrM
16569-letter DNAString object
seq: GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCAT...AAGCCTAAATAGCCCACACGTTCCCCTTAAATAAGACATCACGATG
BSgenome
seqinfo()
objects provide sequence lengths & build information
<- seqinfo(hg38)
sq sq
Seqinfo object with 711 sequences (1 circular) from hg38 genome:
seqnames seqlengths isCircular genome
chr1 248956422 FALSE hg38
chr2 242193529 FALSE hg38
chr3 198295559 FALSE hg38
chr4 190214555 FALSE hg38
chr5 181538259 FALSE hg38
... ... ... ...
chr22_KQ759761v1_alt 145162 FALSE hg38
chrX_KV766199v1_alt 188004 FALSE hg38
chrX_MU273395v1_alt 619716 FALSE hg38
chrX_MU273396v1_alt 294119 FALSE hg38
chrX_MU273397v1_alt 330493 FALSE hg38
Genomic Ranges
Genomic Ranges
- We can work with specific regions with a genome using
GRanges
objects- The Bioconductor equivalent of a
bed
file
- The Bioconductor equivalent of a
GRanges("chr1:10001-11000")
GRanges object with 1 range and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 10001-11000 *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
Genomic Ranges
- Best practice is to define which genome the range has been drawn from
GRanges("chr1:10001-11000", seqinfo = sq)
GRanges object with 1 range and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 10001-11000 *
-------
seqinfo: 711 sequences (1 circular) from hg38 genome
Bed Files
bed
files are a common genomics format- Primarily contain ranges but can include some additional fields
chrY 10000 2781479 . 0 .
chrY 56887902 57217415 . 0 .
. . .
- The columns are defined as
chrom
,chromStart
,chromEnd
,name
,score
&strand
- Empty fields (
name
&strand
) are provided as.
- Additional columns for plotting parameters can also be passed
Bed Files
- These are simple
tsv
files (sometimes space-separated)- We could use
read_tsv()
if we really want - Would need to use:
read_tsv("path/to/bed", col_names = c("chrom", "chromStart", "chromEnd", "name", "score", "strand"))
- We could use
. . .
- The package
rtracklayer
is an all-purpose -omics file import package- Handles 0/1-based co-ordinates easily
- Will parse files as
GRanges
Importing Bed Files
library(rtracklayer)
<- import.bed("data/parY.bed", seqinfo = sq)
parY parY
GRanges object with 2 ranges and 2 metadata columns:
seqnames ranges strand | name score
<Rle> <IRanges> <Rle> | <character> <numeric>
[1] chrY 10001-2781479 * | <NA> 0
[2] chrY 56887903-57217415 * | <NA> 0
-------
seqinfo: 711 sequences (1 circular) from hg38 genome
. . .
- Any unspecified
strand
information \(\implies\)*
Genomic Ranges
- Note that we have an
mcols
elementDataFrame
with metadata about each range
mcols(parY)
DataFrame with 2 rows and 2 columns
name score
<character> <numeric>
1 NA 0
2 NA 0
. . .
- Can also separate the ranges
granges(parY)
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chrY 10001-2781479 *
[2] chrY 56887903-57217415 *
-------
seqinfo: 711 sequences (1 circular) from hg38 genome
The plyranges
package
plyranges
providesdplyr
-like operations onGRanges
objects
library(plyranges)
%>% mutate(name = c("PAR1", "PAR2")) parY
GRanges object with 2 ranges and 2 metadata columns:
seqnames ranges strand | name score
<Rle> <IRanges> <Rle> | <character> <numeric>
[1] chrY 10001-2781479 * | PAR1 0
[2] chrY 56887903-57217415 * | PAR2 0
-------
seqinfo: 711 sequences (1 circular) from hg38 genome
%>% select(score) parY
GRanges object with 2 ranges and 1 metadata column:
seqnames ranges strand | score
<Rle> <IRanges> <Rle> | <numeric>
[1] chrY 10001-2781479 * | 0
[2] chrY 56887903-57217415 * | 0
-------
seqinfo: 711 sequences (1 circular) from hg38 genome
- Only for
GRanges
\(\implies\) notDataFrame
Genomic Ranges
- This makes extracting the genomic sequence trivial
getSeq(hg38, parY)
DNAStringSet object of length 2:
width seq
[1] 2771479 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTCTGAA...TCCTCAGGACTCAGACCTTAGTTATAGATTAAAAGAAGTTAA
[2] 329513 ATGTGTTTTTTGGCTGCATAAATGTCTTCTTTTGAGAAGTGTC...GGGTGTGGGTGTGGGTGTGTGGGTGTGGTGTGTGGGTGTGGT
Gene And Transcript Annotations
- Gene and transcript annotations use a co-ordinate-based system
- External files usually in
gtf
format- Columns 1-8:
chrom
,source
,type
,start
,end
,score
,strand
,phase
- Column 9 is ‘
;
’ separated and flexible
- Columns 1-8:
. . .
- Passed to
R
as aGRanges
object withmcols
rtracklayer
providesimport.gff()
gtf
files are based on v2 specifications of thegff
format
Gene And Transcript Annotations
<- import.gff("data/gencode.v41.chrY.gtf.gz", genome = "hg38")
gtf gtf
. . .
plyranges
can be very useful for removing redundant columns- Individual columns can be extracted using
$
$type %>% levels() gtf
[1] "gene" "transcript" "exon" "CDS" "start_codon" "stop_codon"
[7] "UTR"
- These are the types of features in a standard gtf
gene
andtranscript
annotations cover the entire range- Individual exons within each transcripts are
exons
Notice the types of features provided in a standard gtf
Genomic Ranges
This structure highlights a few unexpected features of GRanges
- There are no rownames or colnames
- Ranges can have
names()
mcols()
have colnames
- Ranges can have
nrow
andncol
only work onmcols
NOT theGRanges
object- The number of ranges is obtained using
length()
length(gtf)
[1] 9955
Gene And Transcript Annotations
- If we want to find the sequence of a gene or transcript
\(\implies\) can’t use these features
subset(gtf, type == "gene")
subset(gtf, type == "transcript")
. . .
- Instead use exons
- The same ranges may be annotated as multiple exons
%>%
gtf filter(type == "exon") %>%
filter_by_overlaps(GRanges("chrY:284167-284314"))
- This is a
plyranges
approach to the question- Still, note these are identical exons annotated for 10 transcripts
Gene And Transcript Annotations
- To find every exon associated with a gene:
- Subset our data to exons only
- Split into a
GRangesList
using the gene_id - Merge overlapping exons
%>%
gtf filter(type == "exon") %>%
splitAsList(.$gene_id)
This is the first time we’ve seen the .
inactions using %>%
GRangesList Objects
GRangesList
objects strictly have aGRanges
object as every element- Often obtained by splitting a
GRanges
object based on a singlemcol
- Must have identical
seqinfo
elements- i.e. from the same genome
- Must have
mcols
with common names and types - Very efficient way to perform complex operations
GRangesList Objects
- The function
reduce
merges overlapping ranges \(\implies\)mcols
are lost - Will handle all elements of a
GRangesList
%>%
gtf filter(type == "exon") %>%
splitAsList(.$gene_id) %>%
::reduce() GenomicRanges
. . .
- Notice that the original 107 exons for the first gene are now 15
Maintaining mcols is enabled using reduceMC
from extraChIPs
but it’s needed less often than we might think
Gene And Transcript Annotations
- The above strategy could be used to return:
- the overall coding length for a gene
- the overall GC content
- a “consensus sequence” for a gene
\(\implies\) not guaranteed to represent what’s expressed
. . .
- To obtain transcript sequences we split by transcript_id
- No need to reduce overlapping exons
Gene And Transcript Annotations
<- gtf %>%
trans_grl filter(type == "exon") %>%
splitAsList(.$transcript_id)
trans_grl
. . .
getSeq(hg38, trans_grl)
DNAStringSetList of length 1018
[["ENST00000155093.8"]] AGTGCTCCAGAGAAAGGCCGTCCTGCAGCACCCGCCGCTGTCGCCGACCGCCCGCACATCCGTCGGGTGAGTC...
[["ENST00000215473.7"]] CTGGTGGTCCAGTACCTCCAAAGATATGGAATACACTCCTGAAATATCCTGAAACCTTTTTTTTTTCAGAATC...
[["ENST00000244174.11_PAR_Y"]] GCTGTGCACCCAGAGATAGTTGGGTGACAAATCACCTCCAGGTTGGGGATGCCTCAGACTTGTGAT...
[["ENST00000250776.5"]] TGTCTGTCAGAGCTGTCAGCCTGCTTAAGCAGAGTAAAATGGTACAGGCAGTGCAGCCTGGTAGCGAGAAAAA...
[["ENST00000250784.13"]] CTCTTCCGTCGCAGAGTTTCGCCATG ...
[["ENST00000250805.5"]] TGTCTGTCAGAGCTGTCAGCCTGCTTAAGCAGAGTAAAATGGTACAGGCAGTGCAGCCTGGTAGCGAGAAAAA...
[["ENST00000250823.5"]] ATACACAGGGAGGCCAGGCAGCCTGGAGTTAGTCGACCGTTGCGAGACGTTGAGCTGCGGCAGATGAGTCCAA...
[["ENST00000250825.5"]] AGTCGACCGTTGCGAGACGTTGAGCTGCGGCAGATGAGTCCAAAGCCGAGAGCCTCGGGACCTCCGGCCAAGG...
[["ENST00000250831.6"]] ATGATTGGCCGCTGGAGGTAGGCGGGATTTCCGGGCACGGCTTCCGGCGTCCTTCCCTCTCAGGGTAGCTCCA...
[["ENST00000250838.6"]] GTAACAGGCAGGAAGAAAGCTTTCTGTACTACACCAGAGGGTTGGGGCTGTGGATTTAGCTACTCTCACCTGA...
...
<1008 more elements>
Gene And Transcript Annotations
- Notice that each transcript retains the exon structure
getSeq(hg38, trans_grl)[[1]]
DNAStringSet object of length 8:
width seq
[1] 389 AGTGCTCCAGAGAAAGGCCGTCCTGCAGCACCCGCCGCTGTCGC...GGGCACGGCACGGGGGCGAGAAGGCGAAGGCTGCAGGCGTGAG
[2] 89 GAGCTGTGACTAATGAGAATTAAAGGCCATGGATGAAGATGAATTTGAATTGCAGCCACAAGAGCCAAACTCATTTTTTGATGGAATAG
[3] 573 GAGCTGATGCTACACACATGGATGGTGATCAGATTGTTGTGGAA...GATAATGACAAAGCCAGCTGTGAGGACTACCTAATGATTTCGT
[4] 150 TGGATGATGCTGGCAAAATAGAACATGATGGTTCCACTGGAGTG...AAGGTGTACATTTTTAAAGCTGACCCTGGAGAAGATGACTTAG
[5] 144 GTGGAACTGTAGACATTGTGGAGAGTGAACCTGAAAATGATCAT...TATATGACTGTCAATGACTCTCAACAAGAAGATGAAGATTTAA
[6] 153 ATGTTGCTGAAATTGCTGATGAAGTTTATATGGAAGTGATCGTA...GAAATGAAAACCTTCGTACCAATTGCATGGGCAGCAGCTTATG
[7] 141 GTAATAATTCTGATGGAATTGAAAACCGGAATGGCACTGCAAGT...CCAAAGAAAAAGAGAAGACCTGATTCCAGGCAGTACCAAACAG
[8] 3697 CAATAATTATTGGCCCTGATGGTCATCCTTTGACTGTCTATCCT...AAAAATAATATTTTACTGTTAATAAATGTTTACTTGTATATGA
Gene And Transcript Annotations
- A cleaner approach is provided by
GenomicFeatures
library(GenomicFeatures)
extractTranscriptSeqs(hg38, trans_grl)
DNAStringSet object of length 1018:
width seq names
[1] 5336 AGTGCTCCAGAGAAAGGCCGTCCTGCAGCACC...TTACTGTTAATAAATGTTTACTTGTATATGA ENST00000155093.8
[2] 4595 CTGGTGGTCCAGTACCTCCAAAGATATGGAAT...AAGGTATTTTGAGCCCTTCAGAAGACATTCT ENST00000215473.7
[3] 2038 GCTGTGCACCCAGAGATAGTTGGGTGACAAAT...TCTAGAGATTAAAAGGGGCTTGATGGCTGTT ENST00000244174.1...
[4] 910 TGTCTGTCAGAGCTGTCAGCCTGCTTAAGCAG...AAAGACAATTAAACACTGGTATATTTCTGTT ENST00000250776.5
[5] 1189 CTCTTCCGTCGCAGAGTTTCGCCATGGCCCGG...GGTAGAAATTAAATCCTGTGGCTGTAGGAAA ENST00000250784.13
... ... ...
[1014] 534 GGCTTTGGGCAGAGGCAGCCAGGGCGCCGGAA...AAGGAGCAAATAAATATGATATGGGAGTGTG ENST00000700889.1
[1015] 600 GGCTTTGGGCAGAGGCAGCCAGGGCGCCGGAA...TAATGTTAATAAAGCTAAATGCAATTGCATA ENST00000700898.1
[1016] 676 GGCTTTGGGCAGAGGCAGCCAGGGCGCCGGAA...ATATTAAAATTAGTTTCTTAAATTTGTAAAA ENST00000700901.1
[1017] 1048 ACTCGCGGACGATCTCTGCTCGCCTTGGGCGC...ATAACTCGGAAAACAAAAACAAAAAAAAAAA ENST00000701084.1...
[1018] 1591 GGAGGTGGCTCGGGCCGCGCCGAGGGGACCAC...ATTCCAGAAAATAAAATATTTTTTGAAACCA ENST00000701359.1...
. . .
- We could export this as a
fasta
file usingwriteXStringSet()
- Might become a reference transcriptome (instead of genome)
AnnotationHub
Annotation Hub
- Much processing is done on an HPC when running a sequencing workflow
- Knowing standalone files & how to import into
R
is vital - Many annotations are also prepared as packages
- The database of these packages is
AnnotationHub
- Doesn’t download packages, just provides a list
Annotation Hub
library(AnnotationHub)
## This will run for a minute or two the first time you run it
<- AnnotationHub()
ah ah
AnnotationHub with 71634 records
# snapshotDate(): 2024-04-30
# $dataprovider: Ensembl, BroadInstitute, UCSC, ftp://ftp.ncbi.nlm.nih.gov/g...
# $species: Homo sapiens, Mus musculus, Drosophila melanogaster, Rattus norv...
# $rdataclass: GRanges, TwoBitFile, BigWigFile, EnsDb, Rle, OrgDb, SQLiteFil...
# additional mcols(): taxonomyid, genome, description,
# coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
# rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH5012"]]'
title
AH5012 | Chromosome Band
AH5013 | STS Markers
AH5014 | FISH Clones
AH5015 | Recomb Rate
AH5016 | ENCODE Pilot
... ...
AH117051 | Ensembl 112 EnsDb for Xiphophorus maculatus
AH117052 | Ensembl 112 EnsDb for Xenopus tropicalis
AH117053 | Ensembl 112 EnsDb for Zonotrichia albicollis
AH117054 | Ensembl 112 EnsDb for Zalophus californianus
AH117055 | Ensembl 112 EnsDb for Zosterops lateralis melanops
- Data is available from multiple sources and in multiple formats
- GRanges
Annotation Hub
- The full metadata can be seen by accessing the
mcols
mcols(ah)
DataFrame with 71634 rows and 15 columns
title dataprovider species taxonomyid
<character> <character> <character> <integer>
AH5012 Chromosome Band UCSC Homo sapiens 9606
AH5013 STS Markers UCSC Homo sapiens 9606
AH5014 FISH Clones UCSC Homo sapiens 9606
AH5015 Recomb Rate UCSC Homo sapiens 9606
AH5016 ENCODE Pilot UCSC Homo sapiens 9606
... ... ... ... ...
AH117051 Ensembl 112 EnsDb fo.. Ensembl Xiphophorus maculatus 8083
AH117052 Ensembl 112 EnsDb fo.. Ensembl Xenopus tropicalis 8364
AH117053 Ensembl 112 EnsDb fo.. Ensembl Zonotrichia albicollis 44394
AH117054 Ensembl 112 EnsDb fo.. Ensembl Zalophus californianus 9704
AH117055 Ensembl 112 EnsDb fo.. Ensembl Zosterops lateralis .. 1220523
genome description coordinate_1_based maintainer
<character> <character> <integer> <character>
AH5012 hg19 GRanges object from .. 1 Marc Carlson <mcarls..
AH5013 hg19 GRanges object from .. 1 Marc Carlson <mcarls..
AH5014 hg19 GRanges object from .. 1 Marc Carlson <mcarls..
AH5015 hg19 GRanges object from .. 1 Marc Carlson <mcarls..
AH5016 hg19 GRanges object from .. 1 Marc Carlson <mcarls..
... ... ... ... ...
AH117051 X_maculatus-5.0-male Gene and protein ann.. 1 Johannes Rainer <joh..
AH117052 UCB_Xtro_10.0 Gene and protein ann.. 1 Johannes Rainer <joh..
AH117053 Zonotrichia_albicoll.. Gene and protein ann.. 1 Johannes Rainer <joh..
AH117054 mZalCal1.pri Gene and protein ann.. 1 Johannes Rainer <joh..
AH117055 ASM128173v1 Gene and protein ann.. 1 Johannes Rainer <joh..
rdatadateadded preparerclass tags rdataclass
<character> <character> <AsIs> <character>
AH5012 2013-03-26 UCSCFullTrackImportP.. cytoBand,UCSC,track,... GRanges
AH5013 2013-03-26 UCSCFullTrackImportP.. stsMap,UCSC,track,... GRanges
AH5014 2013-03-26 UCSCFullTrackImportP.. fishClones,UCSC,track,... GRanges
AH5015 2013-03-26 UCSCFullTrackImportP.. recombRate,UCSC,track,... GRanges
AH5016 2013-03-26 UCSCFullTrackImportP.. encodeRegions,UCSC,track,... GRanges
... ... ... ... ...
AH117051 2024-04-30 AHEnsDbs 112,Annotation,AnnotationHubSoftware,... EnsDb
AH117052 2024-04-30 AHEnsDbs 112,Annotation,AnnotationHubSoftware,... EnsDb
AH117053 2024-04-30 AHEnsDbs 112,Annotation,AnnotationHubSoftware,... EnsDb
AH117054 2024-04-30 AHEnsDbs 112,Annotation,AnnotationHubSoftware,... EnsDb
AH117055 2024-04-30 AHEnsDbs 112,Annotation,AnnotationHubSoftware,... EnsDb
rdatapath sourceurl sourcetype
<character> <character> <character>
AH5012 goldenpath/hg19/data.. rtracklayer://hgdown.. UCSC track
AH5013 goldenpath/hg19/data.. rtracklayer://hgdown.. UCSC track
AH5014 goldenpath/hg19/data.. rtracklayer://hgdown.. UCSC track
AH5015 goldenpath/hg19/data.. rtracklayer://hgdown.. UCSC track
AH5016 goldenpath/hg19/data.. rtracklayer://hgdown.. UCSC track
... ... ... ...
AH117051 AHEnsDbs/v112/EnsDb... http://www.ensembl.org ensembl
AH117052 AHEnsDbs/v112/EnsDb... http://www.ensembl.org ensembl
AH117053 AHEnsDbs/v112/EnsDb... http://www.ensembl.org ensembl
AH117054 AHEnsDbs/v112/EnsDb... http://www.ensembl.org ensembl
AH117055 AHEnsDbs/v112/EnsDb... http://www.ensembl.org ensembl
Annotation Hub
- Searching is easiest using
subset()
orquery()
- The following returns all EnsDb objects built on GRCh38 (hg38)
%>%
ah subset(genome == "GRCh38") %>%
subset(rdataclass == "EnsDb")
AnnotationHub with 27 records
# snapshotDate(): 2024-04-30
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
# coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
# rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH53211"]]'
title
AH53211 | Ensembl 87 EnsDb for Homo Sapiens
AH53715 | Ensembl 88 EnsDb for Homo Sapiens
AH56681 | Ensembl 89 EnsDb for Homo Sapiens
AH57757 | Ensembl 90 EnsDb for Homo Sapiens
AH60773 | Ensembl 91 EnsDb for Homo Sapiens
... ...
AH109336 | Ensembl 108 EnsDb for Homo sapiens
AH109606 | Ensembl 109 EnsDb for Homo sapiens
AH113665 | Ensembl 110 EnsDb for Homo sapiens
AH116291 | Ensembl 111 EnsDb for Homo sapiens
AH116860 | Ensembl 112 EnsDb for Homo sapiens
Annotation Hub
- This behaves like a list \(\implies\) extract elements using
[[]]
- Will download an entire annotation set
<- ah[["AH116860"]]
ensdb ensdb
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.10
|Creation time: Mon Jun 17 16:14:46 2024
|ensembl_version: 112
|ensembl_host: localhost
|Organism: Homo sapiens
|taxonomy_id: 9606
|genome_build: GRCh38
|DBSCHEMAVERSION: 2.2
|common_name: human
|species: homo_sapiens
| No. of genes: 71935.
| No. of transcripts: 279860.
|Protein data available.
EnsDb Ojects
- The first Bioconductor packages of this type were TxDb packages
- Used UCSC annotations & genome builds
- Contain all information about a given annotation build
- EnsDb packages use Ensembl resources & the package
ensembldb
- Enormous data-mapping and information content
genes(ensdb)
transcripts(ensdb)
exonsBy(ensdb, by = "tx")
EnsDb Ojects
- Can be highly beneficial for ensuring self-contained, reproducible workflows
- No external files required
. . .
- Notice difference in genome build to UCSC (
hg38
) - Slight difference in column names to Gencode
- No TxDb or EnsDb packages available for Gencode gene annotations
Gencode Annotations
- Are available as raw gff files
%>% subset(dataprovider == "Gencode") ah
AnnotationHub with 31 records
# snapshotDate(): 2024-04-30
# $dataprovider: Gencode
# $species: Homo sapiens, Mus musculus
# $rdataclass: GRanges
# additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer,
# rdatadateadded, preparerclass, tags, rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH49545"]]'
title
AH49545 | gencode.vM6.2wayconspseudos.gff3.gz
AH49546 | gencode.vM6.annotation.gff3.gz
AH49547 | gencode.vM6.basic.annotation.gff3.gz
AH49548 | gencode.vM6.chr_patch_hapl_scaff.annotation.gff3.gz
AH49549 | gencode.vM6.chr_patch_hapl_scaff.basic.annotation.gff3.gz
... ...
AH75126 | gencode.v31.tRNAs.gff3.gz
AH75127 | gencode.v31lift37.annotation.gff3.gz
AH75128 | gencode.v31lift37.basic.annotation.gff3.gz
AH75129 | gencode.v31lift37.long_noncoding_RNAs.gff3.gz
AH75130 | gencode.v31lift37.unmapped.gff3.gz
Closing Comments
Standard (gene-level) RNA-Seq workflows:
. . .
- align to a reference genome
- Produces a
bam
file for each sample
- Produces a
- count aligned reads which overlap an exon
- Requires a GTF within exon-coordinates
. . .
- Alignment requires preparing an index for your aligner (e.g. STAR, hisat2)
- Indexes always require a reference genome in
fasta
format - Can usually obtain from UCSC, Ensembl, Gencode etc
- Non-standard approaches can utilise the above
Closing Comments
If aligning to a reference transcriptome
- Can download a transcriptome from Gencode etc
- Prepare a tailored transcriptome using the above strategy
- Index the transcriptome using
salmon
,kallisto
etc
. . .
- GTF not needed during counting \(\implies\) index preparation only
- No
bam
file produced \(\implies\) only transcript-level counts
. . .
- Above allows incorporation of variants into a reference transcriptomes
transmogR
developed at BODL
Closing Comments
- The above was overkill for a generic analysis
- Gave introductions to fundamental data structures
. . .
Biostrings
for DNA/RNA sequencesXString/Set/List
objects
GRanges
for working with co-ordinates- Often obtained by importing
bed
orgtf
files
- Often obtained by importing
GRangesList
for splittingGRanges
by some feature