| extractTranscriptSeqs {GenomicFeatures} | R Documentation |
extractTranscriptSeqs extracts transcript or CDS sequences from
an object representing a single chromosome or a collection of chromosomes.
extractTranscriptSeqs(x, transcripts, ...) ## S4 method for signature 'DNAString' extractTranscriptSeqs(x, transcripts, strand="+") ## S4 method for signature 'ANY' extractTranscriptSeqs(x, transcripts)
x |
An object representing a single chromosome or a collection of chromosomes.
More precisely, Other objects representing a collection of chromosomes are supported
(e.g. FaFile objects in the Rsamtools package)
as long as |
transcripts |
An object representing the exon ranges of each transcript to extract. More precisely:
|
... |
Additional arguments, for use in specific methods. |
strand |
Only supported when Can be an atomic vector, a factor, or an Rle object,
in which case it indicates the strand of each transcript (i.e. all the
exons in a transcript are considered to be on the same strand).
More precisely: it's turned into a factor (or factor-Rle)
that has the "standard strand levels" (this is done by calling the
|
A DNAStringSet object parallel to
transcripts, that is, the i-th element in the returned object
is the sequence of the i-th transcript in transcripts.
H. Pages
The transcriptLocs2refLocs function for converting
transcript-based locations into reference-based locations.
The available.genomes function in the
BSgenome package for checking avaibility of BSgenome
data packages (and installing the desired one).
The DNAString and DNAStringSet classes defined and documented in the Biostrings package.
The translate function in the
Biostrings package for translating DNA or RNA sequences
into amino acid sequences.
The GRangesList class defined and documented in the GenomicRanges package.
The RangesList class defined and documented in the IRanges package.
The exonsBy function for extracting exon ranges
grouped by transcript.
The TxDb class.
## ---------------------------------------------------------------------
## 1. A TOY EXAMPLE
## ---------------------------------------------------------------------
library(Biostrings)
## A chromosome of length 30:
x <- DNAString("ATTTAGGACACTCCCTGAGGACAAGACCCC")
## 2 transcripts on 'x':
tx1 <- IRanges(1, 8) # 1 exon
tx2 <- c(tx1, IRanges(12, 30)) # 2 exons
transcripts <- IRangesList(tx1=tx1, tx2=tx2)
extractTranscriptSeqs(x, transcripts)
## By default, all the exons are considered to be on the plus strand.
## We can use the 'strand' argument to tell extractTranscriptSeqs()
## to extract them from the minus strand.
## Extract all the exons from the minus strand:
extractTranscriptSeqs(x, transcripts, strand="-")
extractTranscriptSeqs(x, IRangesList(tx1=tx1, tx2=rev(tx2)), strand="-")
## Extract the exon of the 1st transcript from the minus strand:
extractTranscriptSeqs(x, transcripts, strand=c("-", "+"))
## Extract the 2nd exon of the 2nd transcript from the minus strand:
extractTranscriptSeqs(x, transcripts, strand=list("-", c("+", "-")))
## ---------------------------------------------------------------------
## 2. A REAL EXAMPLE
## ---------------------------------------------------------------------
## Load a genome:
library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19
## Load a TxDb object:
txdb_file <- system.file("extdata", "hg19_knownGene_sample.sqlite",
package="GenomicFeatures")
txdb <- loadDb(txdb_file)
## Check that 'txdb' is based on the hg19 assembly:
txdb
## Extract the exon ranges grouped by transcript from 'txdb':
transcripts <- exonsBy(txdb, by="tx", use.names=TRUE)
## Extract the transcript sequences from the genome:
tx_seqs <- extractTranscriptSeqs(genome, transcripts)
tx_seqs
## A sanity check:
stopifnot(identical(width(tx_seqs), unname(sum(width(transcripts)))))
## ---------------------------------------------------------------------
## 3. USING extractTranscriptSeqs() TO EXTRACT CDS SEQUENCES
## ---------------------------------------------------------------------
cds <- cdsBy(txdb, by="tx", use.names=TRUE)
cds_seqs <- extractTranscriptSeqs(genome, cds)
cds_seqs
## A sanity check:
stopifnot(identical(width(cds_seqs), unname(sum(width(cds)))))
## Note that, alternatively, the CDS sequences can be obtained from the
## transcript sequences by removing the 5' and 3' UTRs:
five_utr_width <- sum(width(fiveUTRsByTranscript(txdb, use.names=TRUE)))
five_utr_width <- five_utr_width[names(cds_seqs)]
five_utr_width[is.na(five_utr_width)] <- 0L
three_utr_width <- sum(width(threeUTRsByTranscript(txdb, use.names=TRUE)))
three_utr_width <- three_utr_width[names(cds_seqs)]
three_utr_width[is.na(three_utr_width)] <- 0L
cds_seqs2 <- narrow(tx_seqs[names(cds_seqs)],
start=five_utr_width+1L,
end=-(three_utr_width+1L))
stopifnot(identical(as.character(cds_seqs), as.character(cds_seqs2)))
## ---------------------------------------------------------------------
## 4. TRANSLATE THE CDS SEQUENCES
## ---------------------------------------------------------------------
prot_seqs <- translate(cds_seqs, if.fuzzy.codon="solve")
## Note that, by default, translate() uses The Standard Genetic Code to
## translate codons into amino acids. However, depending on the organism,
## a different genetic code might be needed to translate CDS sequences
## located on the mitochodrial chromosome. For example, for vertebrates,
## the following code could be used to correct 'prot_seqs':
SGC1 <- getGeneticCode("SGC1")
chrM_idx <- which(all(seqnames(cds) == "chrM"))
prot_seqs[chrM_idx] <- translate(cds_seqs[chrM_idx], genetic.code=SGC1,
if.fuzzy.codon="solve")