| mapToTranscripts {GenomicFeatures} | R Documentation |
Map range coordinates between features in the transcriptome and genome (reference) space.
See ?mapToAlignments in the
GenomicAlignments package for mapping coordinates between
reads (local) and genome (reference) space using a CIGAR alignment.
## S4 method for signature 'GenomicRanges,GRangesList'
mapToTranscripts(x, transcripts,
ignore.strand = TRUE, ...)
## S4 method for signature 'ANY,TxDb'
mapToTranscripts(x, transcripts, ignore.strand = TRUE,
extractor.fun = GenomicFeatures::transcripts, ...)
## S4 method for signature 'GenomicRanges,GRangesList'
pmapToTranscripts(x, transcripts,
ignore.strand = TRUE, ...)
## S4 method for signature 'GenomicRanges,GRangesList'
mapFromTranscripts(x, transcripts,
ignore.strand = TRUE, ...)
## S4 method for signature 'GenomicRanges,GRangesList'
pmapFromTranscripts(x, transcripts,
ignore.strand = TRUE, ...)
x |
|
transcripts |
A named The |
ignore.strand |
When TRUE, strand is ignored in overlap operations. When
|
extractor.fun |
Function to extract genomic features from a This argument is only applicable to Valid
|
... |
Additional arguments passed to |
mapToTranscripts, pmapToTranscripts
The genomic range in x is mapped to the local position in the
transcripts ranges. A successful mapping occurs when x
is completely within the transcripts range, equivalent to:
findOverlaps(..., type="within")
Transcriptome-based coordinates start counting at 1 at the beginning
of the transcripts range and return positions where x
was aligned. The seqlevels of the return object are taken from the
transcripts object and should be transcript names. In this
direction, mapping is attempted between all elements of x and
all elements of transcripts.
mapFromTranscripts, pmapFromTranscripts
The transcript-based position in x is mapped to genomic coordinates
using the ranges in transcripts. A successful mapping occurs when
the following is TRUE:
width(transcripts) >= start(x) + width(x)
x is aligned to transcripts by moving in
start(x) positions in from the beginning of the
transcripts range. The seqlevels of the return object are
genomic chromosomes.
When mapping to the genome, name matching is used to determine the
mapping pairs (vs attempting to match all possible pairs). Ranges in
x are only mapped to ranges in transcripts with the
same name. Name matching is motivated by use cases such as
differentially expressed regions where the expressed regions in
x would only be related to a subset of regions in
transcripts, which may contains gene or transcript ranges.
element-wise versions
pmapToTranscripts and pmapFromTranscripts are element-wise
(aka 'parallel') versions of mapToTranscripts and
mapFromTranscripts. The i-th range in x is mapped to the
i-th range in transcripts; x and transcripts must
have the same length.
Ranges in x that do not map (out of bounds or strand mismatch)
are returned as zero-width ranges starting at 0. These ranges are given
the special seqname of "UNMAPPED". Note the non-parallel methods do not
return unmapped ranges so the "UNMAPPED" seqname is unique to
pmapToTranscipts and pmapFromTranscripts.
An object the same class as x.
Parallel methods return an object the same shape as x. Ranges that
cannot be mapped (out of bounds or strand mismatch) are returned as
zero-width ranges starting at 0 with a seqname of "UNMAPPED".
Non-parallel methods return an object that varies in length similar to a
Hits object. The result only contains mapped records, strand mismatch
and out of bound ranges are not returned. xHits and
transcriptsHits metadata columns indicate the elements of x
and transcripts used in the mapping.
When present, names from x are propagated to the output. When
mapping to transcript coordinates, seqlevels of the output are the names
on the transcripts object; most often these will be transcript
names. When mapping to the genome, seqlevels of the output are the seqlevels
of transcripts which are usually chromosome names.
V. Obenchain, M. Lawrence and H. Pages
?mapToAlignments in the
GenomicAlignments package for methods mapping between
reads and genome space using a CIGAR alignment.
## ---------------------------------------------------------------------
## A. Basic Use
## ---------------------------------------------------------------------
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
cds <- cdsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene, "tx", use.names=TRUE)[1:3]
x <- GRanges("chr2L",
IRanges(c(7500, 8400, 9000), width=200,
names=LETTERS[1:3]))
## Map to transcript coordinates:
mapToTranscripts(x, cds)
## Note the seqnames of the output are the transcript names, not
## chromosomes. The 'xHits' and 'transcriptsHits' metadata
## columns indicate which elements were involved in the mapping.
## The element-wise version returns both mapped and unmapped ranges.
x <- GRanges("chr2L", IRanges(c(8100, 8600), width=10, names=LETTERS[1:2]))
pmapToTranscripts(x, cds[1:2])
## Mapping to genome space requires both 'x' and 'transcripts' to have
## names. A map is only attempted for ranges with matching names.
x <- GRanges("chr1",
IRanges(c(1, 20, 10, 3), width=5, names=c("A", "B", "A", "C")))
gr <- GRanges("chr1",
IRanges(c(25, 30, 3), width=10, names=c("C", "C", "A")))
mapFromTranscripts(x, gr)
## ---------------------------------------------------------------------
## B. Map local sequence locations to the genome
## ---------------------------------------------------------------------
## NAGNAG alternative splicing plays an essential role in biological processes
## and represents a highly adaptable system for posttranslational regulation
## of gene function. The majority of NAGNAG studies largely focus on messenger
## RNA. A study by Sun, Lin, and Yan
## (http://www.hindawi.com/journals/bmri/2014/736798/) demonstrated that
## NAGNAG splicing is also operative in large intergenic noncoding RNA
## (lincRNA).
## One finding of interest was that linc-POLR3G-10 exhibited two NAGNAG
## acceptors located in two distinct transcripts: TCONS_00010012 and
## TCONS_00010010.
## Extract the exon coordinates of TCONS_00010012 and TCONS_00010010:
lincrna <- c("TCONS_00010012", "TCONS_00010010")
library(TxDb.Hsapiens.UCSC.hg19.lincRNAsTranscripts)
txdb <- TxDb.Hsapiens.UCSC.hg19.lincRNAsTranscripts
exons <- exonsBy(txdb, by="tx", use.names=TRUE)[lincrna]
exons
## The two NAGNAG acceptors were identified in the upstream region of
## the fourth and fifth exons located in TCONS_00010012.
## Extract the sequences for transcript TCONS_00010012:
library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19
exons_seq <- getSeq(genome, exons[[1]])
## TCONS_00010012 has 4 exons:
exons_seq
## The most common triplet among the lincRNA sequences was CAG. Identify
## the location of this pattern in all exons.
cag_loc <- vmatchPattern("CAG", exons_seq)
## Convert the first occurance of CAG in each exon back to genome coordinates.
first_loc <- do.call(c, sapply(cag_loc, "[", 1, simplify=TRUE))
pmapFromTranscripts(first_loc, exons[[1]])
## -----------------------------------------------------------------------
## C. Map 3'UTR variants to genome coordinates
## -----------------------------------------------------------------------
## A study by Skeeles et. al (PLoS ONE 8(3): e58609. doi:
## 10.1371/journal.pone.0058609) investigated the impact of 3'UTR variants
## on the expression of cancer susceptibility genes.
## 8 candidate miRNA genes on chromosome 12 were used to test for
## differential luciferase expression in mice. In Table 2 of the manuscript
## variant locations are given as nucleotide position within the gene.
geneNames <- c("Bcap29", "Dgkb", "Etv1", "Hbp1", "Hbp1", "Ifrd1",
"Ifrd1", "Pik3cg", "Pik3cg", "Tspan13", "Twistnb")
starts <- c(1409, 3170, 3132, 2437, 2626, 3239, 3261, 4947, 4979, 958, 1489)
snps <- GRanges("chr12", IRanges(starts, width=1, names=geneNames))
## To map these transcript-space coordinates to the genome we need gene ranges
## in genome space.
library(org.Mm.eg.db)
geneid <- select(org.Mm.eg.db, unique(geneNames), "ENTREZID", "SYMBOL")
geneid
## Extract the gene regions:
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
genes <- genes(txdb)[geneid$ENTREZID]
## mapToTranscripts(..., reverse=TRUE) determines pairs to map by comparing
## names in 'x' and 'transcripts'. Ranges in 'snps' will be mapped to all
## ranges in 'genes' with the same name. Currently the names of 'genes' are
## internal gene ids. Rename 'genes' with the appropriate gene symbol.
names(genes) <- geneid$SYMBOL
## The xHits and transcriptsHits metadta columns indicate which ranges in
## 'snps' and 'genes' were involved in the mapping.
mapFromTranscripts(snps, genes)
## -----------------------------------------------------------------------
## D. Map dbSNP variants to cds or cDNA coordinates
## -----------------------------------------------------------------------
## The GIPR gene encodes a G-protein coupled receptor for gastric inhibitory
## polypeptide (GIP). Originally GIP was identified to inhibited gastric acid
## secretion and gastrin release but was later demonstrated to stimulate
## insulin release in the presence of elevated glucose.
## In this example 5 SNPs located in the GIPR gene are mapped to cDNA
## coordinates. A list of SNPs in GIPR can be downloaded from dbSNP or NCBI.
rsids <- c("rs4803846", "rs139322374", "rs7250736", "rs7250754", "rs9749185")
## Extract genomic coordinates with a SNPlocs package.
library(SNPlocs.Hsapiens.dbSNP141.GRCh38)
snps <- snpid2grange(SNPlocs.Hsapiens.dbSNP141.GRCh38, rsids)
## Gene regions of GIPR can be extracted from a TxDb package of compatible
## build. The TxDb package uses Entrez gene identifiers and GIPR is a gene
## symbol. Conversion between gene symbols and Entrez gene IDs is done by
## calling select() on an organism db package.
library(org.Hs.eg.db)
geneid <- select(org.Hs.eg.db, "GIPR", "ENTREZID", "SYMBOL")
## The transcriptsBy() extractor returns a range for each transcript that
## includes the UTR and exon regions (i.e., cDNA).
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
txbygene <- transcriptsBy(txdb, "gene")
cDNA <- txbygene[geneid$ENTREZID]
cDNA
## Before mapping, the chromosome names (seqlevels) in the two objects must
## be harmonized. The style for 'snps' is dbSNP and 'cDNA' is UCSC.
seqlevelsStyle(snps)
seqlevelsStyle(cDNA)
## Modify the style and genome in 'snps' to match 'cDNA'.
seqlevelsStyle(snps) <- seqlevelsStyle(cDNA)
genome(snps) <- genome(cDNA)
## The 'cDNA' object is a GRangesList of length 1. This single list element
## contains the cDNA range for 4 different transcripts. To map to each
## transcript individually 'cDNA' must be unlisted before mapping.
## Map all 5 SNPS to all 4 transcripts:
mapToTranscripts(snps, unlist(cDNA))
## Map the first SNP to transcript uc002pct.1 and the second to uc002pcu.1.
pmapToTranscripts(snps[1:2], unlist(cDNA)[1:2])
## The cdsBy() extractor returns coding regions by gene or by transcript.
## Extract the coding regions for transcript uc002pct.1.
cds <- cdsBy(txdb, "tx", use.names=TRUE)["uc002pct.1"]
cds
## The 'cds' object is a GRangesList of length 1 containing all cds ranges
## for the single transcript uc002pct.1. To map to the concatenated group
## of ranges we leave 'cds' as a GRangesList.
## Map all 5 SNPs to the total cds region:
mapToTranscripts(snps, cds)
## Only the second SNP could be mapped. Unlisting the 'cds' object maps the
## SNPs to the individual cds ranges (vs the concatenated range).
mapToTranscripts(snps[2], unlist(cds))
## The location is the same because the SNP hit the first cds range. If the
## transcript had been on the negative strand the difference in mapping
## would be more obvious.
strand(snps) <- strand(cds) <- "-"
mapToTranscripts(snps[2], cds, ignore.strand=FALSE)
mapToTranscripts(snps[2], unlist(cds), ignore.strand=FALSE)