getSeq {BSgenome} | R Documentation |
A convenience function for extracting a set of sequences (or subsequences) from a BSgenome or other object. This man page specifically documents the BSgenome method.
getSeq(x, ...) ## S4 method for signature 'BSgenome': getSeq(x, names, start=NA, end=NA, width=NA, strand="+", as.character=TRUE)
x |
A BSgenome object.
See the available.genomes function for how to install
a genome.
|
names |
The names of the sequences to extract from x .
If missing, then seqnames(x) is used.
See ?seqnames and ?mseqnames to get
the list of single sequences and multiple sequences (respectively)
contained in x .
Here is how the lookup between the names passed to the names
argument and the sequences in x is performed.
For each name in names :
(1) if x contains a single sequence with that name
then this sequence is returned;
(2) otherwise the names of all the elements in all the multiple
sequences are searched: name is treated as a regular
expression and grep is used for this search.
If exactly one sequence is found, then it's returned, otherwise an
error is raised.
|
start, end, width |
Vector of integers (eventually with NAs). |
strand |
A vector containing + s or/and - s.
|
as.character |
TRUE or FALSE . Should the extracted sequences
be returned in a standard character vector?
|
The names
, start
, end
, width
and strand
arguments are expanded cyclically to the length of the longest provided
none are of zero length.
A standard character vector when as.character=TRUE
.
Note that when as.character=TRUE
, then the masks that
are defined on top of the sequences to extract are ignored
(i.e. dropped) if any (see ?`MaskedXString-class`
for more information about masked sequences).
A DNAString or MaskedDNAString
object when as.character=FALSE
.
Note that as.character=FALSE
is not supported yet when extracting
more than one sequence.
Be aware that using as.character=TRUE
can be very inefficient
when the returned character vector contains very long strings
(> 1 million letters) or is itself a long vector (> 10000 strings).
getSeq
is much more efficient when used with
as.character=FALSE
but this works only for extracting
one sequence at a time for now.
H. Pages; improvements suggested by Matt Settles
available.genomes
,
BSgenome-class,
seqnames
,
mseqnames
,
grep
,
subseq
,
DNAString
,
MaskedDNAString
,
[[,BSgenome-method
# Load the Caenorhabditis elegans genome (UCSC Release ce2): library(BSgenome.Celegans.UCSC.ce2) # Look at the index of sequences: Celegans # Get chromosome V as a DNAString object: getSeq(Celegans, "chrV", as.character=FALSE) # which is in fact the same as doing: Celegans$chrV # Never try this: #getSeq(Celegans, "chrV") # or this (even worse): #getSeq(Celegans) # Get the first 20 bases of each chromosome: getSeq(Celegans, end=20) # Get the last 20 bases of each chromosome: getSeq(Celegans, start=-20) # Extracting small sequences from different chromosomes: myseqs <- data.frame( chr=c("chrI", "chrX", "chrM", "chrM", "chrX", "chrI", "chrM", "chrI"), start=c(NA, -40, 8510, 301, 30001, 9220500, -2804, -30), end=c(50, NA, 8522, 324, 30011, 9220555, -2801, -11), strand=c("+", "-", "+", "+", "-", "-", "+", "-") ) getSeq(Celegans, myseqs$chr, start=myseqs$start, end=myseqs$end) getSeq(Celegans, myseqs$chr, start=myseqs$start, end=myseqs$end, strand=myseqs$strand) # Get the "NM_058280_up_1000" sequence (belongs to the upstream1000 # multiple sequence) as a character string: s1 <- getSeq(Celegans, "NM_058280_up_1000") # or a DNAString object (more efficient): s2 <- getSeq(Celegans, "NM_058280_up_1000", as.character=FALSE) getSeq(Celegans, "NM_058280_up_5000", start=-1000) == s1 # TRUE getSeq(Celegans, "NM_058280_up_5000", start=-1000, as.character=FALSE) == s2 # TRUE