Seqinfo-class {GenomeInfoDb}R Documentation

Seqinfo objects

Description

A Seqinfo object is used to store basic information about a set of genomic sequences, typically chromosomes, but not necessarily.

Details

A Seqinfo object has one entry per sequence. Each entry contains the following information about the sequence:

All entries must contain at least the sequence name. The other information is optional. In addition, the seqnames in a given Seqinfo object must be unique, that is, the object is not allowed to have two entries with the same sequence name.

Note that Seqinfo objects are usually not used as standalone objects but are instead typically found inside higher level objects like GRanges or TxDb objects. These higher level objects will generally provide a seqinfo() accessor for getting/setting their Seqinfo component.

Constructor

Seqinfo(seqnames, seqlengths=NA, isCircular=NA, genome=NA): Create a Seqinfo object and populate it with the supplied data.

One special form of calling the Seqinfo() constructor is to specify only the genome argument and set it to the name of an NCBI assembly (e.g. Seqinfo(genome="GRCh38.p12")) or UCSC genome (e.g. Seqinfo(genome="hg38")), in which case the sequence information is fetched from NCBI or UCSC. See Examples section below for some examples.

Accessor methods

In the code snippets below, x is a Seqinfo object.

length(x): Return the number of sequences in x.

seqnames(x), seqnames(x) <- value: Get/set the names of the sequences in x. Those names must be non-NA, non-empty and unique. They are also called the sequence levels or the keys of the Seqinfo object.

Note that, in general, the end user should not try to alter the sequence levels with seqnames(x) <- value. The recommended way to do this is with seqlevels(x) <- value as described below.

names(x), names(x) <- value: Same as seqnames(x) and seqnames(x) <- value.

seqlevels(x): Same as seqnames(x).

seqlevels(x) <- value: Can be used to rename, drop, add and/or reorder the sequence levels. value must be either a named or unnamed character vector. When value has names, the names only serve the purpose of mapping the new sequence levels to the old ones. Otherwise (i.e. when value is unnamed) this mapping is implicitly inferred from the following rules:

(1) If the number of new and old levels are the same, and if the positional mapping between the new and old levels shows that some or all of the levels are being renamed, and if the levels that are being renamed are renamed with levels that didn't exist before (i.e. are not present in the old levels), then seqlevels(x) <- value will just rename the sequence levels. Note that in that case the result is the same as with seqnames(x) <- value but it's still recommended to use seqlevels(x) <- value as it is safer.

(2) Otherwise (i.e. if the conditions for (1) are not satisfied) seqlevels(x) <- value will consider that the sequence levels are not being renamed and will just perform x <- x[value].

See below for some examples.

seqlengths(x), seqlengths(x) <- value: Get/set the length for each sequence in x.

isCircular(x), isCircular(x) <- value: Get/set the circularity flag for each sequence in x.

genome(x), genome(x) <- value: Get/set the genome identifier or assembly name for each sequence in x.

Subsetting

In the code snippets below, x is a Seqinfo object.

x[i]: A Seqinfo object can be subsetted only by name i.e. i must be a character vector. This is a convenient way to drop/add/reorder the entries in a Seqinfo object.

See below for some examples.

Coercion

In the code snippets below, x is a Seqinfo object.

as.data.frame(x): Turns x into a data frame.

Combining Seqinfo objects

We provide no c() or rbind() methods for Seqinfo objects. This is because these operations would be expected to return an object of length length(x) + length(y) obtained by appending the entries from input object y to those from input object x. Such behavior would imply that the returned Seqinfo object is not guaranteed to contain unique seqnames, which is a requirement for Seqinfo objects.

So instead, a merge() method is provided.

In the code snippet below, x and y are Seqinfo objects.

merge(x, y, ...): Merge x and y into a single Seqinfo object where the keys (i.e. the seqnames) are union(seqnames(x), seqnames(y)). If an entry in y has the same key as an entry in x, and if the two entries contain compatible information (NA values are treated as wildcards i.e. they're compatible with anything), then the two entries are merged into a single entry in the result. If they cannot be merged (because they contain different seqlengths, and/or circularity flags, and/or genome identifiers), then an error is raised. In addition to check for incompatible sequence information, merge(x, y) also compares seqnames(x) with seqnames(y) and issues a warning if each of them has names not in the other. The purpose of these checks is to try to detect situations where the user might be combining or comparing objects that use different underlying genomes.

Note that merge() can take more than two Seqinfo objects, in which case the objects are merged from left to right e.g.

    merge(x1, x2, x3, x4)

is equivalent to

    merge(merge(merge(x1, x2), x3), x4)

intersect(x, y): Finds the intersection between two Seqinfo objects by merging them and subsetting for the intersection of their sequence names. This makes it easy to avoid warnings about each objects not being a subset of the other one during overlap operations.

A convenience wrapper, checkCompatibleSeqinfo(), is provided for checking whether 2 objects have compatible seqinfo components or not. checkCompatibleSeqinfo(x, y) is equivalent to merge(seqinfo(x), seqinfo(y)) so will work on any objects x and y that support seqinfo().

Author(s)

H. Pagès

See Also

Examples

## ---------------------------------------------------------------------
## A. MAKING A Seqinfo OBJECT FOR A GIVEN NCBI ASSEMBLY OR UCSC GENOME
## ---------------------------------------------------------------------

## One special form of calling the 'Seqinfo()' constructor is to specify
## only the 'genome' argument and set it to the name of an NCBI assembly
## or UCSC genome, in which case the sequence information is fetched
## from NCBI or UCSC ('getChromInfoFromNCBI()' or 'getChromInfoFromUCSC()'
## are used behind the scene for this so internet access is required).

if (interactive()) {
  ## NCBI assemblies (see '?registered_NCBI_assemblies' for the list of
  ## NCBI assemblies that are currently supported):
  Seqinfo(genome="GRCh38")
  Seqinfo(genome="GRCh38.p12")
  Seqinfo(genome="Amel_HAv3.1")
  Seqinfo(genome="WBcel235")
  Seqinfo(genome="TAIR10.1")

  ## UCSC genomes (see '?registered_UCSC_genomes' for the list of UCSC
  ## genomes that are currently supported):
  Seqinfo(genome="hg38")
  Seqinfo(genome="mm10")
  Seqinfo(genome="rn6")
  Seqinfo(genome="bosTau9")
  Seqinfo(genome="canFam3")
  Seqinfo(genome="musFur1")
  Seqinfo(genome="galGal6")
  Seqinfo(genome="dm6")
  Seqinfo(genome="ce11")
  Seqinfo(genome="sacCer3")
}

## ---------------------------------------------------------------------
## B. BASIC MANIPULATION OF A Seqinfo OBJECT
## ---------------------------------------------------------------------

## Note that all the arguments (except 'genome') must have the
## same length. 'genome' can be of length 1, whatever the lengths
## of the other arguments are.
x <- Seqinfo(seqnames=c("chr1", "chr2", "chr3", "chrM"),
             seqlengths=c(100, 200, NA, 15),
             isCircular=c(NA, FALSE, FALSE, TRUE),
             genome="toy")
x

## Accessors:
length(x)
seqnames(x)
names(x)
seqlevels(x)
seqlengths(x)
isCircular(x)
genome(x)

## Get a compact summary:
summary(x)

## Subset by names:
x[c("chrY", "chr3", "chr1")]

## Rename, drop, add and/or reorder the sequence levels:
xx <- x
seqlevels(xx) <- sub("chr", "ch", seqlevels(xx))  # rename
xx
seqlevels(xx) <- rev(seqlevels(xx))  # reorder
xx
seqlevels(xx) <- c("ch1", "ch2", "chY")  # drop/add/reorder
xx
seqlevels(xx) <- c(chY="Y", ch1="1", "22")  # rename/reorder/drop/add
xx

## ---------------------------------------------------------------------
## C. MERGING 2 Seqinfo OBJECTS
## ---------------------------------------------------------------------

y <- Seqinfo(seqnames=c("chr3", "chr4", "chrM"),
             seqlengths=c(300, NA, 15))
y

## This issues a warning:
merge(x, y)  # the entries for chr3 and chrM contain information merged
             # from the corresponding entries in 'x' and 'y'

## To get rid of the above warning, either use suppressWarnings() or
## set the genome on 'y':
suppressWarnings(merge(x, y))
genome(y) <- genome(x)
merge(x, y)

## Note that, strictly speaking, merging 2 Seqinfo objects is not
## a commutative operation, i.e., in general 'z1 <- merge(x, y)'
## is not identical to 'z2 <- merge(y, x)'. However 'z1' and 'z2'
## are guaranteed to contain the same information (i.e. the same
## entries, but typically not in the same order):
merge(y, x)

## This contradicts what 'x' says about circularity of chr3 and chrM:
isCircular(y)[c("chr3", "chrM")] <- c(TRUE, FALSE)
y
if (interactive()) {
  merge(x, y)  # raises an error
}

## Sanity checks:
stopifnot(identical(x, merge(x, Seqinfo())))
stopifnot(identical(x, merge(Seqinfo(), x)))
stopifnot(identical(x, merge(x, x)))

## ---------------------------------------------------------------------
## D. checkCompatibleSeqinfo()
## ---------------------------------------------------------------------

library(GenomicRanges)
gr1 <- GRanges("chr3:15-25", seqinfo=x)
gr2 <- GRanges("chr3:105-115", seqinfo=y)
if (interactive()) {
  checkCompatibleSeqinfo(gr1, gr2)  # raises an error
}

[Package GenomeInfoDb version 1.28.0 Index]