UCSCTableQuery-class {rtracklayer} | R Documentation |
The UCSC genome browser is backed by a large database, which is exposed by the Table Browser web interface. Tracks are stored as tables, so this is also the mechanism for retrieving tracks. The UCSCTableQuery class represents a query against the Table Browser. Storing the query fields in a formal class facilitates incremental construction and adjustment of a query.
There are six supported fields for a table query:
provider: A session, a genome identifier, or a TrackHub URI. When it is a session, it is the UCSCSession instance from which the tables are retrieved. Although all sessions are based on the same database, the set of user-uploaded tracks, which are represented as tables, is in general not the same.
table: The name of the specific table to retrieve. May be NULL, in which case the behavior depends on how the query is executed; see below.
range: A genome identifier, a GRanges or an IntegerRangesList indicating the portion of the table to retrieve, in genome coordinates. Simply specifying the genome string is the easiest way to download data for the entire genome, and GRangesForUCSCGenome facilitates downloading data for, e.g., an entire chromosome.
hubUrl: The URI of the specific TrackHub.
genome: A genome identifier for the specific TrackHub. It is needed only if the provider is a TrackHub URI.
names: Names/accessions of the desired features.
A common workflow for querying the UCSC database is to create an instance of UCSCTableQuery using the ucscTableQuery constructor, invoke tableNames to list the available tables for a track, and finally retrieve the desired table either as a data.frame via getTable or as a track via track. See the examples.
The reason for a formal query class is to facilitate multiple queries when the differences between the queries are small. For example, one might want to query multiple tables within the same track and/or the same genomic region, or query the same table for multiple regions. The UCSCTableQuery instance can be incrementally adjusted for each new query. Some caching is also performed, which enhances performance.
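As a minimal sketch of this incremental adjustment, the helper below reuses one query object and changes only its table field between retrievals. The function name fetch_tables and its arguments are hypothetical, introduced here for illustration; only tableName<- and getTable come from the package.

```r
# Hypothetical helper (not part of rtracklayer): retrieve several tables
# through a single UCSCTableQuery by adjusting only the table name.
fetch_tables <- function(query, tables) {
  results <- vector("list", length(tables))
  names(results) <- tables
  for (tbl in tables) {
    # change one field of the query; provider and range are kept as they are
    rtracklayer::tableName(query) <- tbl
    results[[tbl]] <- rtracklayer::getTable(query)
  }
  results
}

## Not run (needs network access to UCSC):
# query <- rtracklayer::ucscTableQuery("hg18", table = "snp129",
#                                      names = "rs10003974")
# tabs <- fetch_tables(query, "snp129")
```

The result is a list of data.frames named after the requested tables. Whether a given table can be parsed by getTable depends on the table, so treat the table names passed in as something to verify with tableNames first.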
ucscTableQuery(x, range = seqinfo(x), table = NULL, names = NULL, hubUrl = NULL, genome = NULL): Creates a UCSCTableQuery with the UCSCSession, genome identifier or TrackHub URI given as x and the table name given by the single string table. range should be a genome string identifier, a GRanges instance or an IntegerRangesList instance, and it effectively defaults to genome(x). If the genome is missing, it is taken from the provider. Feature names, such as gene identifiers, may be passed via names as a character vector.
Below, object is a UCSCTableQuery instance.
track(object): Retrieves the indicated table as a track, i.e. a GRanges object. Note that not all tables are available as tracks.
getTable(object): Retrieves the indicated table as a data.frame. Note that not all tables are output in parseable form, and that UCSC will truncate responses if they exceed certain limits (usually around 100,000 records). The safest (and most efficient) bet for large queries is to download the file via FTP and query it locally.
tableNames(object): Gets the names of the tables available for the provider, table and range specified by the query.
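Because getTable may return a truncated result for large queries, it can help to flag results whose size reaches the limit. The sketch below is an assumption-laden illustration: maybe_truncated and get_table_checked are hypothetical names, and the default threshold of 100000 rows is taken from the "usually around 100,000 records" remark above, not from a documented constant.

```r
# Hypothetical check (not part of rtracklayer): TRUE when a result has at
# least `limit` rows, which suggests UCSC may have truncated the response.
maybe_truncated <- function(tab, limit = 100000L) {
  nrow(tab) >= limit
}

# Hypothetical wrapper: run the query and warn on a suspiciously large result.
get_table_checked <- function(query, limit = 100000L) {
  tab <- rtracklayer::getTable(query)
  if (maybe_truncated(tab, limit)) {
    warning("result has ", nrow(tab), " rows and may be truncated; ",
            "consider downloading the table file and querying it locally")
  }
  tab
}
```

A row count at the threshold does not prove truncation, and a smaller count does not rule out other limits, so this is a heuristic only.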
In the code snippets below, x/object is a UCSCTableQuery object.
genome(x), genome(x) <- value: Gets or sets the genome identifier (e.g. “hg18”) of the object.
hubUrl(x), hubUrl(x) <- value: Gets or sets the TrackHub URI.
tableName(x), tableName(x) <- value: Gets or sets the single string indicating the name of the table to retrieve. May be NULL, in which case the table is automatically determined.
range(x), range(x) <- value: Gets or sets the GRanges indicating the portion of the table to retrieve in genomic coordinates. Any missing information, such as the genome identifier, is filled in using range(browserSession(x)). It is also possible to set the genome identifier string or an IntegerRangesList.
names(x), names(x) <- value: Gets or sets the names of the features to retrieve. If NULL, this filter is disabled.
ucscSchema(x): Gets the UCSCSchema object describing the selected table.
ucscTables(genome, track): Gets the list of tables for the specified track (e.g. “Assembly”) and genome identifier (e.g. “hg19”). Here genome and track must each be a single non-NA string.
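The range accessor supports the other incremental pattern, querying the same table for several regions. The following sketch assumes that regions is a GRanges and that the range<- replacement method is available from the rtracklayer namespace as described above; fetch_regions is a hypothetical name used only for illustration.

```r
# Hypothetical helper (not part of rtracklayer): query one table for each
# region in turn by resetting only the range field of the query.
fetch_regions <- function(query, regions) {
  results <- vector("list", length(regions))
  for (i in seq_along(regions)) {
    # restrict the query to the i-th region; the table and names are kept
    rtracklayer::range(query) <- regions[i]
    results[[i]] <- rtracklayer::getTable(query)
  }
  results
}

## Not run (needs network access to UCSC):
# regions <- rtracklayer::GRangesForUCSCGenome("mm9", "chr12",
#   IRanges::IRanges(c(57795963, 57805000), c(57800000, 57815592)))
# tabs <- fetch_regions(query, regions)
```

Splitting a large region this way may also keep each response under the truncation limit, although that depends on the density of the table.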
Michael Lawrence
## Not run:
# query using `session` provider
session <- browserSession()
genome(session) <- "mm9"
## choose the phastCons30way table for a portion of mm9 chr12
query <- ucscTableQuery(session, table = "phastCons30way",
                        range = GRangesForUCSCGenome("mm9", "chr12",
                                                     IRanges(57795963, 57815592)))
## list the table names
tableNames(query)
## retrieve the track data
track(query)  # a GRanges object
## select the multiz30waySummary table
tableName(query) <- "multiz30waySummary"
## get a data.frame summarizing the multiple alignment
getTable(query)

# query using `genome identifier` provider
query <- ucscTableQuery("hg18", table = "snp129",
                        names = c("rs10003974", "rs10087355", "rs10075230"))
ucscSchema(query)
getTable(query)

# query using `TrackHub URI` provider
query <- ucscTableQuery("https://ftp.ncbi.nlm.nih.gov/snp/population_frequency/TrackHub/20200227123210/",
                        genome = "hg19", table = "ALFA_GLB")
getTable(query)

# get the list of tables for 'Assembly' track and 'hg19' genome identifier
ucscTables("hg19", "Assembly")
## End(Not run)