The ggbio package is a great tool to use for visualizing next generation
sequencing data. It combines the graphing power of the ggplot2 package
with the biology aware data structures in bioconductor. The package includes
support for plotting genes in the standard databases supplied by bioconductor,
which works well for heavily studied organisms such as human and mouse.
If you are interested in a less well annotated organism, there is no prebuilt
database to pull from. In this case often the gene annotation is described as
a general feature format (gff) file. This specification has gone through
a number of revisions through the years, the most current of which is gff3.
Unfortunately, there is no current function to import gff3 into the GRangeList
format which is used by ggbio to plot genes. This specification also is
somewhat complex to parse in R due to there being multiple levels of
relationships and optional fields.
The genomeIntervals package contains a function to read gff3 files, however
this creates a “Genome_intervals_stranded” object, not the GRangesList we need
for ggbio. The easyRNASeq package has a function to convert the
Genome_intervals_stranded object into a GRangesList, but that seems like
a large unrelated dependency, it would probably be easier to just parse and
create the object directly as a GRangesList. Fortunately using Rcpp, some tips
from the above packages and some convenient helper functions it is actually
First we create a helper function to do string tokenizing
Then the c++ parsing code to convert the gff file to a data frame. Note we
have to store the entire file in memory before constructing the data frame to
determine the number of columns due to the optional attributes.
This gives us the file as a dataframe in R. We then need to transform the data
frame into a GRangeList object for ggbio. One problem with constructing the
data frame in the C++ code the way that I did it is that all the columns are
created as strings, even though a number of the columns are numeric, and the
others can probably be factors. Luckily using the all_numeric function from
the Hmisc package will do the test and conversion for us.
Then all we have to do is split the data based on seqid and grouping variable
to construct the GRangesList object we want.
This function can then be used to read and plot a gene annotation with ggbio
Note that parsing gtf files can be done with the same code, you just need to
change the attribute seperator from ‘=’ to ‘ ‘
The code for all of the above functions are at this gist.