Jim Hester

Big Data Bioinformatics My thoughts on next generation sequencing, linux, and programming.

May 20, 2015

Covr - Test Coverage for R Packages

Being primarily a statistical language, R lacks a number of the common code analysis tools available for languages used more often for general programming. Testing has been part of the R source from very early on (Martin Maechler started adding tests on April 9th, 1998, 17 years ago!). However, most of this testing was ad hoc, either requiring visual examination of the outputs or simply comparing current to previous outputs. The RUnit (2005), svUnit (2009) and testthat (2009) packages brought more formal unit testing into the R community.

So while tools existed to show authors how to test, there were very limited tools to show what to test. Code coverage has a long history (Miller et al. 1963) in the software development community and remains a useful metric today. Unfortunately the R community has had very few options to track and report on code coverage. R-coverage by Karl Forner and testCoverage by Tom Taverner, Chris Campbell and Suchen Jin were two existing solutions, but both were somewhat problematic to use. The former required a patch to the R source, which is a non-starter for widespread use (particularly on Windows). The latter had a complicated implementation which was difficult to use interactively and supported only a limited output format. Neither project supported tracking coverage of compiled code in addition to R source, nor coverage tracking services such as http://codecov.io or http://coveralls.io.

Covr aims to make code coverage in R both easy to use and informative. Package coverage can be tracked and the tests run with a single call:

cov <- package_coverage()

The results can be summarized with an informative print() method, as well as coerced to a data.frame. There is also an interactive Shiny App available via shine() which can be used to inspect the code visually and identify coverage gaps.

In addition covr supports both the http://codecov.io and http://coveralls.io coverage reporting services. These services will track the coverage of a project over time and report on coverage changes as well as exploration of the source code. They are designed to work seamlessly within a CI system such as https://travis-ci.org/ or http://wercker.com/. Simple functions exist to instrument the code, run tests and submit the results for both services and can easily be added to the CI build script to automatically run coverage for every build.

# for codecov.io
codecov()

# for coveralls.io
coveralls()

Covr version 1.0.0 is now on CRAN, and development is ongoing at https://github.com/jimhester/covr; pull requests for additional features or bug fixes are always welcome!

Apr 28, 2015

Wercker and Rocker: Finally Performant Continuous Integration for R

The Problem

Continuous Integration is a great technique for developers, contributors and users alike to ensure that the development build of a project remains in a working state. In the R community there are a few different CI setups in use. The CRAN and Bioconductor package repositories both run nightly build checks for all of their packages using custom build servers. More recently continuous builds with Travis-CI on Github repositories have grown in popularity.

I personally have found great utility in using Travis for my projects. Gmailr, rex, lintr and covr were all developed using Travis, and lintr and covr were both conceived with continuous integration in mind.

Unfortunately I have found over time that builds on Travis have been taking longer and longer, in both queue time and runtime. I think this is due to two main points:

  1. Travis has risen in popularity over the past few years and their build stack has not kept pace.
  2. The build environments for R projects have gotten more feature-full and take longer to build.

Point one will eventually happen to any CI system that gains popularity, but it does show an advantage to using a less popular system. Point two occurs largely due to a deficiency in the Travis build model. Travis' container support allows only their pre-built containers, and you cannot use any sudo commands in your build steps. R projects require somewhat heavy dependencies to build (LaTeX being the largest culprit), and many package dependencies have to be compiled from source. As a result a build often spends 90% or more of its time downloading and installing dependencies.

Docker Containers

One solution to this dependency problem is Docker containers. These containers allow you to distribute an application stack as a self-contained, standalone package. The benefit of using Docker containers for continuous integration is that once you have downloaded the container for the first time, subsequent builds using it start nearly instantly.

Fortunately the R community has realized the utility of Docker containers and maintains an up-to-date collection of R containers in the Rocker project. These provide all the necessary dependencies for both the release version of R and daily builds of R-devel. In addition there are containers with pre-installed packages from the Hadleyverse and ROpenSci projects. Bioconductor also maintains a set of Docker containers based on the Rocker base images which contain pre-built Bioconductor packages. All of these containers can be found on the Docker Hub, a large collection of community-contributed Docker containers.


Wercker

Wercker is described as

an open automation platform for developing, building and delivering your applications

It was historically based on a traditional build stack, however on April 2nd they introduced a new Docker based stack which they call Ewok. This allows you to use any Docker container hosted on Docker Hub as the base image for your build.

Each Wercker build consists of a series of steps, which can be either a series of shell commands to run or pre-defined steps from the Step Registry. Anyone can create a new step and add it to the registry, and I have created a series of steps for R.

In addition because Wercker stores the results of the build as a Docker container you can download them afterwards and inspect the results (wercker pull). Builds can even be re-run locally under the same environment as the build server using wercker build.

The Solution

So using this new Docker-based Ewok stack, we can use a Rocker container and have our dependencies available nearly instantly after the first build!

Once you have the new application set up in Wercker, you can simply add the following content to a wercker.yml file in your package's root directory.

box: rocker/hadleyverse
build:
  steps:
    - jimhester/r-dependencies
    - jimhester/r-check
    - jimhester/r-lint
    - jimhester/r-coverage

This will install all dependencies for your package, build and check it, lint the code with lintr, and generate code coverage with covr, uploading the results to Codecov.

For more detailed setup instructions please see jimhester/wercker-r-example.

Custom Containers

The above is already a large improvement over Travis builds, however perhaps your package has a large package dependency which is not already present in the Rocker or Bioconductor images. Never fear, you can simply create your own custom image with the dependency installed and upload it to Docker Hub. Then simply change the box: value in wercker.yml to the location of your new image and get instant builds even with heavy dependencies!


Apr 23, 2014

Responsive Remote Completion

I have been using remote completion (completing paths on remote boxes) for a long time, as it is available in both zsh and bash. However, because establishing an SSH connection is slow, the completion is not terribly useful: you spend more time waiting for it to complete than it saves in typing. But if you set up a few SSH options to keep a master connection alive and have subsequent SSH connections use that master connection, it becomes very snappy.

First you need to set up password-less authentication; the easiest way is to use ssh-copy-id.

ssh-copy-id remote-server

Then make a controlmasters directory in your .ssh directory, which will store the control master session information.

mkdir -p ~/.ssh/controlmasters/

Lastly add the following lines to your .ssh/config file.

Host *
  ControlPath ~/.ssh/controlmasters/%r@%h:%p
  ControlMaster auto
  ControlPersist yes

Then use your now snappy ssh completions!
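
Because ssh_config uses the first value obtained for each option, any host-specific exceptions must appear above the Host * block. For example, to disable multiplexing for a host that handles it poorly (github.com is only an illustrative choice here), a sketch like this could be placed before Host *:

```
Host github.com
  ControlMaster no
  ControlPath none
```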

Feb 12, 2014

ZSH global aliases

Just a short post about two useful global aliases I created. ZSH global aliases are essentially variables which are expanded before the command is executed. This allows them to be placed anywhere on the line, not just at the start like traditional aliases. Bash (as far as I know) has no analog to ZSH global aliases, but I have found them very useful.

Both of them produce exactly the same output (unique lines in a file), but in two different ways.

alias -g U="awk '!a[\$0]++'"
alias -g SU='(sort | uniq)'

The first uses awk to hash the lines seen, printing a line only if it is not already in the hash. This finds the unique lines without sorting, which preserves the original order and is usually much, much faster. The only issue is that it can exhaust system memory if used on extremely large files.
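
To see the order-preserving behavior concretely, here is a small sketch (the input is fabricated with printf):

```shell
# Print each line only the first time it is seen; first-seen order is kept.
printf 'banana\napple\nbanana\ncherry\napple\n' | awk '!a[$0]++'
# banana
# apple
# cherry
```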

How much faster is hashing for uniques rather than sorting for uniques?

Note these commands use GNU shuf (installed here as gshuf) to randomly shuffle the input.

gshuf /usr/share/dict/words | head -n 100000 > words #get 100000 random words
(for i in {1..100};do gshuf words | head -n 1000;done;cat words) | gshuf > dup_words #randomly pick 1000 words to duplicate 100 times, append the full list and randomize

Now let's count the unique lines in the file

time awk '!a[$0]++'< dup_words | wc -l
#  100000
#  awk '!a[$0]++' < dup_words  0.14s user 0.00s system 99% cpu 0.149 total
#  wc -l  0.00s user 0.00s system 2% cpu 0.147 total
time (sort | uniq) < dup_words | wc -l
#  100000
#  ( sort | uniq; ) < dup_words  2.45s user 0.01s system 100% cpu 2.438 total
#  wc -l  0.00s user 0.00s system 0% cpu 2.437 total

So you get the same results 16X faster using the hashing strategy.

And let's use the aliases which started this whole mess

U < dup_words | wc -l
SU < dup_words | wc -l

cat dup_words | U | wc -l
cat dup_words | SU | wc -l

All in all, this lets you find unique lines faster and with less typing than the ubiquitous | sort | uniq pattern.

Mar 28, 2013

Stylish reports with Knitr and Bootstrap

I am pleased to announce my project for generating styled, dynamic HTML reports: knitr_bootstrap. Here is an example of a report.

I have been using knitr to produce reports for around six months now, first using pandoc to generate the HTML; lately I have been using the Rmarkdown R package instead. This has a couple of advantages over pandoc. First, it is much easier to install, as anyone who has tried to install Haskell on CentOS can attest. Second, the Rmarkdown package is the same one used by RStudio. While I do not use RStudio much myself, preferring the power of vim, its ease of use and growing ubiquity mean that more people can make use of the package.

The only issue with Rmarkdown is that the default stylesheet is fairly simplistic, and the table of contents is not as full-featured as the one generated by pandoc. I would also prefer to have plots embedded as thumbnails rather than full size, with the option to click on them to zoom. It would also be nice to style the report using Bootstrap, which gives it a much more polished feel.

After a fair amount of work understanding the jQuery framework and Bootstrap, I have gotten the reports feature-full enough for others to try. Current features include

  • Automatic table of contents constructed from h1-4 tags using tocify
  • Output automatically resized to fit the display
  • Code/Output block and plot visibility can be toggled with a mouse click
  • Images are automatically thumbnailed and lightboxed using fancybox
  • Code blocks in >50 languages automatically highlighted using highlight.js
  • Automatically wraps code block with div elements to work with bootstrap
  • Can restyle with bootstrap compatible css
  • Completely offline reports using knitr_bootstrap_standalone.html

It does require a small patch to the Rmarkdown package to work as I am using it, but it should be a fairly easy one to accept, and if not, workarounds are possible, if less elegant. Please give it a try and let me know your thoughts!

Mar 11, 2013

Setting ggplot2 default color scales

ggplot2 is a very nice R plotting library; however, some people do not like the default color scales for its plots. You can explicitly set the color scale for each one of your plots, but a better solution is to simply change the defaults.

# default colors
qplot(data = iris, Petal.Length, Petal.Width, color = Species, size = I(4))

# change default without arguments
scale_colour_discrete <- scale_colour_grey

# change default with arguments
scale_colour_discrete <- function(...) scale_color_brewer(palette = "Set1")

Jan 22, 2013

Parsing gff files with Rcpp for use with ggbio

The ggbio package is a great tool for visualizing next generation sequencing data. It combines the graphing power of the ggplot2 package with the biology-aware data structures in Bioconductor. The package includes support for plotting genes from the standard databases supplied by Bioconductor, which works well for heavily studied organisms such as human and mouse.

If you are interested in a less well-annotated organism, there is no prebuilt database to pull from. In this case the gene annotation is often described in a general feature format (gff) file. This specification has gone through a number of revisions over the years, the most current of which is gff3.
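
For reference, each gff3 record is nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes), with the parent/child relationships encoded in the attributes column. A minimal sketch, loosely based on the EDEN example from the gff3 specification:

```
##gff-version 3
ctg123	example	gene	1000	9000	.	+	.	ID=gene00001;Name=EDEN
ctg123	example	mRNA	1050	9000	.	+	.	ID=mRNA00001;Parent=gene00001;Name=EDEN.1
ctg123	example	exon	1050	1500	.	+	.	ID=exon00002;Parent=mRNA00001
```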

Unfortunately, there is no current function to import gff3 into the GRangesList format which is used by ggbio to plot genes. The specification is also somewhat complex to parse in R, due to there being multiple levels of relationships and optional fields.

The genomeIntervals package contains a function to read gff3 files, however this creates a "Genome_intervals_stranded" object, not the GRangesList we need for ggbio. The easyRNASeq package has a function to convert the Genome_intervals_stranded object into a GRangesList, but that seems like a large unrelated dependency; it would probably be easier to parse the file and create the object directly as a GRangesList. Fortunately, using Rcpp, some tips from the above packages and some convenient helper functions, it is actually fairly straightforward.

First we create a helper function to do string tokenizing.

#include <string>
#include <vector>
using namespace std;

vector<string> inline split_string(const string &source, const char *delimiter = " ", bool keep_empty = false) {
    vector<string> results;

    size_t prev = 0;
    size_t next = 0;

    while ((next = source.find_first_of(delimiter, prev)) != string::npos) {
        if (keep_empty || (next - prev != 0)) {
            results.push_back(source.substr(prev, next - prev));
        }
        prev = next + 1;
    }
    if (prev < source.size()) {
        results.push_back(source.substr(prev));
    }
    return results;
}

Then the C++ parsing code converts the gff file to a data frame. Note that we have to store the entire file in memory before constructing the data frame, in order to determine the number of columns, due to the optional attributes.

#include <Rcpp.h>
#include <fstream>
#include <map>
#include <set>
using namespace std;
#define FIELD_SIZE 8 // seqid, source, type, start, end, score, strand, phase

// [[Rcpp::export]]
Rcpp::DataFrame gff2df(std::string file, const char *attribute_sep = "=") {
  vector< vector< string > > column_strings(FIELD_SIZE);
  vector< map< string, string > > attribute_list;
  set< string > attribute_types;
  ifstream in(file.c_str());
  string rec;
  while (getline(in, rec)) {
    if (rec.empty()) continue;
    if (rec.at(0) != '#') { //not a comment line
      vector< string > strings = split_string(rec, "\t");
      for (uint i = 0; i < strings.size() - 1; ++i) {
        column_strings.at(i).push_back(strings.at(i));
      }
      // the final field holds the semicolon-delimited attributes
      vector< string > attribute_strings = split_string(strings.at(strings.size() - 1), ";");
      map< string, string > line_attributes;
      for (uint i = 0; i < attribute_strings.size(); ++i) {
        vector< string > pair = split_string(attribute_strings.at(i), attribute_sep);
        line_attributes[pair.at(0)] = pair.at(1);
        if (attribute_types.find(pair.at(0)) == attribute_types.end()) {
          attribute_types.insert(pair.at(0));
        }
      }
      attribute_list.push_back(line_attributes);
    } else if (rec.find("##FASTA") != string::npos) {
      break; // the rest of the file is sequence data
    }
  }
  const char *fields[] = {"seqid", "source", "type", "start", "end", "score", "strand", "phase"};
  Rcpp::List result;
  Rcpp::CharacterVector names;
  for (uint i = 0; i < FIELD_SIZE; ++i) {
    result.push_back(column_strings.at(i));
    names.push_back(fields[i]);
  }
  for (set< string >::const_iterator it = attribute_types.begin(); it != attribute_types.end(); ++it) {
    vector< string > column_data;
    for (vector< map< string, string > >::const_iterator vec_it = attribute_list.begin(); vec_it != attribute_list.end(); ++vec_it) {
      column_data.push_back(vec_it->count(*it) == 1 ? vec_it->find(*it)->second : "NA");
    }
    result.push_back(column_data);
    names.push_back(*it);
  }
  result.attr("names") = names;
  return Rcpp::DataFrame(result);
}

This gives us the file as a data frame in R. We then need to transform the data frame into a GRangesList object for ggbio. One problem with constructing the data frame in the C++ code the way that I did is that all the columns are created as strings, even though a number of the columns are numeric and the others can probably be factors. Luckily the all_numeric function from the Hmisc package will do the test and conversion for us.

# from Hmisc library
all_numeric = function(x, what = c("test", "vector"), extras = c(".", "NA")) {
    what <- match.arg(what)
    old <- options(warn = -1)
    on.exit(options(old))
    x <- sub("[[:space:]]+$", "", x)
    x <- sub("^[[:space:]]+", "", x)
    xs <- x[!x %in% c("", extras)]
    isnum <- !any(is.na(as.numeric(xs)))
    if (what == "test") {
        isnum
    } else if (isnum) {
        as.numeric(x)
    } else {
        x
    }
}

Then all we have to do is split the data based on seqid and grouping variable to construct the GRangesList object we want.

read_gff = function(file, grouping = "Parent", attribute_sep = "=") {
    data = data.frame(lapply(gff2df(file, attribute_sep), all_numeric, what = "vector"))
    if (!grouping %in% colnames(data)) {
        stop(paste(grouping, "does not exist in", file, ", valid columns are",
            paste(colnames(data), collapse = " ")))
    }
    lapply(split(data, factor(data$seqid)), function(df) {
        split(GRanges(ranges = IRanges(start = df$start, end = df$end), seqnames = df$seqid,
            strand = df$strand, mcols = df[, !colnames(df) %in% c("seqid", "strand",
                "start", "end")]), df[, grouping], drop = TRUE)
    })
}

This function can then be used to read and plot a gene annotation with ggbio

gff = read_gff("data/eden.gff", grouping = "Name")

## Object of class "ggbio"


Note that parsing gtf files can be done with the same code; you just need to change the attribute separator from '=' to ' '.
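
For comparison, a gtf attributes column uses space-separated keys and quoted values. A hypothetical record (the IDs are made up for illustration) looks like:

```
chr1	example	exon	3044314	3044814	.	+	.	gene_id "gene1"; transcript_id "tx1";
```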

gtf = read_gff("data/Mus_musculus.GRCm38.70.gtf", grouping = "gene_id", attribute_sep = " ")

## Object of class "ggbio"


The code for all of the above functions is in this gist.

Dec 26, 2012

Plotting manual fitted model predictions using ggplot

ggplot provides convenient smoothing functions for fitting models to data with the built-in geom_smooth and stat_smooth methods.

(points = ggplot(data = mtcars, aes(x = hp, y = mpg)) + geom_point())

(points_smoothed = points + geom_smooth(method = "lm", se = F))

(one_facet <- points_smoothed + facet_wrap(~cyl))

When you are faceting data, either spatially or by color/linetype/shape, doing the subsetting and model fitting manually can be somewhat daunting.

(two_facet = points_smoothed + facet_grid(cyl ~ gear))

However, once you understand the process and are familiar with the plyr library of functions, it is actually very straightforward.

One facet


library(plyr)

## Attaching package: 'plyr'
## The following object(s) are masked from 'package:IRanges':
##     compact, desc, rename

models = dlply(mtcars, .(cyl), function(df) lm(mpg ~ hp, data = df))
predictions = ldply(models, function(mod) {
    grid = expand.grid(hp = sort(unique(mod$model$hp)))
    grid$pred = predict(mod, newdata = grid)
    grid
})
one_facet + geom_line(data = predictions, aes(y = pred), linetype = "dashed",
    size = 2)

The only change for two facets is how you break up the models

Two facets

models = dlply(mtcars, .(cyl, gear), function(df) lm(mpg ~ hp, data = df))
predictions = ldply(models, function(mod) {
    grid = expand.grid(hp = sort(unique(mod$model$hp)))
    grid$pred = predict(mod, newdata = grid)
    grid
})
two_facet + geom_line(data = predictions, aes(y = pred), linetype = "dashed",
    size = 2)

If you want to perform predictions across the full range of data, you can use expand.grid with the full dataset rather than just the subset; this is analogous to the fullrange option of stat_smooth.

grid = expand.grid(hp = sort(unique(mtcars$hp)))
models = dlply(mtcars, .(cyl, gear), function(df) lm(mpg ~ hp, data = df))
predictions = ldply(models, function(mod) {
    grid$pred = predict(mod, newdata = grid)
    grid
})
points + stat_smooth(fullrange = T, se = F, method = "lm") + facet_grid(cyl ~
    gear) + geom_line(data = predictions, aes(y = pred), linetype = "dashed",
    size = 2)

So you can see that plotting manual predictions is actually very straightforward, and this can be a powerful technique in exploratory data analysis.

Dec 18, 2012

One liner for perl dependencies

If your module is FooBar and you are using cpanminus, then

cpanm `perl -MFooBar -e 'print join("\n", keys %INC),"\n"'`

will install all the dependencies needed for that module.

This however will not work if you do not have the modules installed to run the script in the first place; but if you install the Devel::Modlist package it is as simple as

cpanm `perl -MDevel::Modlist=stdout,noversion FooBar.pl`
Dec 17, 2012

Setting up a local cpan using cpanminus without root access

When asked why colleagues do not use Perl modules in their work, often the response is that they do not know how to install them without root access to the server they are working on. CPAN can be configured to install to a different base directory; however, this requires a number of options to be set correctly, and can be a pain to get set up.

However, using cpanminus and the local::lib module makes this process as painless as running three simple commands, easy enough to set up for just about anyone. Note that I turn off testing in the following commands, which I encourage you to do as well; I have found there are some false-positive failures, and skipping tests saves time.

First you have to download cpanminus and install it and the local::lib module, change /foo to the directory you would like to install the modules to

wget -O- http://cpanmin.us | perl - -l /foo App::cpanminus local::lib --notest

Then use the local::lib package to set up the required environment variables to point to your new module path for the current login session

eval $(perl -I /foo/lib/perl5 -Mlocal::lib=/foo)

Finally, add that command to a login script for your shell so it will be run automatically when you log in, e.g. .profile, .bash_profile, or .zshenv.

echo 'eval $(perl -I /foo/lib/perl5 -Mlocal::lib=/foo)' >> .profile

I also like to set --notest by default, so I don't have to test every module I install.

echo 'export PERL_CPANM_OPT="--notest"' >> .profile

Then you can install a module into the correct directory, e.g. the Statistics::Descriptive package, with

cpanm Statistics::Descriptive

It doesn't get much easier than that!