Jim Hester

Big Data Bioinformatics: My thoughts on next generation sequencing, Linux, and programming.
Feb 12, 2014

Just a short post about two useful global aliases I created. ZSH global aliases are snippets which are expanded before the command is executed; unlike traditional aliases, they can be placed anywhere on the line, not just at the start. Bash (as far as I know) does not have an analog to ZSH global aliases, but I have found them very useful.
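For example, a global alias is expanded even when it appears after a command (this particular alias is just an illustration, not one of the aliases from this post):

alias -g L='| less'   # hypothetical example alias
dmesg L               # expands to: dmesg | less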

Both of the aliases below produce exactly the same output (the unique lines of a file), but in two different ways.

alias -g U="awk '!a[\$0]++'"
alias -g SU='(sort | uniq)'

The first uses awk to hash the lines it has seen, and only prints a line if it is not already in the hash. This finds the unique lines without sorting, which preserves the original order and is usually much, much faster. The only issue is that it can exhaust system memory if used on extremely large files.
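To spell out how the awk idiom works:

# a[$0]++ counts how many times each line has been seen; the expression
# !a[$0]++ is true only the first time (when the count is still zero),
# so awk prints each distinct line once, in its original order
awk '!a[$0]++' file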

How much faster is hashing than sorting for finding unique lines?

Note that these commands use GNU shuf (invoked here as gshuf) to randomly shuffle lines.

gshuf /usr/share/dict/words | head -n 100000 > words #get 100000 random words
(for i in {1..100};do gshuf words | head -n 1000;done;cat words) | gshuf > dup_words #randomly pick 1000 words to duplicate 100 times, append the full list and randomize

Now let's count the unique lines in the file.

time awk '!a[$0]++'< dup_words | wc -l
#  100000
#  awk '!a[$0]++' < dup_words  0.14s user 0.00s system 99% cpu 0.149 total
#  wc -l  0.00s user 0.00s system 2% cpu 0.147 total
time (sort | uniq) < dup_words | wc -l
#  100000
#  ( sort | uniq; ) < dup_words  2.45s user 0.01s system 100% cpu 2.438 total
#  wc -l  0.00s user 0.00s system 0% cpu 2.437 total

So you get the same results 16X faster using the hashing strategy.

And let's use the aliases which started this whole mess:

U < dup_words | wc -l
SU < dup_words | wc -l

cat dup_words | U | wc -l
cat dup_words | SU | wc -l

All in all, this lets you find unique lines faster and with less typing than the ubiquitous | sort | uniq pattern.

Mar 28, 2013

I am pleased to announce my project for generating styled, dynamic HTML reports: knitr_bootstrap. Here is an example of a report.

I have been using knitr to produce reports for around 6 months now, first using pandoc to generate the HTML; lately I have been using the Rmarkdown R package instead. This has a couple of advantages over pandoc. First, it is much easier to install, as anyone who has tried to install Haskell on CentOS can attest. Secondly, the Rmarkdown package is the same one used by RStudio. While I do not use RStudio much myself, preferring the power of vim, its ease of use and growing ubiquity mean that more people can make use of the package.

The only issue with Rmarkdown is that the default stylesheet is fairly simplistic, and the table of contents is not as full featured as the one generated by pandoc. I would also prefer to have plots embedded as thumbnails rather than full size, with the option to click on them to zoom. It would also be nice to style the report using Bootstrap, which gives it a much more polished feel.

After a fair amount of work understanding the jQuery framework and Bootstrap, I have gotten the reports feature-full enough for others to try. Current features include:

  • Automatic table of contents constructed from h1-4 tags using tocify
  • Output automatically resized to fit the display
  • Code/Output block and plot visibility can be toggled with a mouse click
  • Images are automatically thumbnailed and lightboxed using fancybox
  • Code blocks in >50 languages automatically highlighted using highlight.js
  • Automatically wraps code blocks with div elements to work with bootstrap
  • Can restyle with bootstrap-compatible css
  • Completely offline report using knitr_bootstrap_standalone.html

It does require a small patch to the Rmarkdown package to work as I am using it, but it should be a fairly easy one to accept, and if not, workarounds are possible, if less elegant. Please give it a try and let me know your thoughts!

Mar 11, 2013

ggplot2 is a very nice R plotting library; however, some people do not like the default color scales for plots. You can explicitly set the color scale for each one of your plots, but a better solution is to simply change the defaults.

# default colors
qplot(data = iris, Petal.Length, Petal.Width, color = Species, size = I(4))

plot of chunk ggplot_default_color_scales_Title

# change default without arguments
scale_colour_discrete <- scale_colour_grey
last_plot()

plot of chunk ggplot_default_color_scales_Title

# change default with arguments
scale_colour_discrete <- function(...) scale_color_brewer(palette = "Set1")
last_plot()

plot of chunk ggplot_default_color_scales_Title

Jan 22, 2013

The ggbio package is a great tool to use for visualizing next generation sequencing data. It combines the graphing power of the ggplot2 package with the biology aware data structures in bioconductor. The package includes support for plotting genes in the standard databases supplied by bioconductor, which works well for heavily studied organisms such as human and mouse.

If you are interested in a less well annotated organism, there is no prebuilt database to pull from. In this case the gene annotation is often distributed as a general feature format (gff) file. This specification has gone through a number of revisions over the years, the most current of which is gff3.

Unfortunately, there is no current function to import gff3 into the GRangesList format which is used by ggbio to plot genes. The specification is also somewhat complex to parse in R, due to there being multiple levels of relationships and optional fields.

The genomeIntervals package contains a function to read gff3 files, however this creates a "Genome_intervals_stranded" object, not the GRangesList we need for ggbio. The easyRNASeq package has a function to convert the Genome_intervals_stranded object into a GRangesList, but that seems like a large unrelated dependency; it would probably be easier to just parse the file and create a GRangesList directly. Fortunately, using Rcpp, some tips from the above packages, and some convenient helper functions, it is actually fairly straightforward.
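The C++ code below references FIELDS, FIELD_SIZE, and the usual headers, which are not shown in the post; I assume a preamble along these lines, using the eight fixed GFF3 column names (the ninth column holds the key=value attributes):

// Assumed preamble (not shown in the post): headers, namespaces, and the
// eight fixed GFF3 column names; the ninth column holds the attributes.
#include <Rcpp.h>
#include <fstream>
#include <map>
#include <set>
#include <string>
#include <vector>
using namespace Rcpp;
using namespace std;

typedef unsigned int uint;  // used by the loops below

const string FIELDS[] = {"seqid", "source", "type", "start",
                         "end", "score", "strand", "phase"};
const uint FIELD_SIZE = 8;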

First we create a helper function to do string tokenizing

vector<string> inline split_string(const string &source, const char *delimiter = " ", bool keep_empty = false) {
    vector<string> results;

    size_t prev = 0;
    size_t next = 0;

    while ((next = source.find_first_of(delimiter, prev)) != string::npos) {
        if (keep_empty || (next - prev != 0)) {
            results.push_back(source.substr(prev, next - prev));
        }
        prev = next + 1;
    }
    if (prev < source.size()) {
        results.push_back(source.substr(prev));
    }

    return results;
}

Then the C++ parsing code converts the gff file to a data frame. Note that we have to store the entire file in memory before constructing the data frame, in order to determine the number of columns, due to the optional attributes.

// [[Rcpp::export]]
Rcpp::DataFrame gff2df(std::string file, const char *attribute_sep = "=") {
  CharacterVector records;
  vector< vector < string > > column_strings(FIELD_SIZE);
  vector< map< string, string > > attribute_list;
  set< string > attribute_types;
  ifstream in(file.c_str());
  string rec;
  int count=0;
  while(getline(in,rec)){
    if(rec.at(0) != '#'){ //not a comment line
      vector< string > strings = split_string(rec,"\t"); 
      for(uint i = 0;i<strings.size()-1;++i){
        column_strings[i].push_back(strings.at(i));
      } 
      vector< string > attribute_strings = split_string(strings.at(strings.size()-1), ";");
      map< string, string> line_attributes;
      for(uint i = 0;i< attribute_strings.size();++i){
        vector< string > pair = split_string(attribute_strings.at(i), attribute_sep);
        line_attributes[pair.at(0)] = pair.at(1);
        if(attribute_types.find(pair.at(0)) == attribute_types.end()){
          attribute_types.insert(pair.at(0));
        }
      }
      attribute_list.push_back(line_attributes);
    }
    else if(rec.find("##FASTA") != string::npos){
      break;
    }
    count++;
  }
  Rcpp::DataFrame result;
  for(uint i = 0;i<FIELD_SIZE;++i){
    result[FIELDS[i]]=column_strings.at(i); 
  }
  for(set< string >::const_iterator it = attribute_types.begin(); it != attribute_types.end(); ++it){
    vector< string > column_data;
    for(vector< map<string, string > >::const_iterator vec_it = attribute_list.begin();vec_it != attribute_list.end();++vec_it){
      if(vec_it->count(*it) == 1){
        column_data.push_back(vec_it->at(*it));
      }
      else{
        column_data.push_back("NA");
      }
    }
    result[*it]=column_data;
  }
  return(result);
}

This gives us the file as a data frame in R. We then need to transform the data frame into a GRangesList object for ggbio. One problem with constructing the data frame in the C++ code the way I did is that all the columns are created as strings, even though a number of them are numeric and the others could probably be factors. Luckily, the all_numeric function from the Hmisc package will do the test and conversion for us.

# from Hmisc library
all_numeric = function(x, what = c("test", "vector"), extras = c(".", "NA")) {
    what <- match.arg(what)
    old <- options(warn = -1)
    on.exit(options(old))
    x <- sub("[[:space:]]+$", "", x)
    x <- sub("^[[:space:]]+", "", x)
    xs <- x[!x %in% c("", extras)]
    isnum <- !any(is.na(as.numeric(xs)))
    if (what == "test") {
        isnum
    } else if (isnum) {
        as.numeric(x)
    } else {
        x
    }
}

Then all we have to do is split the data on seqid and the grouping variable to construct the GRangesList object we want.

read_gff = function(file, grouping = "Parent", attribute_sep = "=") {
    data = data.frame(lapply(gff2df(file, attribute_sep), all_numeric, what = "vector"))
    if (!grouping %in% colnames(data)) {
        stop(paste(grouping, "does not exist in", file, ", valid columns are", 
            paste(colnames(data), collapse = " ")))
    }
    stopifnot(grouping %in% colnames(data))
    lapply(split(data, factor(data$seqid)), function(df) {
        split(GRanges(ranges = IRanges(start = df$start, end = df$end), seqnames = df$seqid, 
            strand = df$strand, mcols = df[, !colnames(df) %in% c("seqid", "strand", 
                "start", "end")]), df[, grouping], drop = TRUE)
    })
}

This function can then be used to read and plot a gene annotation with ggbio

library(ggbio)
source("read_gff.R")
gff = read_gff("data/eden.gff", grouping = "Name")
autoplot(gff[[1]])

## Object of class "ggbio"

plot of chunk parsing_gff_gff_plot

## NULL

Note that parsing gtf files can be done with the same code; you just need to change the attribute separator from '=' to ' '.

library(ggbio)
source("read_gff.R")
gtf = read_gff("data/Mus_musculus.GRCm38.70.gtf", grouping = "gene_id", attribute_sep = " ")
autoplot(gtf[[1]])

## Object of class "ggbio"

plot of chunk parsing_gff_gtf_plot

## NULL

The code for all of the above functions is available in this gist.

Dec 26, 2012

ggplot provides convenient smoothing functions for fitting models to data with the built in geom_smooth and stat_smooth methods.

library(ggplot2)
(points = ggplot(data = mtcars, aes(x = hp, y = mpg)) + geom_point())

plot of chunk manual_predictions_ex1

(points_smoothed = points + geom_smooth(method = "lm", se = F))

plot of chunk manual_predictions_ex1

(one_facet <- points_smoothed + facet_wrap(~cyl))

plot of chunk manual_predictions_ex1

When you are faceting data, either spatially or by color/linetype/shape, doing the subsetting and model fitting manually can seem somewhat daunting.

(two_facet = points_smoothed + facet_grid(cyl ~ gear))

plot of chunk manual_predicitons_ex2

However, once you understand the process and are familiar with the plyr library of functions, it is actually very straightforward.

One facet

library(plyr)

## Attaching package: 'plyr'

## The following object(s) are masked from 'package:IRanges':
## 
## compact, desc, rename

models = dlply(mtcars, .(cyl), function(df) lm(mpg ~ hp, data = df))
predictions = ldply(models, function(mod) {
    grid = expand.grid(hp = sort(unique(mod$model$hp)))
    grid$pred = predict(mod, newdata = grid)
    grid
})
one_facet + geom_line(data = predictions, aes(y = pred), linetype = "dashed", 
    size = 2)

plot of chunk manual_predictions_one

The only change for two facets is how you break up the models

Two facets

models = dlply(mtcars, .(cyl, gear), function(df) lm(mpg ~ hp, data = df))
predictions = ldply(models, function(mod) {
    grid = expand.grid(hp = sort(unique(mod$model$hp)))
    grid$pred = predict(mod, newdata = grid)
    grid
})
two_facet + geom_line(data = predictions, aes(y = pred), linetype = "dashed", 
    size = 2)

plot of chunk manual_predictions_two

If you want to perform predictions across the full range of the data, you can use expand.grid with the full dataset rather than just the subset; this is analogous to the fullrange option in stat_smooth.

grid = expand.grid(hp = sort(unique(mtcars$hp)))
models = dlply(mtcars, .(cyl, gear), function(df) lm(mpg ~ hp, data = df))
predictions = ldply(models, function(mod) {
    grid$pred = predict(mod, newdata = grid)
    grid
})
points + stat_smooth(fullrange = T, se = F, method = "lm") + facet_grid(cyl ~ 
    gear) + geom_line(data = predictions, aes(y = pred), linetype = "dashed", 
    size = 2)

plot of chunk manual_predictions_full

So you can see that plotting manual predictions is actually very straightforward, and this can be a powerful technique in exploratory data analysis.

Dec 18, 2012

If your module is FooBar and you are using cpanminus, then

cpanm `perl -MFooBar -e 'print join("\n", keys %INC),"\n"'`

will install all the dependencies needed for that module.

This, however, will not work if you do not already have the modules installed to run the script in the first place, but if you install the Devel::Modlist package it is as simple as

cpanm `perl -MDevel::Modlist=stdout,noversion FooBar.pl`

Dec 17, 2012

When asked why colleagues do not use perl modules in their work, often the response is that they do not know how to install them without root access to the server they are working on. CPAN can be configured to install to a different base directory; however, this requires a number of options to be set correctly and can be a pain to get set up.

However, using cpanminus and the local::lib module makes this process as painless as running three simple commands, easy enough for just about anyone to set up. Note that I turn off testing in the following commands, which I encourage you to do as well; I have found there are some false positive failures, and it saves time.

First, download cpanminus and use it to install itself (App::cpanminus) and the local::lib module; change /foo to the directory you would like to install the modules to

wget -O- http://cpanmin.us | perl - -l /foo App::cpanminus local::lib --notest

Then use the local::lib package to set up the required environment variables to point to your new module path for the current login session

eval $(perl -I /foo/lib/perl5 -Mlocal::lib=/foo)

Finally, add that command to a login script for your shell (e.g. .profile, .bash_profile, or .zshenv) so it will be run automatically when you log in

echo 'eval $(perl -I /foo/lib/perl5 -Mlocal::lib=/foo)' >> .profile

I also like to set a default --notest, so I don't have to test every module I install

echo 'export PERL_CPANM_OPT="--notest"' >> .profile

Then you can install a module into the correct directory, e.g. the Statistics::Descriptive package, with

cpanm Statistics::Descriptive

It doesn't get much easier than that!

Dec 3, 2012

In bioinformatics, the common format for storing short read alignments is the SAM format, which has a binary representation (bam) and an ASCII text form (sam). There is a C API to work with the binary format directly, as well as bindings for most of the common programming languages. Heng Li, the author of the format and of the bwa short read aligner, created the samtools program to work with the sam format and, among many other tasks, convert between bam and sam. However, third-party programs will often only read the ASCII sam files, which typically have a .sam extension, rather than the binary files, which have a .bam extension. In addition, sam files are completely uncompressed, so they can take upwards of 3x or more disk space than the compressed bam files. This gives us motivation to avoid having uncompressed sam files at any point, even as a temporary file which will be deleted.

If the third-party program in question has the ability to accept input from standard input, the solution is very straightforward.

samtools view file.bam | my_program --arguments

However, if the program can only accept named files, people often think the only option is to create a temporary file. Luckily, this is not the case: Linux has long had the ability to treat a pipe as though it were a file, so there is a very clean solution to the problem.

mkfifo file.sam; samtools view file.bam > file.sam& my_program --arguments file.sam; rm file.sam

This creates a named pipe using mkfifo, converts the file using samtools view and puts that in the background, then runs the third-party program on the named pipe, and finally removes the pipe once the program is complete. The sam file is never stored on disk.
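In bash and zsh, process substitution offers the same pipe-as-a-file behaviour in a single expression; the shell creates and cleans up the pipe for you (a shell feature, not something from the original post):

my_program --arguments <(samtools view file.bam)  # <( ) is replaced by a path the program can open like a file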

Nov 27, 2012

As a quick followup to my previous posts about parsing fasta files in perl, python, and ruby, I wanted to make a quick note about an efficient way to get the data into R.
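The read_fasta.cpp source is not included in the post; purely as an illustration, a minimal version might look something like this (my sketch, not the code that produced the timings below):

// Sketch of read_fasta.cpp: read a fasta file and return a named
// character vector of sequences (names taken from the header lines).
#include <Rcpp.h>
#include <fstream>
#include <string>
#include <vector>
using namespace Rcpp;

// [[Rcpp::export]]
CharacterVector read_fasta(std::string file) {
    std::ifstream in(file.c_str());
    std::vector<std::string> names;
    std::vector<std::string> seqs;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == '>') {
            names.push_back(line.substr(1));  // '>' header starts a new record
            seqs.push_back("");
        } else if (!seqs.empty()) {
            seqs.back() += line;              // append sequence lines to the current record
        }
    }
    CharacterVector result = wrap(seqs);
    result.attr("names") = wrap(names);
    return result;
}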

library(Rcpp)
sourceCpp("read_fasta.cpp")
library(microbenchmark)
fasta_lengths <- function(file) {
    records = read_fasta(file)
    sapply(records, nchar)
}
microbenchmark(fasta_lengths("Hg19.fa"), times = 1)

And the results

## Unit: seconds
##                       expr   min    lq median    uq   max
## 1 fasta_lengths("Hg19.fa") 33.99 33.99  33.99 33.99 33.99

So this is actually faster than the python implementation, an impressive feat; Rcpp is a very nice package!

Nov 8, 2012

It is sometimes useful to have the git tracking files stored in a different location from your working directory, for instance if you would like to back up the git files and your working directory contains files which are much too large to back up. This could be accomplished using symlinks; however, that is a fairly fragile solution. A more robust solution is the one given in this stackoverflow post.

If you already have a git repository you would like to store outside the working directory, say /path/to/repo.git, and you are in the working directory

mv .git /path/to/repo.git
echo "gitdir: /path/to/repo.git" > .git
git config core.worktree `pwd -P`

This will move the .git folder to the new location, create a .git file pointing to the new location, and set core.worktree in the git configuration to point to the current working directory. Everything will then work exactly as it did before.

If you are initializing a new empty repository, you can do everything in one line

git --git-dir=/path/to/repo.git --work-tree=`pwd -P` init && echo "gitdir: /path/to/repo.git" > .git

This is a very useful tip for keeping track of all of your repositories, and also allows easy backup, as you can just mirror the folder which contains all of them.