As a quick follow-up to my previous posts about parsing fasta files in perl, python, and ruby, I wanted to make a quick note about an efficient way to get the data into R.
```r
library(Rcpp)
sourceCpp("read_fasta.cpp")
library(microbenchmark)

fasta_lengths <- function(file) {
  records <- read_fasta(file)
  sapply(records, nchar)
}

microbenchmark(fasta_lengths("Hg19.fa"), times = 1)
```

And the results:
```
## Unit: seconds
##                       expr   min    lq median    uq   max
## 1 fasta_lengths("Hg19.fa") 33.99 33.
```
It is sometimes useful to have the git tracking files stored in a different location from your working directory, for instance if you would like to back up the git files but your working directory contains files which are much too large to back up. This could be accomplished using symlinks; however, that is a fairly fragile solution. A more robust solution is the one given in this stackoverflow post.
If you already have a git repository you would like to store outside the working directory, say /path/to/repo.
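The approach boils down to git's `--separate-git-dir` option. A minimal sketch, with throwaway paths made up for illustration (not the post's `/path/to/repo`):

```shell
# Create a demo repo, then move its .git directory out of the working tree.
rm -rf /tmp/demo-work /tmp/demo-gitdir
mkdir -p /tmp/demo-work
cd /tmp/demo-work
git init -q .
echo hello > file.txt
git add file.txt
git -c user.email=demo@example.com -c user.name=demo commit -qm "init"
# Re-running init with --separate-git-dir relocates the repository data
# and leaves behind a .git *file* pointing at the new location.
git init -q --separate-git-dir /tmp/demo-gitdir
cat .git             # gitdir: /tmp/demo-gitdir
git status --short   # git commands still work from the working directory
```

After this, backing up `/tmp/demo-gitdir` captures the history without touching the large files in the working directory.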
I dislike java command-line tools for a number of reasons, including their typically extreme verbosity in arguments and lack of one-letter substitutes. Another reason is that they typically do not have a simple shortcut which can be used to run them; you have to do something ugly like
```
java -jar lib/blah/blue/file.jar other_arguments
```

Even if the developer is nice enough to include a wrapper script, they often assume you will be running the program from the directory the java program resides in.
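One fix is a small wrapper that resolves the jar relative to its own location, so it works from any directory. A sketch, reusing the made-up jar path above (the `/tmp/mytool` name is also hypothetical):

```shell
# Write a location-independent wrapper script (all names are illustrative).
cat > /tmp/mytool <<'EOF'
#!/bin/sh
# Find the directory this script lives in, then run the jar relative to it,
# forwarding all arguments -- so it works from any current directory.
DIR=$(cd "$(dirname "$0")" && pwd)
exec java -jar "$DIR/lib/blah/blue/file.jar" "$@"
EOF
chmod +x /tmp/mytool
sh -n /tmp/mytool   # syntax check only; actually running it needs the jar
```

Dropping such a wrapper somewhere on `$PATH` gives the tool a normal one-word command name.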
Our RStudio server instance runs on a server which is not directly connected to the internet at large. We can connect to it through an ssh tunnel to an intermediate server. This works fine for the command line, but to access the RStudio instance we need to be able to connect using our browser. This can be accomplished easily on linux/OSX with the following ssh command.
Assume the tunnel server is tunnel, the remote server is remote, and the RStudio port is 8787.
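With those placeholder names, a local port forward would look like the sketch below; the `ssh -G` line just prints the options ssh would use, without actually connecting:

```shell
# Forward local port 8787 through the jump host to the RStudio server:
#   ssh -L 8787:remote:8787 tunnel
# then open http://localhost:8787 in a local browser.
# `ssh -G` resolves the configuration without opening a connection:
ssh -G -L 8787:remote:8787 tunnel | grep -i localforward
```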
I do most of my work sshing into linux boxes on the command line. In this environment, having a nice history of previous commands is of enormous benefit. I also use the tmux terminal multiplexer to have multiple persistent terminal windows. Using bash's default history with this setup is an exercise in frustration. All of the terminal windows have their own history, and the history file is appended to or overwritten when a given window is exited, so your history can get hopelessly confused depending on the order in which you close your terminal sessions.
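A common remedy (my usual settings, assumed rather than taken from this post) is to make bash append to the history file and flush it after every command, so every tmux pane shares one up-to-date history:

```shell
# In ~/.bashrc: append rather than overwrite, keep a large history,
# and write out / read in new history lines after each command.
shopt -s histappend
HISTSIZE=100000
HISTFILESIZE=200000
PROMPT_COMMAND="history -a; history -n; ${PROMPT_COMMAND:-}"
```

`history -a` appends the session's new lines to the file, and `history -n` reads lines written by other sessions, so closing order no longer matters.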
As a continuation of my previous posts comparing the three major scripting languages used in bioinformatics, I wanted to take a look at the regular expression performance of the three languages. A common use case of regular expressions in bioinformatics is searching for restriction enzyme cut sites in a genome of interest. To benchmark this case I downloaded a list of known restriction enzymes from REBASE in the simple bionet format, then parsed that format and converted it into regular expressions with this code.
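The core of such a conversion is mapping IUPAC ambiguity codes to regex character classes. A toy sketch (not the post's code; only three codes handled, using GANTC, the HinfI recognition site, as the example):

```shell
# Map a few IUPAC ambiguity codes to regex character classes.
# A full converter would cover all codes (W, S, K, M, B, D, H, V, ...).
site="GANTC"
echo "$site" | sed -e 's/N/[ACGT]/g' -e 's/R/[AG]/g' -e 's/Y/[CT]/g'
# -> GA[ACGT]TC
```

The resulting pattern can then be handed to the regex engine of each language under test.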
Since all the languages I mentioned in my previous post have Bio packages which can parse fasta files, I did a quick comparison of the performance of the three implementations. Here are the implementations; they are highly similar.
```
# Perl
fastaLengths-bio.pl Hg19.fa  65.15s user 11.84s system 99% cpu 1:17.00 total
# Ruby
fastaLengths-bio.rb Hg19.fa  56.07s user 14.18s system 99% cpu 1:10.26 total
# Python
fastaLengths-bio.py Hg19.fa  46.85s user 13.
```