R: Parsing Fasta Files

As a quick follow-up to my previous posts about parsing fasta files in perl, python, and ruby I wanted to make a quick note about a efficient way to get the data into R.

#include <fstream>
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
CharacterVector read_fasta(std::string file) {
CharacterVector records;
std::ifstream in(file.c_str());
in.get(); // remove first '>'
std::string rec;
while(getline(in,rec,'>')){
int newLineLoc = rec.find('\n');
std::string header = rec.substr(0,newLineLoc);
std::string sequence = rec.substr(newLineLoc+1, rec.length()-newLineLoc-2);
sequence.erase(std::remove(sequence.begin(),sequence.end(),'\n'),sequence.end());
records[header]=sequence;
}
return(records);
}
view raw read_fasta.cpp hosted with ❤ by GitHub
library(Rcpp)
sourceCpp("read_fasta.cpp")
library(microbenchmark)
fasta_lengths <- function(file) {
    records = read_fasta(file)
    sapply(records, nchar)
}
microbenchmark(fasta_lengths("Hg19.fa"), times = 1)

And the results

## Unit: seconds
##                       expr   min    lq median    uq   max
## 1 fasta_lengths("Hg19.fa") 33.99 33.99  33.99 33.99 33.99

So this is actually faster than the python implementation, an impressive feat, Rcpp is a very nice package!

Avatar
Jim Hester
Software Engineer

I’m a Senior Software Engineer at Netflix and R package developer.

Related