R: Parsing Fasta Files

Nov 27, 2012

As a quick follow-up to my previous posts about parsing fasta files in perl, python, and ruby I wanted to make a quick note about a efficient way to get the data into R.

	#include <fstream>
	#include <Rcpp.h>

	using namespace Rcpp;

	// [[Rcpp::export]]
	CharacterVector read_fasta(std::string file) {
	CharacterVector records;
	std::ifstream in(file.c_str());
	in.get(); // remove first '>'
	std::string rec;
	while(getline(in,rec,'>')){
	int newLineLoc = rec.find('\n');
	std::string header = rec.substr(0,newLineLoc);
	std::string sequence = rec.substr(newLineLoc+1, rec.length()-newLineLoc-2);
	sequence.erase(std::remove(sequence.begin(),sequence.end(),'\n'),sequence.end());
	records[header]=sequence;
	}
	return(records);
	}

view raw read_fasta.cpp hosted with ❤ by GitHub

library(Rcpp)
sourceCpp("read_fasta.cpp")
library(microbenchmark)
fasta_lengths <- function(file) {
    records = read_fasta(file)
    sapply(records, nchar)
}
microbenchmark(fasta_lengths("Hg19.fa"), times = 1)

And the results

## Unit: seconds
##                       expr   min    lq median    uq   max
## 1 fasta_lengths("Hg19.fa") 33.99 33.99  33.99 33.99 33.99

So this is actually faster than the python implementation, an impressive feat, Rcpp is a very nice package!

Jim Hester

Software Engineer

I’m a Senior Software Engineer at Netflix and R package developer.

R: Parsing Fasta Files

Jim Hester

Software Engineer

Related