As a quick follow-up to my previous posts about parsing fasta files in perl, python, and ruby I wanted to make a quick note about a efficient way to get the data into R.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#include <fstream> | |
#include <Rcpp.h> | |
using namespace Rcpp; | |
// [[Rcpp::export]] | |
CharacterVector read_fasta(std::string file) { | |
CharacterVector records; | |
std::ifstream in(file.c_str()); | |
in.get(); // remove first '>' | |
std::string rec; | |
while(getline(in,rec,'>')){ | |
int newLineLoc = rec.find('\n'); | |
std::string header = rec.substr(0,newLineLoc); | |
std::string sequence = rec.substr(newLineLoc+1, rec.length()-newLineLoc-2); | |
sequence.erase(std::remove(sequence.begin(),sequence.end(),'\n'),sequence.end()); | |
records[header]=sequence; | |
} | |
return(records); | |
} |
library(Rcpp)
sourceCpp("read_fasta.cpp")
library(microbenchmark)
fasta_lengths <- function(file) {
records = read_fasta(file)
sapply(records, nchar)
}
microbenchmark(fasta_lengths("Hg19.fa"), times = 1)
And the results
## Unit: seconds
## expr min lq median uq max
## 1 fasta_lengths("Hg19.fa") 33.99 33.99 33.99 33.99 33.99
So this is actually faster than the python implementation, an impressive feat, Rcpp is a very nice package!