Since all the languages I mentioned in my previous post have Bio packages that can parse FASTA files, I did a quick comparison of the performance of the three implementations. The implementations are listed below; they are highly similar.
Perl
#!/usr/bin/env perl
use warnings;use strict;
use Bio::SeqIO;
my $in = Bio::SeqIO->new(-file => shift, '-format' => 'Fasta');
while(my $rec = $in->next_seq() ){
    print join(" ",$rec->display_id,$rec->length)."\n";
}
Ruby
#!/usr/bin/env ruby
require 'bio'
ff = Bio::FlatFile.new(Bio::FastaFormat,ARGF)
ff.each_entry do |record|
    puts [record.definition, record.nalen.to_s ].join(" ")
end
Python
#!/usr/bin/env python
import sys
from Bio import SeqIO
for record in SeqIO.parse(sys.argv[1],'fasta'):
    print record.id, len(record)
fastaLengths-bio.pl Hg19.fa 65.15s user 11.84s system 99% cpu 1:17.00 total
fastaLengths-bio.rb Hg19.fa 56.07s user 14.18s system 99% cpu 1:10.26 total
fastaLengths-bio.py Hg19.fa 46.85s user 13.11s system 99% cpu 59.969 total
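To put these numbers on a common scale, here is a quick calculation in Python 3. The wall-clock totals are copied from the `time` output above; the variable and function names are just mine for illustration.

```python
# Wall-clock totals copied from the `time` output above, in seconds.
totals = {
    "perl": 1 * 60 + 17.00,   # 1:17.00
    "ruby": 1 * 60 + 10.26,   # 1:10.26
    "python": 59.969,
}


def relative_to_fastest(times):
    """Return each runtime as a multiple of the fastest runtime."""
    best = min(times.values())
    return {name: t / best for name, t in times.items()}


fastest = min(totals, key=totals.get)
ratios = relative_to_fastest(totals)

# Print the implementations from fastest to slowest.
for name in sorted(ratios, key=ratios.get):
    print("%-6s %6.2fs  %.2fx" % (name, totals[name], ratios[name]))
```

This is a single run per language, so the ratios are only a rough guide.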
This highlights a major implementation deficiency in the Perl and Ruby bio projects for reading FASTA files: the ranking here is the exact reverse of the one I got with the simple parsers in my previous post. In the Perl case, the slowdown comes from the BioPerl SeqIO method attempting to identify the sequence as DNA or protein every time next_seq is called. Setting the type in the SeqIO constructor puts the Perl implementation back in the lead by a fair margin.
Perl 2
#!/usr/bin/env perl
use warnings;use strict;
use Bio::SeqIO;
my $in = Bio::SeqIO->new(-file => shift, -format => 'Fasta', -alphabet => 'dna');
while(my $rec = $in->next_seq() ){
    print join(" ",$rec->display_id,$rec->length)."\n";
}
fastaLengths-bio.pl Hg19.fa 38.50s user 10.76s system 99% cpu 49.267 total
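To illustrate why guessing the sequence type per record costs time, here is a toy sketch in plain Python 3 with no Bio dependency. It is not BioPerl's code: `guess_alphabet` and its 90% threshold are a made-up heuristic, and `read_fasta` and `lengths` are hypothetical helpers. The point is only that guessing means an extra pass over every sequence, which a caller can skip by stating the alphabet up front.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def guess_alphabet(seq):
    """Hypothetical heuristic: call it DNA if at least 90% of it is ACGTN."""
    if not seq:
        return "dna"
    upper = seq.upper()
    # This counting is the extra work done once per record when guessing.
    dna_like = sum(upper.count(base) for base in "ACGTN")
    return "dna" if dna_like >= 0.9 * len(seq) else "protein"


def lengths(handle, alphabet=None):
    """Yield (id, length, alphabet); guess only when no alphabet is given."""
    for header, seq in read_fasta(handle):
        kind = alphabet if alphabet is not None else guess_alphabet(seq)
        fields = header.split()
        seq_id = fields[0] if fields else ""
        yield seq_id, len(seq), kind


demo = ">chr1 test\nACGTNNACGT\nACGT\n>prot1\nMKVLAAGIVGLLLAQ\n"
guessed = list(lengths(io.StringIO(demo)))                # guesses per record
fixed = list(lengths(io.StringIO(demo), alphabet="dna"))  # no guessing at all
```

Passing `alphabet="dna"` plays the same role here as `-alphabet => 'dna'` in the Perl script above: the caller already knows the answer, so the parser never has to inspect the sequence to find out.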