Perl vs Python vs Ruby: FASTA reading using Bio packages

Since all three languages from my previous post have Bio packages that can parse FASTA files, I did a quick comparison of the performance of the three implementations. The implementations, shown here, are highly similar.

Perl

#!/usr/bin/env perl
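use warnings;
use strict;
use Bio::SeqIO;

# Read the FASTA file named by the first command-line argument.
my $in = Bio::SeqIO->new(-file => shift, -format => 'Fasta');
while (my $rec = $in->next_seq()) {
    # Print each record's ID and sequence length.
    print join(" ", $rec->display_id, $rec->length) . "\n";
}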

Ruby

#!/usr/bin/env ruby
require 'bio'
ff = Bio::FlatFile.new(Bio::FastaFormat, ARGF)
ff.each_entry do |record|
  puts [record.definition, record.nalen.to_s].join(" ")
end

Python

#!/usr/bin/env python
from __future__ import print_function  # print() behaves the same on Python 2 and 3
import sys
from Bio import SeqIO

# Print each record's ID and sequence length.
for record in SeqIO.parse(sys.argv[1], 'fasta'):
    print(record.id, len(record))

Timings

fastaLengths-bio.pl Hg19.fa 65.15s user 11.84s system 99% cpu 1:17.00 total
fastaLengths-bio.rb Hg19.fa 56.07s user 14.18s system 99% cpu 1:10.26 total
fastaLengths-bio.py Hg19.fa 46.85s user 13.11s system 99% cpu 59.969 total

This highlights a major implementation deficiency in the BioPerl and BioRuby FASTA readers: the results here are the exact reverse of those from the simple parsers in my previous post. In the Perl case, the slowdown comes from Bio::SeqIO attempting to identify the sequence as DNA or protein every time next_seq is called. Setting the alphabet in the SeqIO constructor brings the Perl implementation back into the lead by a fair margin.
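To see why guessing the sequence type for every record can get expensive, here is a toy Python sketch. It is not BioPerl's actual code: `guess_alphabet`, `make_record`, the nucleotide character set, and the 0.85 threshold are all made up for illustration. The only point is that a guess has to look at every character of the sequence, while a caller-supplied alphabet skips that scan entirely.

```python
# Toy illustration only: this is NOT BioPerl's implementation, just the
# general idea of guessing an alphabet by scanning a whole sequence.

NUCLEOTIDE_CHARS = set("ACGTUNacgtun")  # assumed character set for this sketch


def guess_alphabet(seq, threshold=0.85):
    """Return 'dna' if most characters look like nucleotides, else 'protein'.

    Returns None for an empty sequence. The threshold is an arbitrary
    value chosen for this sketch.
    """
    if not seq:
        return None
    # This pass touches every character, so its cost grows with len(seq).
    nucleotide_count = sum(1 for ch in seq if ch in NUCLEOTIDE_CHARS)
    if nucleotide_count >= threshold * len(seq):
        return "dna"
    return "protein"


def make_record(seq_id, seq, alphabet=None):
    """Build a minimal record, guessing the alphabet only when none is given."""
    if alphabet is None:
        alphabet = guess_alphabet(seq)  # full scan of the sequence
    return {"id": seq_id, "length": len(seq), "alphabet": alphabet}


if __name__ == "__main__":
    print(make_record("guessed", "ACGTACGTNN"))
    print(make_record("supplied", "ACGTACGTNN", alphabet="dna"))
```

With a record the size of a human chromosome, the scan in `guess_alphabet` is repeated over a very long string for every record, which is the kind of work the `-alphabet` argument avoids.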

Perl 2

#!/usr/bin/env perl
use warnings;
use strict;
use Bio::SeqIO;
my $in = Bio::SeqIO->new(-file => shift, -format => 'Fasta', -alphabet => 'dna');
while (my $rec = $in->next_seq()) {
    print join(" ", $rec->display_id, $rec->length) . "\n";
}

Timing

fastaLengths-bio.pl Hg19.fa 38.50s user 10.76s system 99% cpu 49.267 total
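To put the four wall-clock totals side by side, a small Python sketch can convert the elapsed-time strings to seconds and express each run relative to the first Perl run. The `to_seconds` helper and the dictionary labels are hypothetical names introduced here; the numbers are the "total" values reported in this post.

```python
def to_seconds(elapsed):
    """Convert an elapsed-time string such as '1:17.00' or '59.969' to seconds."""
    seconds = 0.0
    for part in elapsed.split(":"):
        # Each colon-separated field is worth 60x the one to its right.
        seconds = seconds * 60 + float(part)
    return seconds


# "total" column from the timings in this post.
TOTALS = [
    ("perl (alphabet guessed)", "1:17.00"),
    ("ruby", "1:10.26"),
    ("python", "59.969"),
    ("perl (-alphabet dna)", "49.267"),
]

if __name__ == "__main__":
    baseline = to_seconds(TOTALS[0][1])
    for name, elapsed in TOTALS:
        secs = to_seconds(elapsed)
        # Speedup is relative to the first Perl run.
        print("%-24s %7.3f s  %.2fx" % (name, secs, baseline / secs))
```

By this measure, setting the alphabet makes the Perl script about 1.56 times faster than the first Perl run (77.00 s down to 49.267 s).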
Jim Hester
Software Engineer

I’m a Senior Software Engineer at Netflix and R package developer.
