Perl vs Python vs Ruby: Fasta reading using Bio packages

Jul 24, 2012

Since all the languages I mentioned in my previous post have Bio packages that can parse FASTA files, I did a quick comparison of the performance of the three implementations. The implementations are below; they are highly similar.

Perl

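	#!/usr/bin/env perl

	use warnings;
	use strict;
	use Bio::SeqIO;

	# Open the FASTA file named by the first command-line argument.
	my $in = Bio::SeqIO->new(-file => shift, -format => 'Fasta');

	# Print each record's identifier and sequence length.
	while (my $rec = $in->next_seq()) {
		print join(" ", $rec->display_id, $rec->length) . "\n";
	}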


Ruby

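	#!/usr/bin/env ruby

	require 'bio'

	# Read FASTA records from the files named on the command line (or stdin).
	ff = Bio::FlatFile.new(Bio::FastaFormat, ARGF)

	# Print each record's definition line and sequence length.
	ff.each_entry do |record|
	  puts [record.definition, record.nalen.to_s].join(" ")
	end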


Python

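	#!/usr/bin/env python

	# Lets the same script run under both Python 2 and Python 3.
	from __future__ import print_function

	import sys
	from Bio import SeqIO

	# Print each record's identifier and sequence length.
	for record in SeqIO.parse(sys.argv[1], 'fasta'):
	    print(record.id, len(record))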


	fastaLengths-bio.pl Hg19.fa 65.15s user 11.84s system 99% cpu 1:17.00 total
	fastaLengths-bio.rb Hg19.fa 56.07s user 14.18s system 99% cpu 1:10.26 total
	fastaLengths-bio.py Hg19.fa 46.85s user 13.11s system 99% cpu 59.969 total

This highlights a major implementation deficiency in the Perl and Ruby Bio projects for reading FASTA files: the ranking here is the exact reverse of the simple parsers from my previous post. For BioPerl, the slowdown comes from SeqIO trying to identify each sequence as DNA or protein every time next_seq is called. Setting the alphabet in the SeqIO constructor brings the Perl implementation back into the lead by a fair margin.

Perl 2

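	#!/usr/bin/env perl

	use warnings;
	use strict;
	use Bio::SeqIO;

	# Declaring the alphabet up front skips the per-record DNA/protein guess.
	my $in = Bio::SeqIO->new(-file => shift, -format => 'Fasta', -alphabet => 'dna');

	# Print each record's identifier and sequence length.
	while (my $rec = $in->next_seq()) {
		print join(" ", $rec->display_id, $rec->length) . "\n";
	}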


	fastaLengths-bio.pl Hg19.fa 38.50s user 10.76s system 99% cpu 49.267 total
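
To see why a per-record alphabet guess can cost so much, here is a toy sketch in plain Python 3. It is not BioPerl's actual code: `read_fasta`, `guess_alphabet`, `lengths`, and the 0.85 threshold are all made up for illustration. The point is only that guessing has to look at the letters of every sequence, while a declared alphabet skips that scan entirely.

```python
import io

# Letters treated as nucleotides by this toy heuristic (an assumption).
DNA_LETTERS = set("ACGTNacgtn")


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from an open FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # The identifier is the first word after '>'.
            parts = line[1:].split(None, 1)
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def guess_alphabet(seq, threshold=0.85):
    """Hypothetical heuristic: call it DNA if most letters are A/C/G/T/N.

    This scans the whole sequence, so its cost grows with sequence length.
    """
    if not seq:
        return "dna"
    hits = sum(1 for letter in seq if letter in DNA_LETTERS)
    return "dna" if hits / len(seq) >= threshold else "protein"


def lengths(handle, alphabet=None):
    """Yield (id, length, alphabet); guess per record only if none is given."""
    for name, seq in read_fasta(handle):
        kind = alphabet if alphabet is not None else guess_alphabet(seq)
        yield name, len(seq), kind


if __name__ == "__main__":
    demo = io.StringIO(">chr1 demo\nACGTACGT\nNNNN\n>prot1\nMKVLAAGIVGLLLAQ\n")
    for record in lengths(demo):
        print(*record)
```

Passing `alphabet="dna"` to `lengths` plays the same role as `-alphabet => 'dna'` in the Perl script above: the guess is never made, so the extra pass over each chromosome-sized sequence disappears.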

Jim Hester
