Brought to you by molecularsciences.org.
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 License.
This publication may not be redistributed without this notice.

BioPerl

BioPerl is a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics. It is open source and widely used in the bioinformatics community. Bioperl provides software modules for many of the typical tasks of bioinformatics programming. These include:

Retrieving sequences from Swiss-Prot

The following script retrieves a sequence from Swiss-Prot:

#!/usr/bin/perl
use strict;
use Bio::Perl;

# it retrieves sequences from swissprot and generates fasta output
# this script will not work if you are not connected to the Internet

my $s = get_sequence('swiss',$ARGV[0]);
write_sequence(">$ARGV[1]",'fasta',$s);

usage:

$ perl get_swissprot_sequence.pl P11217 a.fasta
$ perl get_swissprot_sequence.pl PYGM_HUMAN b.fasta

Perl commandline parameters are stored in the @ARGV array. The first argument $ARGV[0] stores the Swiss-Prot accession number or identifier. The second argument $ARGV[1] defines the output file.

For more on use SeqIO with files, please refer to http://www.bioperl.org/wiki/Sequence_formats

Creating sequence objects

When working with fasta or other files, you have to first create sequence objects.

#!/usr/bin/perl
use strict;
use warnings;
use Bio::Perl;
use Bio::SeqIO;

# create sequence object
my $s = Bio::SeqIO->new( -file => "pygm.fasta", -format => "fasta");
my $st = $s->next_seq;
print $st->seq;

usage:

$ perl create_sequence_object.pl

It takes a file named pygm.fasta as input and creates a sequence object. The last two lines are for printing the sequence.

If you have multiple fasta sequences in a file, SeqIO would create multiple sequence object for you automatically. To print all the sequences, you can use the while loop in the last line.

#!/usr/bin/perl
use strict;
use warnings;
use Bio::Perl;
use Bio::SeqIO;

# create sequence objects
my $s = Bio::SeqIO->new( -file => $ARGV[0], -format => "fasta");

# print sequences
while (my $st = $s->next_seq) { print $st->seq; print "\n"; }

usage:

$ perl create_sequence_objects.pl pygm.fasta

Table 1: Sequence Object Methods
Name Returns Example Note
new

Sequence object $so = Bio::Seq->new(-seq => "MPQRAS") create a new one, see Bio::Seq for more
seq sequence string

$seq = $so->seq get or set the sequence
display_id identifier $so->display_id("NP_123456") get or set an identifier
primary_id identifier $so->primary_id(12345) get or set an identifier
desc description $so->desc("Example 1")

get or set a description
accession identifier $acc = $so->accession get or set an identifier
length

length, a number $len = $so->length get the length
alphabet alphabet $so->alphabet('dna') get or set the alphabet ('dna','rna','protein')

subseq sequence string $string = $seq_obj->subseq(10,40) Arguments are start and end
trunc Sequence object

$so2 = $so1->trunc(10,40) Arguments are start and end
revcom Sequence object $so2 = $so1->revcom Reverse complement
translate protein Sequence object $prot_obj = $dna_obj->translate See the Bioperl Tutorial for more
species

Species object $species_obj = $so->species See Bio::Species for more
seq_version version, if available $so->seq_version("1")

get or set a version
keywords keywords, if available @array = $so->keywords get or set keywords
namespace

namespace, if available $so->namespace("Private") get or set the name space
authority authority, if available $so->authority("FlyBase") get or set the organization

get_secondary_accessions array of secondary accessions, if available @accs = $so->get_secondary_accessions get other identifiers
division division, if available (e.g. PRI)

$div = $so->division get division (e.g. "PRI")
molecule molecule type, if available $type = $so->molecule get molecule (e.g. "RNA", "DNA")
get_dates array of dates, if available @dates = $so->get_dates get dates
pid pid, if available $pid = $so->pid

get pid
is_circular Boolean if $so->is_circular { # } get or set

Table source: http://www.bioperl.org/wiki/HOWTO:Beginners

Retrieving sequences from Genbank

Following code retrieves a Genbank sequence and creates an object.

#!/usr/bin/perl
use strict;
use warnings;
use Bio::DB::GenBank;
use Data::Dumper;

# create a GenBank object
my $a = Bio::DB::GenBank->new;
my $b = $a->get_Seq_by_acc($ARGV[0]);

# Dump Data
print Dumper($b);

usage:

$ perl retrieve_genbank_sequence.pl EW695397

Dumper prints the contents of an object.

Blasting a sequence

Aliging sequences using BLAST is the most common task performed in bioinformatics. Here is how you can do it using BioPerl:

#!/usr/bin/perl
use strict;
use warnings;
use Bio::Perl;

# get sequence given an identifier or accession number
my $so = get_sequence('swiss',$ARGV[0]);

# get blast the sequence
my $blast_result = blast_sequence($so);

# write blast results into a file
write_blast(">$ARGV[1]",$blast_result);

usage:

$ chmod 755 blastseq.pl
$ perl blastseq.pl PYGM_HUMAN pygm.blast

The above example retrieves a sequence from Swiss-Prot and then blasts the retrieved sequence against NCBI blast server. The following example blasts a given sequence against NCBI blast server.

#!/usr/bin/perl
use strict;
use warnings;
use Bio::Perl;

# get sequence
my $blast_result = blast_sequence('MFVEGGTFASEDDDSASAEDE');

# write output to file
write_blast(">$ARGV[0]",$blast_result);

usage:

$ chmod 755 blastseq2.pl
$ perl blastseq2.pl test.blast

The next example takes a fasta file as input which is used to blast against NCBI blast server.

#!/usr/bin/perl
use strict;
use warnings;
use Bio::Perl;
use Bio::SeqIO;

# create sequence object from the fasta file
my $s = Bio::SeqIO->new(
        -file => $ARGV[0],
        -format => "fasta"
);

while (my $st = $s->next_seq)
{
        # uncomment to print fasta sequence
        # print $st->seq;

        # blast against NCBI blast server
        my $blast_result = blast_sequence($st->seq);

        # write blast results to the specified file
        write_blast(">$ARGV[1]",$blast_result);
}

usage:

$ chmod 755 blastseq3.pl
$ perl blastseq3.pl pygm.fasta pygm.blast

ClustalW with BioPerl

BioPerl documentation on ClustalW is great but I faced some problems as a beginner. The following is a code from ClustalW docs modified to make life easier for the beginner.

1. Make sure bioperl-run in installed in addition to BioPerl.
2. Make sure clustalw is installed at executable
3. Set path using the following command (assuming that clustalw is installed at /usr/local/bin/clustalw2):
export CLUSTALDIR=/usr/local/bin/clustalw2

#!/usr/bin/perl
use Bio::AlignIO;
use Bio::Root::IO;
use Bio::Seq;
use Bio::SeqIO;
use Bio::SimpleAlign;
use Bio::TreeIO;

BEGIN { $ENV{CLUSTALDIR} = '/usr/local/bin/clustalw2/' }
use Bio::Tools::Run::Alignment::Clustalw;

#  Build a clustalw alignment factory
  @params = ('ktuple' => 2, 'matrix' => 'BLOSUM');
  $factory = Bio::Tools::Run::Alignment::Clustalw->new(@params);

  #  Pass the factory a list of sequences to be aligned.
  $inputfilename = 'blastdump/input.fasta';
  $aln = $factory->align($inputfilename); # $aln is a SimpleAlign object.
  # or
  $seq_array_ref = \@seq_array;
  # where @seq_array is an array of Bio::Seq objects
  $aln = $factory->align($seq_array_ref);

  # Or one can pass the factory a pair of (sub)alignments
  #to be aligned against each other, e.g.:
  $aln = $factory->profile_align($aln1,$aln2);
  # where $aln1 and $aln2 are Bio::SimpleAlign objects.

  # Or one can pass the factory an alignment and one or more unaligned
  # sequences to be added to the alignment. For example:
  $aln = $factory->profile_align($aln1,$seq); # $seq is a Bio::Seq object.

  # Get a tree of the sequences
  $tree = $factory->tree(\@seq_array);

  # Get both an alignment and a tree
  ($aln, $tree) = $factory->run(\@seq_array);

  # Do a footprinting analysis on the supplied sequences, getting back the
  # most conserved sub-alignments
  my @results = $factory->footprint(\@seq_array);
  foreach my $result (@results) {
    print $result->consensus_string, "\n";
  }

  # There are various additional options and input formats available.
  # See the DESCRIPTION section that follows for additional details.