BioPerl is a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics. It is open source and widely used in the bioinformatics community. Bioperl provides software modules for many of the typical tasks of bioinformatics programming. These include:
The following script retrieves a sequence from Swiss-Prot:
#!/usr/bin/perl
use strict;
use Bio::Perl;
# it retrieves sequences from swissprot and generates fasta output
# this script will not work if you are not connected to the Internet
my $s = get_sequence('swiss',$ARGV[0]);
write_sequence(">$ARGV[1]",'fasta',$s);
usage:
$ perl get_swissprot_sequence.pl P11217 a.fasta $ perl get_swissprot_sequence.pl PYGM_HUMAN b.fasta
Perl commandline parameters are stored in the @ARGV array. The first argument $ARGV[0] stores the Swiss-Prot accession number or identifier. The second argument $ARGV[1] defines the output file.
For more on use SeqIO with files, please refer to http://www.bioperl.org/wiki/Sequence_formats
When working with fasta or other files, you have to first create sequence objects.
#!/usr/bin/perl use strict; use warnings; use Bio::Perl; use Bio::SeqIO; # create sequence object my $s = Bio::SeqIO->new( -file => "pygm.fasta", -format => "fasta"); my $st = $s->next_seq; print $st->seq;
usage:
$ perl create_sequence_object.pl
It takes a file named pygm.fasta as input and creates a sequence object. The last two lines are for printing the sequence.
If you have multiple fasta sequences in a file, SeqIO would create multiple sequence object for you automatically. To print all the sequences, you can use the while loop in the last line.
#!/usr/bin/perl
use strict;
use warnings;
use Bio::Perl;
use Bio::SeqIO;
# create sequence objects
my $s = Bio::SeqIO->new( -file => $ARGV[0], -format => "fasta");
# print sequences
while (my $st = $s->next_seq) { print $st->seq; print "\n"; }
usage:
$ perl create_sequence_objects.pl pygm.fasta
| Name | Returns | Example | Note |
|---|---|---|---|
| new | Sequence object | $so = Bio::Seq->new(-seq => "MPQRAS") | create a new one, see Bio::Seq for more |
| seq | sequence string | $seq = $so->seq | get or set the sequence |
| display_id | identifier | $so->display_id("NP_123456") | get or set an identifier |
| primary_id | identifier | $so->primary_id(12345) | get or set an identifier |
| desc | description | $so->desc("Example 1") | get or set a description |
| accession | identifier | $acc = $so->accession | get or set an identifier |
| length | length, a number | $len = $so->length | get the length |
| alphabet | alphabet | $so->alphabet('dna') | get or set the alphabet ('dna','rna','protein') |
| subseq | sequence string | $string = $seq_obj->subseq(10,40) | Arguments are start and end |
| trunc | Sequence object | $so2 = $so1->trunc(10,40) | Arguments are start and end |
| revcom | Sequence object | $so2 = $so1->revcom | Reverse complement |
| translate | protein Sequence object | $prot_obj = $dna_obj->translate | See the Bioperl Tutorial for more |
| species | Species object | $species_obj = $so->species | See Bio::Species for more |
| seq_version | version, if available | $so->seq_version("1") | get or set a version |
| keywords | keywords, if available | @array = $so->keywords | get or set keywords |
| namespace | namespace, if available | $so->namespace("Private") | get or set the name space |
| authority | authority, if available | $so->authority("FlyBase") | get or set the organization |
| get_secondary_accessions | array of secondary accessions, if available | @accs = $so->get_secondary_accessions | get other identifiers |
| division | division, if available (e.g. PRI) | $div = $so->division | get division (e.g. "PRI") |
| molecule | molecule type, if available | $type = $so->molecule | get molecule (e.g. "RNA", "DNA") |
| get_dates | array of dates, if available | @dates = $so->get_dates | get dates |
| pid | pid, if available | $pid = $so->pid | get pid |
| is_circular | Boolean | if $so->is_circular { # } | get or set |
Table source: http://www.bioperl.org/wiki/HOWTO:Beginners
Following code retrieves a Genbank sequence and creates an object.
#!/usr/bin/perl use strict; use warnings; use Bio::DB::GenBank; use Data::Dumper; # create a GenBank object my $a = Bio::DB::GenBank->new; my $b = $a->get_Seq_by_acc($ARGV[0]); # Dump Data print Dumper($b);
usage:
$ perl retrieve_genbank_sequence.pl EW695397
Dumper prints the contents of an object.
Aliging sequences using BLAST is the most common task performed in bioinformatics. Here is how you can do it using BioPerl:
#!/usr/bin/perl
use strict;
use warnings;
use Bio::Perl;
# get sequence given an identifier or accession number
my $so = get_sequence('swiss',$ARGV[0]);
# get blast the sequence
my $blast_result = blast_sequence($so);
# write blast results into a file
write_blast(">$ARGV[1]",$blast_result);
usage:
$ chmod 755 blastseq.pl $ perl blastseq.pl PYGM_HUMAN pygm.blast
The above example retrieves a sequence from Swiss-Prot and then blasts the retrieved sequence against NCBI blast server. The following example blasts a given sequence against NCBI blast server.
#!/usr/bin/perl
use strict;
use warnings;
use Bio::Perl;
# get sequence
my $blast_result = blast_sequence('MFVEGGTFASEDDDSASAEDE');
# write output to file
write_blast(">$ARGV[0]",$blast_result);
usage:
$ chmod 755 blastseq2.pl $ perl blastseq2.pl test.blast
The next example takes a fasta file as input which is used to blast against NCBI blast server.
#!/usr/bin/perl
use strict;
use warnings;
use Bio::Perl;
use Bio::SeqIO;
# create sequence object from the fasta file
my $s = Bio::SeqIO->new(
-file => $ARGV[0],
-format => "fasta"
);
while (my $st = $s->next_seq)
{
# uncomment to print fasta sequence
# print $st->seq;
# blast against NCBI blast server
my $blast_result = blast_sequence($st->seq);
# write blast results to the specified file
write_blast(">$ARGV[1]",$blast_result);
}
usage:
$ chmod 755 blastseq3.pl $ perl blastseq3.pl pygm.fasta pygm.blast
BioPerl documentation on ClustalW is great but I faced some problems as a beginner. The following is a code from ClustalW docs modified to make life easier for the beginner.
1. Make sure bioperl-run in installed in addition to BioPerl.
2. Make sure clustalw is installed at executable
3. Set path using the following command (assuming that clustalw is installed at /usr/local/bin/clustalw2):
export CLUSTALDIR=/usr/local/bin/clustalw2
#!/usr/bin/perl
use Bio::AlignIO;
use Bio::Root::IO;
use Bio::Seq;
use Bio::SeqIO;
use Bio::SimpleAlign;
use Bio::TreeIO;
BEGIN { $ENV{CLUSTALDIR} = '/usr/local/bin/clustalw2/' }
use Bio::Tools::Run::Alignment::Clustalw;
# Build a clustalw alignment factory
@params = ('ktuple' => 2, 'matrix' => 'BLOSUM');
$factory = Bio::Tools::Run::Alignment::Clustalw->new(@params);
# Pass the factory a list of sequences to be aligned.
$inputfilename = 'blastdump/input.fasta';
$aln = $factory->align($inputfilename); # $aln is a SimpleAlign object.
# or
$seq_array_ref = \@seq_array;
# where @seq_array is an array of Bio::Seq objects
$aln = $factory->align($seq_array_ref);
# Or one can pass the factory a pair of (sub)alignments
#to be aligned against each other, e.g.:
$aln = $factory->profile_align($aln1,$aln2);
# where $aln1 and $aln2 are Bio::SimpleAlign objects.
# Or one can pass the factory an alignment and one or more unaligned
# sequences to be added to the alignment. For example:
$aln = $factory->profile_align($aln1,$seq); # $seq is a Bio::Seq object.
# Get a tree of the sequences
$tree = $factory->tree(\@seq_array);
# Get both an alignment and a tree
($aln, $tree) = $factory->run(\@seq_array);
# Do a footprinting analysis on the supplied sequences, getting back the
# most conserved sub-alignments
my @results = $factory->footprint(\@seq_array);
foreach my $result (@results) {
print $result->consensus_string, "\n";
}
# There are various additional options and input formats available.
# See the DESCRIPTION section that follows for additional details.