Brought to you by molecularsciences.org.
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 License.
This publication may not be redistributed without this notice.

EMBOSS

EMBOSS (The European Molecular Biology Open Software Suite) is a free, open source software suite for handling bioinformatics problems. It includes extensive libraries for bioinformatics.

Installing EMBOSS

The steps to install EMBOSS on a Linux system are:

  1. Download the stable version of EMBOSS from http://emboss.sourceforge.net/download/
  2. tar xzvf EMBOSS-x.x.tar.gz
  3. cd EMBOSS-x.x
  4. ./configure
  5. make
  6. make install
  7. To test the installation, type wossname
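
The steps above can be sketched as a small shell function. This is only an illustration: install_emboss is a hypothetical helper name, the tarball is assumed to be downloaded already, and /usr/local is assumed as the install prefix (make install may need root privileges there).

```shell
#!/bin/sh
# Sketch: build EMBOSS from an already downloaded source tarball.
# install_emboss is a hypothetical helper, not part of EMBOSS.
install_emboss() {
    tarball="$1"                  # the EMBOSS tarball from step 1
    prefix="${2:-/usr/local}"     # assumed install prefix

    # Read the top-level directory name from the archive itself,
    # so no version number has to be typed by hand.
    srcdir=$(tar tzf "$tarball" 2>/dev/null | head -n 1 | cut -d/ -f1)
    [ -n "$srcdir" ] || return 1

    tar xzf "$tarball" || return 1
    (
        cd "$srcdir" || exit 1
        ./configure --prefix="$prefix" && make && make install
    ) || return 1

    # Step 7: succeeds only if wossname is now found on the PATH.
    command -v wossname
}
```

It would be called as install_emboss followed by the path to the tarball. The function stops at the first failing step, so a broken build does not go unnoticed.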

wossname

All EMBOSS programs run from the Unix command line, and wossname is no exception. wossname produces a list of EMBOSS applications.

> wossname

> wossname seqret

> wossname nucleotide

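Here seqret is a program name and nucleotide is a keyword. EMBOSS programs like wossname have many parameters. To be prompted for the optional parameters as well as the required ones, add -opt (short for -options); the -help option prints the full list of parameters: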

wossname -opt

EMBOSS: showdb is not showing any database

EMBOSS can access various biological databases automatically. However, this needs to be configured. First of all, type:

$ showdb

If you get nothing in the list of databases, this document is for you. First we locate the emboss.default.template file.

$ locate emboss.default.template

Then cd to that directory and create a copy named emboss.default.

$ cp emboss.default.template emboss.default

Open emboss.default and uncomment the databases you want by deleting the # symbol at the beginning of the relevant lines. Then run showdb again:

$ showdb

You should see your databases now.
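
Uncommenting can also be scripted. The sketch below is hedged: uncomment_db is a hypothetical helper, and it assumes that a commented-out entry starts with a line of the form "# DB name [" and ends on the first line that contains "]". Check your own template before relying on it, since a "]" inside a quoted comment would end the entry early.

```shell
#!/bin/sh
# Sketch: print a config file with one commented-out DB entry uncommented.
# uncomment_db is a hypothetical helper, not an EMBOSS tool.
uncomment_db() {
    # $1 = database name, $2 = file to read
    awk -v db="$1" '
        # First line of the commented-out entry for the requested database
        $0 ~ "^#[ \t]*DB[ \t]+" db "[ \t]*\\[" { inside = 1 }
        inside {
            line = $0
            sub(/^#[ \t]?/, "", line)      # drop "#" and at most one blank
            print line
            if (line ~ /\]/) inside = 0    # "]" is assumed to close the entry
            next
        }
        { print }                          # every other line is unchanged
    ' "$2"
}
```

For example, uncomment_db swissprot emboss.default.template > emboss.default would write a copy with only the swissprot entry enabled, leaving the template itself untouched.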

Indexing Local Database with dbiflat

dbiflat indexes a flat file database of one or more files and builds EMBL CD-ROM format index files. Major databases such as EMBL, Swiss-Prot and TrEMBL distribute unindexed flat file versions, and dbiflat can index them. The benefit of indexed flat files is that we can offer services built on data from major bioinformatics databases without connecting to those databases for every query. The alternative is to install and configure SRS or MRS.

Go to your data directory inside EMBOSS.

$ cd /usr/local/share/EMBOSS/data

Create a directory called swissprot.

$ mkdir swissprot

Change into this directory (otherwise you would need to type the full path at the dbiflat prompts).

$ cd swissprot

Before we start, we need to download a flat file database. UniProt offers database downloads at http://www.uniprot.org/downloads. Download the text version of UniProtKB/Swiss-Prot.

$ wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz

This downloads a file of more than 300 MB named uniprot_sprot.dat.gz. Decompress it.

$ gunzip uniprot_sprot.dat.gz 
$ ls
uniprot_sprot.dat
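
With a download of this size, a truncated transfer is easy to miss. A minimal sketch of a safer decompression step follows; unpack_keep is a hypothetical helper name. It tests the archive first and keeps the compressed copy, which plain gunzip would remove.

```shell
#!/bin/sh
# Sketch: verify a .gz file, then decompress it next to the original.
# unpack_keep is a hypothetical helper name.
unpack_keep() {
    gz="$1"
    out="${gz%.gz}"              # uniprot_sprot.dat.gz -> uniprot_sprot.dat
    gzip -t "$gz" || return 1    # integrity check; fails on a truncated file
    gunzip -c "$gz" > "$out"     # write the decompressed data, keep the .gz
}
```

Here it would be called as unpack_keep uniprot_sprot.dat.gz from inside the swissprot directory.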

Now that we have the flat file database, we index it with dbiflat:

$ dbiflat
Index a flat file database
Database name: swissprot
      EMBL : EMBL
     SWISS : Swiss-Prot, SpTrEMBL, TrEMBLnew
        GB : Genbank, DDBJ
    REFSEQ : Refseq
Entry format [SWISS]: 
Database directory [.]: 
Wildcard database filename [*.dat]: uniprot_sprot.dat
Release number [0.0]: 
Index date [00/00/00]: 16/01/09
General log output file [outfile.dbiflat]: 

Press Return at the empty prompts above to accept the default values. Once the indexing is complete, inspect the log file to verify the run.

[root@hunza swissprot]# cat outfile.dbiflat 
########################################
# Program: dbiflat
# Rundate: Fri 16 Jan 2009 13:34:06
# Dbname: swissprot
# Release: 0.0
# Date: 16/01/09
# CurrentDirectory: /usr/local/share/EMBOSS/data/swissprot/
# IndexDirectory: ./
# IndexDirectoryPath: /usr/local/share/EMBOSS/data/swissprot/
# Maxindex: 0
# Fields: 2
#   Field 1: id
#   Field 2: acc
# Directory: ./
# DirectoryPath: /usr/local/share/EMBOSS/data/swissprot/
# Filenames: uniprot_sprot.dat
# Exclude: 
# Files: 1
#   File 1: ./uniprot_sprot.dat
########################################
# Commandline: dbiflat
#    -dbname swissprot
#    -filenames uniprot_sprot.dat
#    -date 16/01/09
########################################

filename: 'uniprot_sprot.dat'
    id: 405506
   acc: 554076

Index acc: maxlen 6 items 539564

Total 1 files 405506 entries (0 duplicates)
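
The log can be cross-checked against the flat file itself, and the Commandline section of the log gives the qualifiers needed to repeat the run without prompts. The sketch below rests on two assumptions: every Swiss-Prot entry starts with a line beginning "ID" followed by three spaces, and the general EMBOSS qualifier -auto suppresses the prompts. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch: cross-check a dbiflat run. All function names are hypothetical.

# Count entries in a Swiss-Prot style flat file (one "ID   " line per entry).
count_entries() {
    grep -c '^ID   ' "$1"
}

# Extract the entry total from a dbiflat log line such as
# "Total 1 files 405506 entries (0 duplicates)".
logged_entries() {
    awk '/^Total/ { print $4 }' "$1"
}

# Repeat the indexing with the qualifiers recorded in the log.
# -auto (assumed here) turns off the interactive prompts.
reindex_swissprot() {
    dbiflat -dbname swissprot -filenames uniprot_sprot.dat \
            -date 16/01/09 -auto
}
```

If count_entries uniprot_sprot.dat and logged_entries outfile.dbiflat print the same number, every entry in the file was indexed.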

Our newly indexed database will not be visible to other EMBOSS programs until we edit the emboss.default file. Open this file for editing and look for the DB swissprot [...] block. If it exists, replace it with the following. If it doesn't exist, add the following.

##########################################################################
# SWISSPROT indexed with dbiflat
##########################################################################

# SWISSPROT: Set the directory to where the database is stored
# Assumed the dbiflat index files are in the same directory

 DB swissprot [
         type: P
         comment: "SWISSPROT sequences"
         method: emblcd
         format: swiss
         dbalias: swissprot
         dir: /usr/local/share/EMBOSS/data/swissprot/
         file: uniprot_sprot.dat
 ]
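
Adding the block can be scripted for the case where no swissprot entry exists yet. This is a sketch only: add_swissprot_db is a hypothetical helper, and it deliberately refuses to touch a file that already defines DB swissprot, leaving the replacement to be done by hand.

```shell
#!/bin/sh
# Sketch: append the swissprot definition to an EMBOSS configuration file.
# add_swissprot_db is a hypothetical helper, not an EMBOSS tool.
add_swissprot_db() {
    conf="$1"    # path to emboss.default
    dir="$2"     # directory with uniprot_sprot.dat and the index files

    # Do not add a second definition of the same database.
    if grep -q '^[[:space:]]*DB[[:space:]][[:space:]]*swissprot[[:space:]]*\[' "$conf" 2>/dev/null; then
        echo "DB swissprot is already defined in $conf" >&2
        return 1
    fi

    # Same fields as the block shown above; only dir comes from $2.
    cat >> "$conf" <<EOF

DB swissprot [
        type: P
        comment: "SWISSPROT sequences"
        method: emblcd
        format: swiss
        dbalias: swissprot
        dir: ${dir%/}/
        file: uniprot_sprot.dat
]
EOF
}
```

With the paths used in this document, the call would be add_swissprot_db emboss.default /usr/local/share/EMBOSS/data/swissprot, run from the directory that holds emboss.default.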

Save the emboss.default file and try the following example to verify that your database is working.

$ seqret
Reads and writes (returns) sequences
Input (gapped) sequence(s): swissprot:p11217
output sequence(s) [pygm_human.fasta]: 
$ ls
pygm_human.fasta  
$ cat pygm_human.fasta 
>PYGM_HUMAN P11217 RecName: Full=Glycogen phosphorylase, muscle form; EC=2.4.1.1; AltName: Full=Myophosphorylase;
MSRPLSDQEKRKQISVRGLAGVENVTELKKNFNRHLHFTLVKDRNVATPRDYYFALAHTV
RDHLVGRWIRTQQHYYEKDPKRIYYLSLEFYMGRTLQNTMVNLALENACDEATYQLGLDM
EELEEIEEDAGLGNGGLGRLAACFLDSMATLGLAAYGYGIRYEFGIFNQKISGGWQMEEA
DDWLRYGNPWEKARPEFTLPVHFYGHVEHTSQGAKWVDTQVVLAMPYDTPVPGYRNNVVN
TMRLWSAKAPNDFNLKDFNVGGYIQAVLDRNLAENISRVLYPNDNFFEGKELRLKQEYFV
VAATLQDIIRRFKSSKFGCRDPVRTNFDAFPDKVAIQLNDTHPSLAIPELMRILVDLERM
DWDKAWDVTVRTCAYTNHTVLPEALERWPVHLLETLLPRHLQIIYEINQRFLNRVAAAFP
GDVDRLRRMSLVEEGAVKRINMAHLCIAGSHAVNGVARIHSEILKKTIFKDFYELEPHKF
QNKTNGITPRRWLVLCNPGLAEVIAERIGEDFISDLDQLRKLLSFVDDEAFIRDVAKVKQ
ENKLKFAAYLEREYKVHINPNSLFDIQVKRIHEYKRQLLNCLHVITLYNRIKREPNKFFV
PRTVMIGGKAAPGYHMAKMIIRLVTAIGDVVNHDPAVGDRLRVIFLENYRVSLAEKVIPA
ADLSEQISTAGTEASGTGNMKFMLNGALTIGTMDGANVEMAEEAGEENFFIFGMRVEDVD
KLDQRGYNAQEYYDRIPELRQVIEQLSSGFFSPKQPDLFKDIVNMLMHHDRFKVFADYED
YIKCQEKVSALYKNPREWTRMVIRNIATSGKFSSDRTIAQYAREIWGVEPSRQRLPAPDE
AI

seqret was used to query the swissprot database for the protein with accession P11217. The entry was retrieved and saved to the file pygm_human.fasta.
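
The same retrieval can be run without prompts, which is useful in scripts. The sketch assumes the standard seqret qualifiers -sequence and -outseq and the general -auto qualifier; fasta_length is a hypothetical helper that sums the residues in the result as a quick sanity check.

```shell
#!/bin/sh
# Sketch: non-interactive retrieval plus a simple check of the output.

# Total number of residues in a FASTA file (header lines are skipped).
# fasta_length is a hypothetical helper; it assumes one sequence per file.
fasta_length() {
    awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) }
         END   { print n + 0 }' "$1"
}

# Run only where EMBOSS is installed and the swissprot database is configured.
if command -v seqret >/dev/null 2>&1; then
    seqret -sequence swissprot:p11217 -outseq pygm_human.fasta -auto
    fasta_length pygm_human.fasta
fi
```

An output of 0 from fasta_length would indicate that the file holds a header only, or nothing at all.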