Indexing a Local Database with dbiflat

dbiflat indexes a flat file database of one or more files and builds EMBL CD-ROM format index files. Major databases such as EMBL, Swiss-Prot and TrEMBL are distributed as unindexed flat files, and dbiflat builds the indexes that let EMBOSS programs look up entries in them. The benefit of using indexed flat files is that we can offer services built on data from major bioinformatics databases without having to connect to the remote databases for every query. The alternative is to install and configure SRS or MRS.

Go to your data directory inside EMBOSS.

$ cd /usr/local/share/EMBOSS/data

Create a directory called swissprot.

$ mkdir swissprot

Change into this directory so that the later commands can use relative file names instead of the full path.

$ cd swissprot

Before we start, we need to download a flat file database. UniProt offers database downloads at http://www.uniprot.org/downloads. Download the text (flat file) version of UniProtKB/Swiss-Prot.

$ wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz

This downloads a file of more than 300 MB named uniprot_sprot.dat.gz. Unzip this file.

$ gunzip uniprot_sprot.dat.gz 
$ ls
uniprot_sprot.dat
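
As an optional sanity check before indexing, we can count the entries in the flat file. This is a minimal sketch that assumes every Swiss-Prot entry starts with a line beginning with "ID" followed by three spaces; count_entries is a hypothetical helper name, not an EMBOSS program.

```shell
# count_entries FILE: print the number of entries in a Swiss-Prot style flat file.
# Assumption: each entry begins with a line that starts with "ID   ".
count_entries() {
    grep -c '^ID   ' "$1"
}

# Example, once the file has been downloaded and unzipped:
# count_entries uniprot_sprot.dat
```

The number printed can later be compared with the entry total that dbiflat reports in its log file.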

Now that we have acquired the flat file database, we index it with dbiflat:

$ dbiflat
Index a flat file database
Database name: swissprot
      EMBL : EMBL
     SWISS : Swiss-Prot, SpTrEMBL, TrEMBLnew
        GB : Genbank, DDBJ
    REFSEQ : Refseq
Entry format [SWISS]: 
Database directory [.]: 
Wildcard database filename [*.dat]: uniprot_sprot.dat
Release number [0.0]: 
Index date [00/00/00]: 16/01/09
General log output file [outfile.dbiflat]: 
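
The interactive session can also be run without prompts, which is convenient for scripts. The sketch below reuses the command line that dbiflat itself records in its log file (-dbname, -filenames, -date) and adds -auto, the general EMBOSS qualifier that turns off prompting and accepts the defaults. The wrapper function index_swissprot and its PATH check are illustrative only.

```shell
# index_swissprot: run dbiflat non-interactively in the current directory.
# The qualifiers repeat the answers typed at the prompts; -auto accepts
# the default value for every prompt that was left empty.
index_swissprot() {
    if ! command -v dbiflat >/dev/null 2>&1; then
        echo "dbiflat not found in PATH" >&2
        return 127
    fi
    dbiflat -dbname swissprot -filenames uniprot_sprot.dat -date 16/01/09 -auto
}

# Example, run from inside the swissprot directory:
# index_swissprot
```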

In the dbiflat session, press Return at the prompts shown without a value to accept the defaults. Once the indexing is complete, check the log file to verify the run:

[root@hunza swissprot]# cat outfile.dbiflat 
########################################
# Program: dbiflat
# Rundate: Fri 16 Jan 2009 13:34:06
# Dbname: swissprot
# Release: 0.0
# Date: 16/01/09
# CurrentDirectory: /usr/local/share/EMBOSS/data/swissprot/
# IndexDirectory: ./
# IndexDirectoryPath: /usr/local/share/EMBOSS/data/swissprot/
# Maxindex: 0
# Fields: 2
#   Field 1: id
#   Field 2: acc
# Directory: ./
# DirectoryPath: /usr/local/share/EMBOSS/data/swissprot/
# Filenames: uniprot_sprot.dat
# Exclude: 
# Files: 1
#   File 1: ./uniprot_sprot.dat
########################################
# Commandline: dbiflat
#    -dbname swissprot
#    -filenames uniprot_sprot.dat
#    -date 16/01/09
########################################

filename: 'uniprot_sprot.dat'
    id: 405506
   acc: 554076

Index acc: maxlen 6 items 539564

Total 1 files 405506 entries (0 duplicates)
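
For scripted checks, the totals can be pulled out of the log file instead of reading it by eye. This is a sketch that assumes the summary line always has the form "Total N files M entries (D duplicates)", as in the log above; logged_entries and logged_duplicates are hypothetical helper names.

```shell
# logged_entries LOGFILE: print the entry count from the "Total ..." line.
logged_entries() {
    awk '$1 == "Total" && $5 == "entries" { print $4 }' "$1"
}

# logged_duplicates LOGFILE: print the duplicate count from the same line.
logged_duplicates() {
    awk '$1 == "Total" && $5 == "entries" { gsub(/[()]/, "", $6); print $6 }' "$1"
}

# Example:
# logged_entries outfile.dbiflat
# logged_duplicates outfile.dbiflat
```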

Our new indexed database will not be visible to other EMBOSS programs until we define it in the emboss.default file. Open this file for editing and look for the DB swissprot [...] block. If it exists, replace it with the following. If it doesn't exist, add the following.

##########################################################################
# SWISSPROT indexed with dbiflat
##########################################################################

# SWISSPROT: Set the directory to where the database is stored
# Assumed the dbiflat index files are in the same directory

 DB swissprot [
         type: P
         comment: "SWISSPROT sequences"
         method: emblcd
         format: swiss
         dbalias: swissprot
         dir: /usr/local/share/EMBOSS/data/swissprot/
         file: uniprot_sprot.dat
 ]
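
After editing, a quick grep can confirm that the definition is present. This is a minimal sketch; has_db_block is a hypothetical helper, and the location of emboss.default used in the example is an assumption that depends on how EMBOSS was installed.

```shell
# has_db_block FILE NAME: succeed if FILE contains a line opening a
# "DB NAME [" definition (leading whitespace is allowed).
has_db_block() {
    grep -Eq "^[[:space:]]*DB[[:space:]]+$2[[:space:]]*\[" "$1"
}

# Example (the path below is an assumed install location):
# has_db_block /usr/local/share/EMBOSS/emboss.default swissprot && echo "swissprot is defined"
```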

Save the emboss.default file and try the following example to verify that your database is working.

$ seqret
Reads and writes (returns) sequences
Input (gapped) sequence(s): swissprot:p11217
output sequence(s) [pygm_human.fasta]: 
$ ls
pygm_human.fasta  
$ cat pygm_human.fasta 
>PYGM_HUMAN P11217 RecName: Full=Glycogen phosphorylase, muscle form; EC=2.4.1.1; AltName: Full=Myophosphorylase;
MSRPLSDQEKRKQISVRGLAGVENVTELKKNFNRHLHFTLVKDRNVATPRDYYFALAHTV
RDHLVGRWIRTQQHYYEKDPKRIYYLSLEFYMGRTLQNTMVNLALENACDEATYQLGLDM
EELEEIEEDAGLGNGGLGRLAACFLDSMATLGLAAYGYGIRYEFGIFNQKISGGWQMEEA
DDWLRYGNPWEKARPEFTLPVHFYGHVEHTSQGAKWVDTQVVLAMPYDTPVPGYRNNVVN
TMRLWSAKAPNDFNLKDFNVGGYIQAVLDRNLAENISRVLYPNDNFFEGKELRLKQEYFV
VAATLQDIIRRFKSSKFGCRDPVRTNFDAFPDKVAIQLNDTHPSLAIPELMRILVDLERM
DWDKAWDVTVRTCAYTNHTVLPEALERWPVHLLETLLPRHLQIIYEINQRFLNRVAAAFP
GDVDRLRRMSLVEEGAVKRINMAHLCIAGSHAVNGVARIHSEILKKTIFKDFYELEPHKF
QNKTNGITPRRWLVLCNPGLAEVIAERIGEDFISDLDQLRKLLSFVDDEAFIRDVAKVKQ
ENKLKFAAYLEREYKVHINPNSLFDIQVKRIHEYKRQLLNCLHVITLYNRIKREPNKFFV
PRTVMIGGKAAPGYHMAKMIIRLVTAIGDVVNHDPAVGDRLRVIFLENYRVSLAEKVIPA
ADLSEQISTAGTEASGTGNMKFMLNGALTIGTMDGANVEMAEEAGEENFFIFGMRVEDVD
KLDQRGYNAQEYYDRIPELRQVIEQLSSGFFSPKQPDLFKDIVNMLMHHDRFKVFADYED
YIKCQEKVSALYKNPREWTRMVIRNIATSGKFSSDRTIAQYAREIWGVEPSRQRLPAPDE
AI

seqret was used to query the swissprot database for the protein with accession number P11217. The entry was retrieved through the new index and saved to the file pygm_human.fasta.
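
The same retrieval can be scripted, and the result checked by counting residues in the output. This is a sketch: fasta_length is a hypothetical helper, and the commented seqret line assumes the standard -sequence, -outseq and -auto qualifiers.

```shell
# fasta_length FILE: print the total number of residues in a FASTA file,
# ignoring header lines (those starting with ">") and all whitespace.
fasta_length() {
    grep -v '^>' "$1" | tr -d '[:space:]' | wc -c | tr -d ' '
}

# Example, non-interactive retrieval followed by a length check:
# seqret -sequence swissprot:p11217 -outseq pygm_human.fasta -auto
# fasta_length pygm_human.fasta
```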

dbiflat data file size limit

dbiflat cannot handle data files larger than 2 GB. So you can index Swiss-Prot this way, but not TrEMBL.
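
A file can be checked against this limit before indexing. The sketch below assumes that "2 Gig" means 2^31 = 2147483648 bytes; fits_index_limit is a hypothetical helper, and the limit can be overridden with a second argument.

```shell
# fits_index_limit FILE [LIMIT_BYTES]: succeed if FILE is smaller than the limit.
# Assumption: the default limit is 2^31 bytes.
fits_index_limit() {
    limit=${2:-2147483648}
    size=$(wc -c < "$1" | tr -d ' ')
    [ "$size" -lt "$limit" ]
}

# Example:
# fits_index_limit uniprot_sprot.dat && echo "small enough for dbiflat"
```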