There are two major protein sequence resources:
- UniProt = Swiss-Prot + TrEMBL + PIR
- NCBI-nr = Swiss-Prot + GenPept + PIR + RefSeq + PDB + PRF
In addition, there are several specialized protein databases.
UniProt
UniProt is a central resource for protein sequence and function. The UniProt consortium (since 2003) consists of EMBL, SIB, and PIR. PIR is no longer updated and now functions only as an archive. UniProt itself is divided into several components.
UniProtKB/TrEMBL
UniProtKB/TrEMBL contains computer-annotated protein sequences. TrEMBL entries are produced by translating the coding sequences (CDS) of nucleic acid entries in EMBL using computer tools. In addition, it includes data from PIR. The quality of TrEMBL suffers because many submitted nucleotide entries have poorly annotated CDS.
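The translation step described above can be illustrated with a minimal sketch. It assumes the standard genetic code (NCBI translation table 1) and a CDS that starts in frame; the function name translate_cds and the example sequence are made up for illustration and are not part of the actual TrEMBL pipeline.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1).
# Codons are enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate an in-frame CDS, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(0, len(cds) - 2, 3):
        # Codons containing ambiguous bases are reported as X.
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical 12-nucleotide CDS: ATG GCC AAA TAA
print(translate_cds("ATGGCCAAATAA"))  # prints MAK
```

Real nucleotide entries may specify a different translation table or reading-frame offset for a CDS, which this sketch does not handle.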
TrEMBL also serves as a platform for improving automated annotation tools. A TrEMBL entry is created after applying annotation tools such as SignalP, TMHMM, and REP. Evidence tags are then added to any part of a TrEMBL entry not derived from the original EMBL entry.
UniProtKB/Swiss-Prot
UniProtKB/Swiss-Prot contains manually annotated protein sequences. Swiss-Prot entries are produced by manually annotating TrEMBL entries. Before a Swiss-Prot entry is created, the sequence is checked and analyzed, and the data is cross-checked against the literature and external scientific expertise. Once an entry is moved to Swiss-Prot, it is deleted from TrEMBL. Data in Swiss-Prot does not migrate back to TrEMBL. Together, Swiss-Prot and TrEMBL provide all known protein sequences in the public domain.
The goals of Swiss-Prot are:
- Non-redundant: one entry per gene per species
- Maximum manual annotation: maximum annotation of protein diversity
- Maximum links to other databases
A Swiss-Prot entry contains:
- ID and accession number
- names and taxonomy
- references
- comments
- cross-references
- keywords
- features
- sequence
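In the flat-file format, each of the parts listed above is marked by a two-letter line code (for example ID, AC, DE, OS, RN, CC, DR, KW, FT, SQ), and an entry ends with a // line. The following is a minimal sketch of grouping an entry's lines by code. The sample entry is entirely made up for illustration (EXMP_HUMAN and X00001 are not real identifiers), and parse_entry is a hypothetical helper, not an official parser.

```python
from collections import defaultdict

# Made-up entry for illustration only; not a real Swiss-Prot record.
SAMPLE_ENTRY = """\
ID   EXMP_HUMAN              Reviewed;          12 AA.
AC   X00001;
DE   RecName: Full=Example protein;
OS   Homo sapiens (Human).
RN   [1]
CC   -!- FUNCTION: Hypothetical function.
KW   Example keyword.
FT   CHAIN           1..12
SQ   SEQUENCE   12 AA;
     MKTAYIAKQR QI
//
"""


def parse_entry(text):
    """Group the lines of one flat-file entry by their two-letter code."""
    fields = defaultdict(list)
    sequence = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-entry marker
            break
        if in_sequence:
            # Sequence lines carry residues in space-separated blocks.
            sequence.append(line.replace(" ", ""))
            continue
        code, data = line[:2], line[5:].strip()
        fields[code].append(data)
        if code == "SQ":  # residues start on the next line
            in_sequence = True
    return dict(fields), "".join(sequence)


fields, sequence = parse_entry(SAMPLE_ENTRY)
print(fields["AC"])  # accession line(s)
print(sequence)      # residues with spacing removed
```

For real work, an established parser is a safer choice than this sketch, since actual entries contain continuation lines and many more line types.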
UniRef
One UniRef100 entry contains all identical sequences, including fragments.
One UniRef90 entry contains sequences that have at least 90% identity.
One UniRef50 entry contains sequences that have at least 50% identity.
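The effect of the three identity thresholds can be illustrated with a toy sketch. It assumes equal-length, already aligned sequences and uses a simple greedy scheme in which the first sequence of each cluster acts as its representative. This is only an illustration of thresholding, not the clustering procedure actually used to build UniRef; the sequences and function names are made up.

```python
def percent_identity(a, b):
    """Percent of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def greedy_cluster(seqs, threshold):
    """Assign each sequence to the first cluster whose representative
    it matches at or above the identity threshold (in percent)."""
    clusters = []  # list of (representative, members)
    for seq in seqs:
        for representative, members in clusters:
            if percent_identity(representative, seq) >= threshold:
                members.append(seq)
                break
        else:
            # No existing cluster is similar enough: start a new one.
            clusters.append((seq, [seq]))
    return clusters


# Made-up sequences for illustration.
SEQS = [
    "MKTAYIAKQR",
    "MKTAYIAKQR",  # identical to the first
    "MKTAYIAKQK",  # 90% identical to the first
    "MKTAYLLLLL",  # 50% identical to the first
    "GGGGGGGGGG",  # unrelated
]

for threshold in (100, 90, 50):
    print(threshold, len(greedy_cluster(SEQS, threshold)))
```

Lower thresholds merge more sequences into fewer clusters, which is why UniRef50 is much smaller than UniRef100.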
UniParc
UniParc is an archive of raw protein sequences.
Sequences and information in UniProt are accessible via text search, BLAST similarity search, and FTP.
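Sequences obtained from UniProt are commonly handled in FASTA format, where the header line starts with sp for a Swiss-Prot entry or tr for a TrEMBL entry. The sketch below shows one way to split such a header. The header used here is made up for illustration, and parse_header is a hypothetical helper that covers only the leading database, accession, and entry-name fields.

```python
import re

# Matches headers of the form >db|ACCESSION|ENTRY_NAME description...
HEADER_RE = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+)\s*(?P<description>.*)$"
)


def parse_header(line):
    """Split a UniProtKB-style FASTA header into its leading fields."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a UniProtKB-style FASTA header: {line!r}")
    info = match.groupdict()
    # sp = Swiss-Prot (manually annotated), tr = TrEMBL (computer-annotated)
    info["reviewed"] = info["db"] == "sp"
    return info


# Made-up header for illustration only.
header = ">sp|X00001|EXMP_HUMAN Example protein OS=Homo sapiens OX=9606 PE=1 SV=1"
info = parse_header(header)
print(info["accession"], info["entry_name"], info["reviewed"])
```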