Protein Sequence Annotation in the Genome Era: The Annotation Concept of SWISS-PROT + TREMBL

Rolf Apweiler, Alain Gateau, Sergio Contrino, Maria Jesus Martin, Vivien Junker, Claire O'Donovan, Fiona Lang, Nicoletta Mitaritonna, Stephanie Kappus, and Amos Bairoch

SWISS-PROT is a curated protein sequence database that strives to provide a high level of annotation, a minimal level of redundancy, and a high level of integration with other databases. Ongoing genome sequencing projects have dramatically increased the number of protein sequences to be incorporated into SWISS-PROT. Since we do not want to dilute the quality standards of SWISS-PROT by incorporating sequences without proper sequence analysis and annotation, we cannot speed up the incorporation of new incoming data indefinitely. However, as we also want to make the sequences available as quickly as possible, we introduced TREMBL (TRanslation of EMBL nucleotide sequence database), a supplement to SWISS-PROT. TREMBL consists of computer-annotated entries in SWISS-PROT format derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except for CDS already included in SWISS-PROT. While TREMBL is already of immense value, its computer-generated annotation does not match the quality of SWISS-PROT's. The main difference is in the protein functional information attached to sequences. With this in mind, we are dedicating substantial effort to developing and applying computer methods to enhance the functional information attached to TREMBL entries.

