1) All Properties Database.
This database contains the sequences from all the other databases (2 to 29) in PPT-DB, which makes it possible to search all of them together and produce a consensus result. The details of each database are given below.
Database Example Format File:
>Sequence name to 25 characters; SwissProt ID; PDB ID
Query: PCRDALMQEYDDKWHQNGLVMDKWFILQATSPAANVLETVRGLLQHRSFTMSNPNRIRSL
B hair: ---------------------BBBBBBBBBTTTTTTTTBBBBBBBBB-------------
ASA: 612640053005603612100010030102041940162034018151031400000300
B turn: 111----------11113333------1111-1111---------11111111-------
Edg Cent: ---------------EEEE----EEE--------------EEEE----------------
1-S2: 11112-211111111111112211111111111111-224111111111122113211-1
Sec Struc: CCHHHHHHHHHHHHCCCHHHHHHHHHHHHHCCCCCHHHHHHHHHCCCCCCCCCHHHHHHH
% Struc: H=42.2; E=26.5; C=29.4
BF: 211100110010001111111110101101001001111110101110011000001011
Mem helix: -----------------TTTTTTTTTTTTTT-----------------------------
Mem barrel:--------------EEEEEEEEEEE--EEEEEEEEEEE----------------------
Coil: --------------------------------gabcdefgabcdefgabcdefgabcde--
S-S: AFLRTHVITGKIKVTATTNISDNSGCCLMLAINSGVRGKYSTDVYTICSQDSMTWNPGCK
SPdb: SSSSSSSSSSSSSSSSSSSS-----------------------------------------
SigPep: SSSSSSSSSSSSSSSSSSS------------------------------------------
CO: CO=28.3; Abs_CO=21.5
Fold rate: ln(k)=-6.91
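The labelled-track layout shown above can be read with a few lines of Python. The following is only a minimal sketch: the function name parse_all_properties is hypothetical, and it assumes that every line after the ">" header has the form "Label: value" with the label ending at the first colon (some labels, such as "Mem barrel", are followed directly by the track with no space).

```python
def parse_all_properties(text):
    """Split one All Properties record into its header and a dict of tracks.

    Assumption: every non-header line is "Label: value", and the label
    ends at the first colon on the line.
    """
    header = None
    tracks = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            continue
        label, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"line has no label: {line!r}")
        tracks[label.strip()] = value.strip()
    return header, tracks


if __name__ == "__main__":
    # Toy record, shortened for illustration only.
    demo = ">demo; id1; id2\nQuery: PCR\nMem barrel:-EE\nCO: CO=28.3; Abs_CO=21.5"
    print(parse_all_properties(demo))
```

Per-residue tracks (Query, ASA, BF and so on) come back as strings of equal length, while summary lines such as "% Struc" or "CO" come back as unparsed text.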
2) Cytoplasmic (water-soluble) Secondary Structure DB.
This database consists of a subset of non-redundant proteins in the PDB selected from the PISCES server [1] that exhibit less than 95% sequence identity and better than 3.0 Angstrom resolution (for X-ray structures). The data set includes both X-ray and NMR structures. The resulting set of proteins was further edited (manually) to remove proteins with transmembrane helices or transmembrane beta barrels. TMHMM [2], TMB-Hunt [3] and literature surveys were used in this culling process. The secondary structures were determined using VADAR [4].
References:
1. Wang G, Dunbrack RL Jr. PISCES: a protein sequence culling server. Bioinformatics. 2003 Aug 12;19(12):1589-91.
2. Krogh A, Larsson B, von Heijne G, Sonnhammer EL. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001 Jan 19;305(3):567-80.
3. Garrow AG, Agnew A, Westhead DR. TMB-Hunt: a web server to screen sequence sets for transmembrane beta-barrel proteins. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W188-92.
4. Willard L, Ranjan A, Zhang H, Monzavi H, Boyko RF, Sykes BD, Wishart DS.
VADAR: a web server for quantitative evaluation of protein structure quality.
Nucleic Acids Res. 2003 Jul 1;31(13):3316-9.
Database Example Format File:
>Sequence name to 25 characters; SwissProt ID; PDB ID
MEILPQCDFKLGGAPRALDNQAGTRKEALKHLIAVNMAQPGDSGG
CCHHHHHHHHCCCCEEEEEEECCHHHHHHHCCCHHHHHHHHCCCC
ANGLKCNMCSVLRIAGSSTHQNELANGAILVTQNGCLDIPANNAL
CEEECCCCCEEEEEECCCCCCHHHHHHHHHHHHHCCCCCCEEECC
INASSTGLK
CCHHHHCCC
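Records in this layout alternate a sequence line with an annotation line of the same length. A minimal Python sketch for reading them is given below; the function name parse_interleaved and the returned dictionary keys are hypothetical, and the sketch assumes that every record starts with a ">" header and that the body always contains an even number of lines.

```python
def parse_interleaved(text):
    """Parse records whose body alternates sequence and annotation lines."""
    blocks = []  # list of (header, body_lines)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            blocks.append((line[1:].strip(), []))
        elif blocks:
            blocks[-1][1].append(line)

    records = []
    for header, body in blocks:
        if len(body) % 2:
            raise ValueError(f"odd number of body lines in {header!r}")
        sequence = "".join(body[0::2])    # lines 1, 3, 5, ...
        annotation = "".join(body[1::2])  # lines 2, 4, 6, ...
        if len(sequence) != len(annotation):
            raise ValueError(f"length mismatch in {header!r}")
        fields = [f.strip() for f in header.split(";")]
        records.append(
            {"fields": fields, "sequence": sequence, "annotation": annotation}
        )
    return records


if __name__ == "__main__":
    # Toy record with placeholder identifiers.
    demo = ">demo; P00000; 1ABC\nMEIL\nCCHH\nPQ\nHC\n"
    print(parse_interleaved(demo))
```

The same reader should apply to the other databases below that use the two-line (sequence, annotation) layout, since only the annotation alphabet changes.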
3) EVA Test Set Secondary Structure DB
This database contains sequences along with their secondary structure assignments from the EVA [2] test set.
This is a set of non-redundant, water-soluble, sequence-unique (<35% identity) proteins derived from the PDB.
The secondary structure assignments were generated using VADAR [3]. The data set is also available at the Proteus [1] website.
References:
1. Montgomerie S, Sundararaj S, Gallin WJ, Wishart DS. Improving the accuracy of protein secondary structure prediction using structural alignment. BMC Bioinformatics. 2006 Jun 14;7:301.
2. Rost B, Eyrich VA. EVA: large-scale analysis of secondary structure prediction.
Proteins. 2001;Suppl 5:192-9.
3. Willard L, Ranjan A, Zhang H, Monzavi H, Boyko RF, Sykes BD, Wishart DS.
VADAR: a web server for quantitative evaluation of protein structure quality.
Nucleic Acids Res. 2003 Jul 1;31(13):3316-9.
Database Example Format File:
>Sequence name to 25 characters; SwissProt ID; PDB ID
MEILPQCDFKLGGAPRALDNQAGTRKEALKHLIAVNMAQPGDSGG
CCHHHHHHHHCCCCEEEEEEECCHHHHHHHCCCHHHHHHHHCCCC
ANGLKCNMCSVLRIAGSSTHQNELANGAILVTQNGCLDIPANNAL
CEEECCCCCEEEEEECCCCCCHHHHHHHHHHHHHCCCCCCEEECC
INASSTGLK
CCHHHHCCC
4) Transmembrane Helix Secondary Structure DB
This database contains the sequences and transmembrane helical assignments for proteins of known 3D structure. Candidate membrane proteins were identified from several sources including TMPDB [1], manual analysis of the PDB [2], and screening of the PDB with TMHMM [4]. Transmembrane alpha-helical regions were manually annotated by members of Dr. David Wishart's lab using secondary structures identified by DSSP [3] and VADAR [5] as well as descriptions contained in the original literature.
References:
1. Ikeda M, Arai M, Okuno T, Shimizu T. TMPDB: a database of experimentally-characterized transmembrane topologies. Nucleic Acids Res. 2003 Jan 1;31(1):406-9.
2. Berman H, Henrick K, Nakamura H, Markley JL. The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Nucleic Acids Res. 2007 Jan;35(Database issue):D301-3.
3. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983 Dec;22(12):2577-637.
4. Krogh A, Larsson B, von Heijne G, Sonnhammer EL. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001 Jan 19;305(3):567-80.
5. Willard L, Ranjan A, Zhang H, Monzavi H, Boyko RF, Sykes BD, Wishart DS.
VADAR: a web server for quantitative evaluation of protein structure quality.
Nucleic Acids Res. 2003 Jul 1;31(13):3316-9.
Database Example Format File:
>Sequence name to 25 characters; SwissProt ID; PDB ID
MEILPQLMIVIAGLMIVPGYWLVARKEALKHLIAVNMAQPGDSGG
-----HHHHHHHHHHHHHHHHHHHHH-------------------
ANGLKCNMCSVLRIIVGWYFYLAGITAAILVTQNGCLDIPANNAL
-------HHHHHHHHHHHHHHHHHHHH------------------
INASSTGLK
---------
5) TMH Benchmark Test Set DB
This database contains the sequences of selected membrane and non-membrane proteins along with their secondary structure assignments from the TMH-Benchmark [1] test set. The secondary structure assignments for proteins of known 3D structure (i.e. high resolution structures) were generated using VADAR [3] and DSSP [2] in combination with manual inspection of PDB files and PDB images. Membrane assignments for low resolution structures were manually generated using TMHMM [4] in combination with experimental data published in the literature.
References:
1. Kernytsky A, Rost B. Static benchmarking of membrane helix predictions. Nucleic Acids Res. 2003 Jul 1;31(13):3642-4.
2. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983 Dec;22(12):2577-637.
3. Willard L, Ranjan A, Zhang H, Monzavi H, Boyko RF, Sykes BD, Wishart DS.
VADAR: a web server for quantitative evaluation of protein structure quality.
Nucleic Acids Res. 2003 Jul 1;31(13):3316-9.
4. Krogh A, Larsson B, von Heijne G, Sonnhammer EL. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001 Jan 19;305(3):567-80.
Database Example Format File:
>Sequence name to 25 characters; SwissProt ID; PDB ID
MEILPQLMIVIAGLMIVPGYWLVARKEALKHLIAVNMAQPGDSGG
-----HHHHHHHHHHHHHHHHHHHHH-------------------
ANGLKCNMCSVLRIIVGWYFYLAGITAAILVTQNGCLDIPANNAL
-------HHHHHHHHHHHHHHHHHHHH------------------
INASSTGLK
---------
6) Transmembrane Barrel Secondary Structure DB
This database contains the sequences and transmembrane beta strand assignments for transmembrane barrel proteins of known 3D structure. Candidate membrane proteins were identified from several sources including published lists [1], manual analysis of the PDB [2], and screening of the PDB with TMB-Hunt [4]. Transmembrane beta strands were manually annotated by members of Dr. David Wishart's lab using secondary structures identified by DSSP [3] and VADAR [5] as well as descriptions contained in the original literature.
References:
1. Bagos PG, Liakopoulos TD, Hamodrakas SJ. Evaluation of methods for predicting the topology of beta-barrel outer membrane proteins and a consensus prediction method.
BMC Bioinformatics. 2005 Jan 12;6:7.
2. Berman H, Henrick K, Nakamura H, Markley JL. The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Nucleic Acids Res. 2007 Jan;35(Database issue):D301-3.
3. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983 Dec;22(12):2577-637.
4. Garrow AG, Agnew A, Westhead DR. TMB-Hunt: a web server to screen sequence sets for transmembrane beta-barrel proteins. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W188-92.
5. Willard L, Ranjan A, Zhang H, Monzavi H, Boyko RF, Sykes BD, Wishart DS.
VADAR: a web server for quantitative evaluation of protein structure quality.
Nucleic Acids Res. 2003 Jul 1;31(13):3316-9.
Database Example Format File:
>Sequence name to 25 characters; SwissProt ID; PDB ID
MEILPQCDFKLGGAPRALDNQAGTRKEALKHLIAVNMAQPGDSGG
-------EEEEEEEEEEEEE-------EEEEEEEEEEEEE-----
ANGLKCNMCSVLRIAGSSTHQNELANGAILVTQNGCLDIPANNAL
-----EEEEEEEEEEEE--------EEEEEEEEEEE-----EEEE
INASSTGLK
EEEEE----
7) Cytoplasmic (water-soluble) Secondary Structure % Content DB
This database consists of a subset of non-redundant proteins in the PDB selected from the PISCES server [1] that exhibit less than 95% sequence identity and better than 3.0 Angstrom resolution (for X-ray structures). The data set includes both X-ray and NMR structures. The resulting set of proteins was further edited (manually) to remove proteins with transmembrane helices or transmembrane beta barrels. TMHMM [2], TMB-Hunt [3] and literature surveys were used in this culling process. The secondary structure content (% helical residues, % beta strand residues and % coil residues) was determined, in part, using VADAR [4].
References:
1. Wang G, Dunbrack RL Jr. PISCES: a protein sequence culling server. Bioinformatics. 2003 Aug 12;19(12):1589-91.
2. Krogh A, Larsson B, von Heijne G, Sonnhammer EL. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001 Jan 19;305(3):567-80.
3. Garrow AG, Agnew A, Westhead DR. TMB-Hunt: a web server to screen sequence sets for transmembrane beta-barrel proteins. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W188-92.
4. Willard L, Ranjan A, Zhang H, Monzavi H, Boyko RF, Sykes BD, Wishart DS.
VADAR: a web server for quantitative evaluation of protein structure quality.
Nucleic Acids Res. 2003 Jul 1;31(13):3316-9.
Database Example Format File:
>Sequence name; SwissProt ID; PDB ID;H=26.3;E=23.6;C=50.1
MEILPQCDFKLGGAPRALDNQAGTRKEALKHLIAVNMAQPGDSGG
ANGLKCNMCSVLRIAGSSTHQNELANGAILVTQNGCLDIPANNAL
INASSTGLK
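The percentages in the header can be related to a per-residue H/E/C assignment as sketched below. This is an illustration only, not the procedure used to build the database: the function names ss_content and parse_content_header are hypothetical, and rounding to one decimal place is an assumption based on the example header.

```python
from collections import Counter


def ss_content(ss):
    """Return the percentage of H, E and C characters in an assignment string."""
    if not ss:
        raise ValueError("empty secondary structure string")
    counts = Counter(ss)
    return {k: round(100.0 * counts.get(k, 0) / len(ss), 1) for k in "HEC"}


def parse_content_header(header):
    """Extract the H=, E= and C= values from a header line."""
    content = {}
    for field in header.lstrip(">").split(";"):
        key, sep, value = field.strip().partition("=")
        if sep and key in ("H", "E", "C"):
            content[key] = float(value)
    return content


if __name__ == "__main__":
    print(ss_content("HHEC"))
    print(parse_content_header(
        ">Sequence name; SwissProt ID; PDB ID;H=26.3;E=23.6;C=50.1"
    ))
```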
8) Beta Turn DB
This database consists of a subset of non-redundant proteins in the PDB selected from the PISCES server [1] that exhibit less than 95% sequence identity and better than 3.0 Angstrom resolution (for X-ray structures). The data set includes both X-ray and NMR structures. The resulting set of proteins was further edited (manually) to remove proteins with transmembrane helices or transmembrane beta barrels. TMHMM [2], TMB-Hunt [3] and literature surveys were used in this culling process. The identity and position of the beta turns were determined by VADAR [4] using definitions suggested by Wilmot and Thornton [5].
References:
1. Wang G, Dunbrack RL Jr. PISCES: a protein sequence culling server. Bioinformatics. 2003 Aug 12;19(12):1589-91.
2. Krogh A, Larsson B, von Heijne G, Sonnhammer EL. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001 Jan 19;305(3):567-80.
3. Garrow AG, Agnew A, Westhead DR. TMB-Hunt: a web server to screen sequence sets for transmembrane beta-barrel proteins. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W188-92.
4. Willard L, Ranjan A, Zhang H, Monzavi H, Boyko RF, Sykes BD, Wishart DS.
VADAR: a web server for quantitative evaluation of protein structure quality.
Nucleic Acids Res. 2003 Jul 1;31(13):3316-9.
5. Wilmot CM, Thornton JM. Analysis and prediction of the different types of beta-turn in proteins. J Mol Biol. 1988 Sep 5;203(1):221-32.
Database Example Format File:
>Sequence name to 25 characters; SwissProt ID; PDB ID
MEILPQCDFKLGGAPRALDNQAGTRKEALKHLIAVNMAQPGDSGG
--------------1111---------2222--------iiii--
ANGLKCNMCSVLRIAGSSTHQNELANGAILVTQNGCLDIPANNAL
------3333------------4444-------!!!!--------
where
1111 = type I turn
2222 = type II turn
iiii = type I' turn
3333 = type III turn
%%%% = type III' turn
4444 = type IV turn
!!!! = type II' turn
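The symbol legend above can be turned into a lookup table for decoding an annotation line. The sketch below is one possible reading, with a hypothetical function name find_turns; it reports each run of identical turn symbols as one turn, so two adjacent turns of the same type would be merged into a single run.

```python
from itertools import groupby

# Symbol-to-type table taken from the legend above.
TURN_TYPES = {
    "1": "I",
    "2": "II",
    "i": "I'",
    "3": "III",
    "%": "III'",
    "4": "IV",
    "!": "II'",
}


def find_turns(annotation):
    """Return (start, end, type) for each run of turn symbols (1-based, inclusive)."""
    turns = []
    pos = 0
    for symbol, run in groupby(annotation):
        length = len(list(run))
        if symbol in TURN_TYPES:
            turns.append((pos + 1, pos + length, TURN_TYPES[symbol]))
        pos += length
    return turns


if __name__ == "__main__":
    print(find_turns("--1111--!!!!"))
```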
9) Coiled-coil Database
This database consists of the training sequences of the Paircoil2 [1] program. These sequences were submitted to Paircoil2 to identify coiled-coil folds. The coiled-coil regions are defined by the heptad repeats identified in each sequence. For more information, visit the website at http://groups.csail.mit.edu/cb/paircoil2/
References:
1. A.V. McDonnell, T. Jiang, A.E. Keating, B. Berger, "Paircoil2: Improved prediction of coiled coils from sequence", Bioinformatics Vol. 22(3) (2006).
2. B. Berger, D. B. Wilson, E. Wolf, T. Tonchev, M. Milla, and P. S. Kim, "Predicting Coiled Coils by Use of Pairwise Residue Correlations", Proceedings of the National Academy of Science USA, vol 92, aug 1995, pp. 8259-8263.
Database Example Format File:
>Sequence name to 25 characters; SwissProt ID; PDB ID; Other Ids (GI or EMBL or DDBJ: if available)
MKDRLAELLDLSKQYDPHEDIVTDHILESLYRDIRDIQDENQLLVA
-------------------------gabcdefgabcdefgabcdef
DVKRLGKQNARFLTSMRRLSSIKRDTNSIAKAFRARGEAAEAQHGP
gabcdefgabcdefgabcdef-------------------------
INASSTGLK
---------
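In this layout the heptad register (a to g) is written under each coiled-coil residue and "-" under everything else. A minimal sketch for pulling out the coiled-coil regions follows; the function name coiled_coil_regions is hypothetical, and it assumes the sequence and annotation lines of a record have already been concatenated.

```python
import re


def coiled_coil_regions(sequence, annotation):
    """Return (start, end, residues, register) for each heptad-annotated run.

    Positions are 1-based and inclusive.
    """
    if len(sequence) != len(annotation):
        raise ValueError("sequence and annotation differ in length")
    regions = []
    for m in re.finditer(r"[a-g]+", annotation):
        regions.append(
            (m.start() + 1, m.end(), sequence[m.start():m.end()], m.group())
        )
    return regions


if __name__ == "__main__":
    # Toy fragment for illustration.
    print(coiled_coil_regions("MKDRLAELLD", "---gabcdef"))
```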
10) Edge/Central Beta Strand DB
This database consists of a subset of small protein sequences with known 3D structure where the beta strands were identified as either edge (having only one side of the strand being hydrogen bonded) or central (having both sides of the strand hydrogen bonded). The identification of the strands and strand types was aided by VADAR [1] with final assignments being made through visual inspection of the 3D structure.
References:
1. Willard L, Ranjan A, Zhang H, Monzavi H, Boyko RF, Sykes BD, Wishart DS.
VADAR: a web server for quantitative evaluation of protein structure quality.
Nucleic Acids Res. 2003 Jul 1;31(13):3316-9.
Database Example Format File:
>Sequence name to 25 characters; SwissProt ID; PDB ID
MEILPQCDFKLGGAPRALDNQAGTRKEALKHLIAVNMAQPGDSGG
----EEEEEEE-----CCCCCCCC-------CCCCCCCCC-----
ANGLKCNMCSVLRIAGSSTHQNELANGAILVTQNGCLDIPANNAL
--------------------EEEEEEEE-----------EEEEEE
INASSTGLK
E--------
11) Beta Hairpin DB
This database consists of a subset of small protein sequences with known 3D structure where beta hairpins have been accurately identified. Hairpins consist of two antiparallel beta strands separated by 4 or fewer residues. The identification of the beta hairpins was aided by VADAR [1] with final assignments being made through visual inspection of the 3D structure.
References:
1. Willard L, Ranjan A, Zhang H, Monzavi H, Boyko RF, Sykes BD, Wishart DS.
VADAR: a web server for quantitative evaluation of protein structure quality.
Nucleic Acids Res. 2003 Jul 1;31(13):3316-9.
Database Example Format File:
>Sequence name to 25 characters; SwissProt ID; PDB ID
MEILPQCDFKLGGAPRALDNQAGTRKEALKHLIAVNMAQPGDSGG
----BBBBBBTTBBBBBBB--------------------------
ANGLKCNMCSVLRIAGSSTHQNELANGAILVTQNGCLDIPANNAL
--------------------BBBBBTTBBBB--------------
INASSTGLK
---------
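Given the B (strand) and T (turn) symbols used above, hairpins can be located as a strand run, a turn run and a second strand run. The sketch below uses a hypothetical function name find_hairpins and assumes a concatenated annotation string; because the matches are non-overlapping, two hairpins that share a strand would be reported as one hairpin followed by an unmatched remainder.

```python
import re


def find_hairpins(annotation):
    """Locate strand-turn-strand runs in a B/T/- annotation string."""
    hairpins = []
    for m in re.finditer(r"(B+)(T+)(B+)", annotation):
        hairpins.append(
            {
                "start": m.start() + 1,  # 1-based, inclusive
                "end": m.end(),
                "strand1_len": len(m.group(1)),
                "turn_len": len(m.group(2)),
                "strand2_len": len(m.group(3)),
            }
        )
    return hairpins


if __name__ == "__main__":
    print(find_hairpins("----BBBBBBTTBBBBBBB---"))
```

The turn_len field lets a caller check the "4 or fewer residues" rule stated in the description.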
12) Disulfide Bonds DB
This database consists of a subset of non-redundant proteins in the PDB selected from the PISCES server [1] that exhibit less than 95% sequence identity and better than 3.0 Angstrom resolution (for X-ray structures). The data set includes both X-ray and NMR structures. The resulting set of proteins was further edited (manually) to remove proteins with transmembrane helices or transmembrane beta barrels. TMHMM [2], TMB-Hunt [3] and literature surveys were used in this culling process. The disulfide pairing was determined using VADAR [4].
References:
1. Wang G, Dunbrack RL Jr. PISCES: a protein sequence culling server. Bioinformatics. 2003 Aug 12;19(12):1589-91.
2. Krogh A, Larsson B, von Heijne G, Sonnhammer EL. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001 Jan 19;305(3):567-80.
3. Garrow AG, Agnew A, Westhead DR. TMB-Hunt: a web server to screen sequence sets for transmembrane beta-barrel proteins. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W188-92.
4. Willard L, Ranjan A, Zhang H, Monzavi H, Boyko RF, Sykes BD, Wishart DS.
VADAR: a web server for quantitative evaluation of protein structure quality.
Nucleic Acids Res. 2003 Jul 1;31(13):3316-9.
Database Example Format File:
>Sequence name to 25 characters; SwissProt ID; PDB ID
MEILPQCDFKLGGAPRALDNQAGTRKEALKHLIAVNMAQPGDCGG
------1-----------------------------------2--
ANGLKCNMCSVLRIAGSSTHQNELANGAILVTQNGCLDIPCNNAL
-----1--2--------------------------3----3----
INASSTGLK
---------
Where 1 pairs with 1, 2 pairs with 2, 3 pairs with 3. If a Cys doesn't participate in a disulfide bond then no number is assigned to it.
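The pairing rule can be applied mechanically once the sequence and annotation lines of a record are concatenated, since partners may sit on different lines. The following is a sketch under that assumption; the function name disulfide_pairs is hypothetical, and it treats each digit as a bond label that must occur exactly twice, always under a Cys.

```python
from collections import defaultdict


def disulfide_pairs(sequence, annotation):
    """Map each bond label to the 1-based positions of its two cysteines."""
    if len(sequence) != len(annotation):
        raise ValueError("sequence and annotation differ in length")
    positions = defaultdict(list)
    for i, (residue, label) in enumerate(zip(sequence, annotation), start=1):
        if label.isdigit():
            if residue != "C":
                raise ValueError(f"label {label!r} at position {i} is not on a Cys")
            positions[label].append(i)
    pairs = {}
    for label, found in positions.items():
        if len(found) != 2:
            raise ValueError(f"label {label!r} occurs {len(found)} times, expected 2")
        pairs[label] = (found[0], found[1])
    return pairs


if __name__ == "__main__":
    # Toy fragment for illustration.
    print(disulfide_pairs("ACDCKC", "-1-1--"))
```

Single-digit labels limit this reading to at most ten bonds per record; how the database labels proteins with more bonds is not stated here.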
13) SPdb: A Signal Peptide (Eukaryote) DB
SPdb [1] is a signal peptide database containing signal/leader sequences of archaea, prokaryotes and eukaryotes. The database is currently at release 4.0 and contains 22542 entries, of which 2748 are experimentally verified signal sequences (obtained by filtering the data, followed by manual curation where the mature endogenous proteins are sequenced at their N-terminus) and 19794 are unverified signal sequences. All sequences were derived from the Swiss-Prot protein database (release 51.0; http://www.expasy.org/sprot), which is part of UniProt. The nucleotide sequences were obtained from the EMBL nucleotide database (release 88).
For more information visit the website at - http://proline.bic.nus.edu.sg/spdb/
References:
1. Choo KH, Tan TW, Ranganathan S. 2005. SPdb - a signal peptide database. BMC Bioinformatics 6:249.
Database Example Format File:
>Sequence name to 25 characters; SwissProt ID; PDB ID
MEILPQCDFKLGGAPRALDNQAGTRKEALKHLIAVNMAQPGDSGG
SSSSSSSSSSSSSSSSSS---------------------------
ANGLKCNMCSVLRIAGSSTHQNELANGAILVTQNGCLDIPANNAL
---------------------------------------------
INASSTGLK
---------
14) SPdb: A Signal Peptide (Gram +) DB
SPdb [1] is a signal peptide database containing signal/leader sequences of archaea, prokaryotes and eukaryotes. The database is currently at release 4.0 and contains 22542 entries, of which 2748 are experimentally verified signal sequences (obtained by filtering the data, followed by manual curation where the mature endogenous proteins are sequenced at their N-terminus) and 19794 are unverified signal sequences. All sequences were derived from the Swiss-Prot protein database (release 51.0; http://www.expasy.org/sprot), which is part of UniProt. The nucleotide sequences were obtained from the EMBL nucleotide database (release 88).
For more information visit the website at - http://proline.bic.nus.edu.sg/spdb/
References:
1. Choo KH, Tan TW, Ranganathan S. 2005. SPdb - a signal peptide database. BMC Bioinformatics 6:249.
Database Example Format File:
>Sequence name to 25 characters; SwissProt ID; PDB ID
MEILPQCDFKLGGAPRALDNQAGTRKEALKHLIAVNMAQPGDSGG
SSSSSSSSSSSSSSSSSS---------------------------
ANGLKCNMCSVLRIAGSSTHQNELANGAILVTQNGCLDIPANNAL
---------------------------------------------
INASSTGLK
---------
15) SPdb: A Signal Peptide (Gram -) DB
SPdb [1] is a signal peptide database containing signal/leader sequences of archaea, prokaryotes and eukaryotes. The database is currently at release 4.0 and contains 22542 entries, of which 2748 are experimentally verified signal sequences (obtained by filtering the data, followed by manual curation where the mature endogenous proteins are sequenced at their N-terminus) and 19794 are unverified signal sequences. All sequences were derived from the Swiss-Prot protein database (release 51.0; http://www.expasy.org/sprot), which is part of UniProt. The nucleotide sequences were obtained from the EMBL nucleotide database (release 88).
For more information visit the website at - http://proline.bic.nus.edu.sg/spdb/
References:
1. Choo KH, Tan TW, Ranganathan S. 2005. SPdb - a signal peptide database. BMC Bioinformatics 6:249.
Database Example Format File:
>Sequence name to 25 characters; SwissProt ID; PDB ID
MEILPQCDFKLGGAPRALDNQAGTRKEALKHLIAVNMAQPGDSGG
SSSSSSSSSSSSSSSSSS---------------------------
ANGLKCNMCSVLRIAGSSTHQNELANGAILVTQNGCLDIPANNAL
---------------------------------------------
INASSTGLK
---------
16) Signal Peptide (Eukaryote) DB
This database consists of sequences extracted from the SwissProt database [1] that have their signal peptides annotated and fully demarcated in the SwissProt data file. The signal peptide regions are marked below the sequence. The partitioning of sequences into eukaryotes, Gram-negative bacteria and Gram-positive bacteria was done using the taxonomy identifiers in the SwissProt data fields.
References:
1. O'Donovan C, Martin MJ, Gattiker A, Gasteiger E, Bairoch A, Apweiler R. High-quality protein knowledge resource: SWISS-PROT and TrEMBL. Brief Bioinform. 2002 Sep;3(3):275-84.
Database Example Format File:
>Sequence name to 25 characters; SwissProt ID; PDB ID
MEILPQCDFKLGGAPRALDNQAGTRKEALKHLIAVNMAQPGDSGG
SSSSSSSSSSSSSSSSSS---------------------------
ANGLKCNMCSVLRIAGSSTHQNELANGAILVTQNGCLDIPANNAL
---------------------------------------------
INASSTGLK
---------
17) Signal Peptide (Gram +) DB
This database consists of sequences extracted from the SwissProt database [1] that have their signal peptides annotated and fully demarcated in the SwissProt data file. The signal peptide regions are marked below the sequence. The partitioning of sequences into eukaryotes, Gram-negative bacteria and Gram-positive bacteria was done using the taxonomy identifiers in the SwissProt data fields.
References:
1. O'Donovan C, Martin MJ, Gattiker A, Gasteiger E, Bairoch A, Apweiler R. High-quality protein knowledge resource: SWISS-PROT and TrEMBL. Brief Bioinform. 2002 Sep;3(3):275-84.
Database Example Format File:
>Sequence name to 25 characters; SwissProt ID; PDB ID
MEILPQCDFKLGGAPRALDNQAGTRKEALKHLIAVNMAQPGDSGG
SSSSSSSSSSSSSSSSSS---------------------------
ANGLKCNMCSVLRIAGSSTHQNELANGAILVTQNGCLDIPANNAL
---------------------------------------------
INASSTGLK
---------
18) Signal Peptide (Gram -) DB
This database consists of sequences extracted from the SwissProt database [1] that have their signal peptides annotated and fully demarcated in the SwissProt data file. The signal peptide regions are marked below the sequence. The partitioning of sequences into eukaryotes, Gram-negative bacteria and Gram-positive bacteria was done using the taxonomy identifiers in the SwissProt data fields.
References:
1. O'Donovan C, Martin MJ, Gattiker A, Gasteiger E, Bairoch A, Apweiler R. High-quality protein knowledge resource: SWISS-PROT and TrEMBL. Brief Bioinform. 2002 Sep;3(3):275-84.
Database Example Format File:
>Sequence name to 25 characters; SwissProt ID; PDB ID
MEILPQCDFKLGGAPRALDNQAGTRKEALKHLIAVNMAQPGDSGG
SSSSSSSSSSSSSSSSSS---------------------------
ANGLKCNMCSVLRIAGSSTHQNELANGAILVTQNGCLDIPANNAL
---------------------------------------------
INASSTGLK
---------
19) Accessible Surface Area (%) DB
This database consists of a subset of non-redundant proteins in the PDB selected from the PISCES server [1] that exhibit less than 95% sequence identity and better than 3.0 Angstrom resolution (for X-ray structures). The data set includes both X-ray and NMR structures. The resulting set of proteins was further edited (manually) to remove proteins with transmembrane helices or transmembrane beta barrels. TMHMM [2], TMB-Hunt [3] and literature surveys were used in this culling process. The fractional ASA values were extracted from the PDB files using VADAR [4]. The first column is the residue #, the second is the sequence and the third is the value of interest.
References:
1. Wang G, Dunbrack RL Jr. PISCES: a protein sequence culling server. Bioinformatics. 2003 Aug 12;19(12):1589-91.
2. Krogh A, Larsson B, von Heijne G, Sonnhammer EL. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001 Jan 19;305(3):567-80.
3. Garrow AG, Agnew A, Westhead DR. TMB-Hunt: a web server to screen sequence sets for transmembrane beta-barrel proteins. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W188-92.
4. Willard L, Ranjan A, Zhang H, Monzavi H, Boyko RF, Sykes BD, Wishart DS.
VADAR: a web server for quantitative evaluation of protein structure quality.
Nucleic Acids Res. 2003 Jul 1;31(13):3316-9.
Database Example Format File:
>Sequence name; SwissProt ID; PDB ID
1 A 0.83 2 N 0.34 3 S 0.05 4 D 0.02 5 E 0.16 6 H 0.67 7 M 0.56 8 G 0.34 9 G 0.12 10 A 0.43
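The real-valued layout packs (residue number, residue, value) triplets onto one whitespace-separated line. A minimal sketch for unpacking such a line is shown below; the function name parse_triplets is hypothetical, and the sketch assumes that tokens are separated by whitespace only.

```python
def parse_triplets(line):
    """Split a data line into (residue number, residue, value) tuples."""
    tokens = line.split()
    if len(tokens) % 3:
        raise ValueError("token count is not a multiple of three")
    triplets = []
    for i in range(0, len(tokens), 3):
        triplets.append((int(tokens[i]), tokens[i + 1], float(tokens[i + 2])))
    return triplets


if __name__ == "__main__":
    print(parse_triplets("1 A 0.83 2 N 0.34 3 S 0.05"))
```

The same reader should work for the other real-valued databases (B factor, RMSF, order parameter), which use the same three-column layout.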
20) Accessible Surface Area (Integerized) DB
This database consists of a subset of non-redundant proteins in the PDB selected from the PISCES server [1] that exhibit less than 95% sequence identity and better than 3.0 Angstrom resolution (for X-ray structures). The data set includes both X-ray and NMR structures. The resulting set of proteins was further edited (manually) to remove proteins with transmembrane helices or transmembrane beta barrels. TMHMM [2], TMB-Hunt [3] and literature surveys were used in this culling process. The fractional accessible surface area (ASA) was determined using VADAR [4].
References:
1. Wang G, Dunbrack RL Jr. PISCES: a protein sequence culling server. Bioinformatics. 2003 Aug 12;19(12):1589-91.
2. Krogh A, Larsson B, von Heijne G, Sonnhammer EL. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001 Jan 19;305(3):567-80.
3. Garrow AG, Agnew A, Westhead DR. TMB-Hunt: a web server to screen sequence sets for transmembrane beta-barrel proteins. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W188-92.
4. Willard L, Ranjan A, Zhang H, Monzavi H, Boyko RF, Sykes BD, Wishart DS.
VADAR: a web server for quantitative evaluation of protein structure quality.
Nucleic Acids Res. 2003 Jul 1;31(13):3316-9.
Database Example Format File:
>Sequence name to 25 characters; SwissProt ID; PDB ID
MEILPQCDFKLGGAPRALDNQAGTRKEALKHLIAVNMAQPGDSGG
984328231023213987543211219384787432349999223
ANGLKCNMCSVLRIAGSSTHQNELANGAILVTQNGCLDIPANNAL
328321039485023942129383249857293842929112345
INASSTGLK
123923845
where
0=0-10% fractional ASA
1=10-20% fractional ASA
2=20-30% fractional ASA
etc. etc.
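The binning described by the legend can be sketched as follows. The function name asa_digits is hypothetical. Two details are assumptions rather than documented behaviour: a value exactly on a boundary (for example 0.10) goes to the upper bin, and values of 1.0 or more are clamped to 9 so that each residue maps to a single digit.

```python
def asa_digits(fractional_asa):
    """Convert fractional ASA values (0.0 to 1.0) into one digit per residue.

    0 covers 0-10%, 1 covers 10-20%, and so on up to 9.
    """
    digits = []
    for value in fractional_asa:
        if value < 0:
            raise ValueError("fractional ASA cannot be negative")
        digits.append(str(min(int(value * 10), 9)))
    return "".join(digits)


if __name__ == "__main__":
    print(asa_digits([0.83, 0.34, 0.05, 0.02, 0.16]))
```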
21) B Factor (Real value) DB
This database consists of a subset of non-redundant proteins in the PDB selected from the PISCES server [1] that exhibit less than 95% sequence identity and better than 3.0 Angstrom resolution (for X-ray structures). The data set includes both X-ray and NMR structures. The resulting set of proteins was further edited (manually) to remove proteins with transmembrane helices or transmembrane beta barrels. TMHMM [2], TMB-Hunt [3] and literature surveys were used in this culling process. The B-factors were extracted from the PDB files directly. The first column is residue #, second is sequence and third is the value of interest.
References:
1. Wang G, Dunbrack RL Jr. PISCES: a protein sequence culling server. Bioinformatics. 2003 Aug 12;19(12):1589-91.
2. Krogh A, Larsson B, von Heijne G, Sonnhammer EL. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001 Jan 19;305(3):567-80.
3. Garrow AG, Agnew A, Westhead DR. TMB-Hunt: a web server to screen sequence sets for transmembrane beta-barrel proteins. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W188-92.
Database Example Format File:
>Sequence name; SwissProt ID; PDB ID
1 A 12.8 2 N 14.5 3 S 15.7 4 D 23.9 5 E 21.2 6 H 20.9 7 M 15.6 8 G 11.3 9 G 9.1 10 A 11.6
22) B factor (Integerized Interval) DB
This database consists of a subset of non-redundant proteins in the PDB selected from the PISCES server [1] that exhibit less than 95% sequence identity and better than 3.0 Angstrom resolution (for X-ray structures). The data set includes both X-ray and NMR structures. The resulting set of proteins was further edited (manually) to remove proteins with transmembrane helices or transmembrane beta barrels. TMHMM [2], TMB-Hunt [3] and literature surveys were used in this culling process. The B-factors were extracted from the PDB files directly and integerized by dividing the B-factors by 10 and rounding to the nearest integer.
References:
1. Wang G, Dunbrack RL Jr. PISCES: a protein sequence culling server. Bioinformatics. 2003 Aug 12;19(12):1589-91.
2. Krogh A, Larsson B, von Heijne G, Sonnhammer EL. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001 Jan 19;305(3):567-80.
3. Garrow AG, Agnew A, Westhead DR. TMB-Hunt: a web server to screen sequence sets for transmembrane beta-barrel proteins. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W188-92.
Database Example Format File:
>Sequence name to 25 characters; SwissProt ID; PDB ID
MEILPQCDFKLGGAPRALDNQAGTRKEALKHLIAVNMAQPGDSGG
984328231023213987543211219384787432349999223
ANGLKCNMCSVLRIAGSSTHQNELANGAILVTQNGCLDIPANNAL
328321039485023942129383249857293842929112345
INASSTGLK
123923845
where
0=0-0.10 B factor
1=0.11-0.20 B factor
2=0.21-0.30 B factor
etc. etc.
23) RMSF (Real value) DB
This database consists of a subset of non-redundant proteins in the PDB selected from the PISCES server [1] that were solved by NMR and for which multiple chain models exist. The RMSF values were calculated from the PDB files using SuperPose [2]. The first column is the residue #, the second is the sequence and the third is the value of interest.
References:
1. Wang G, Dunbrack RL Jr. PISCES: a protein sequence culling server. Bioinformatics. 2003 Aug 12;19(12):1589-91.
2. Maiti R, Van Domselaar GH, Zhang H, Wishart DS. SuperPose: a simple server for sophisticated structural superposition. Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W590-4.
Database Example Format File:
>Sequence name; SwissProt ID; PDB ID
1 A 1.28 2 N 0.45 3 S 0.12 4 D 0.34 5 E 0.56 6 H 0.45 7 M 0.32 8 G 0.14 9 G 0.11 10 A 0.14
24) RMSF (Integerized Interval) DB
This database consists of a subset of non-redundant proteins in the PDB selected from the PISCES server [1] that were solved by NMR and for which multiple chain models exist. The RMSF values were calculated from the PDB files using SuperPose [2] and integerized by multiplying the RMSF values by 10 and rounding to the nearest integer.
References:
1. Wang G, Dunbrack RL Jr. PISCES: a protein sequence culling server. Bioinformatics. 2003 Aug 12;19(12):1589-91.
2. Maiti R, Van Domselaar GH, Zhang H, Wishart DS. SuperPose: a simple server for sophisticated structural superposition. Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W590-4.
Database Example Format File:
>Sequence name to 25 characters; SwissProt ID; PDB ID
MEILPQCDFKLGGAPRALDNQAGTRKEALKHLIAVNMAQPGDSGG
984328231023213987543211219384787432349999223
ANGLKCNMCSVLRIAGSSTHQNELANGAILVTQNGCLDIPANNAL
328321039485023942129383249857293842929112345
INASSTGLK
123923845
where
0=0-0.30 Angstroms RMSF
1=0.31-0.60 Angstroms RMSF
2=0.61-0.90 Angstroms RMSF
etc. etc.
25) Order Parameter (Calculated) DB
This database consists of a subset of non-redundant proteins in the PDB selected from the PISCES server [1] that exhibit less than 95% sequence identity and better than 3.0 Angstrom resolution (for X-ray structures). The data set includes both X-ray and NMR structures. The resulting set of proteins was further edited (manually) to remove proteins with transmembrane helices or transmembrane beta barrels. TMHMM [2], TMB-Hunt [3] and literature surveys were used in this culling process. The backbone (NH) order parameters (1-S2) were calculated using the method of Zhang and Bruschweiler [4]. The first column is the residue #, the second is the sequence and the third is the value of interest.
References:
1. Wang G, Dunbrack RL Jr. PISCES: a protein sequence culling server. Bioinformatics. 2003 Aug 12;19(12):1589-91.
2. Krogh A, Larsson B, von Heijne G, Sonnhammer EL. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001 Jan 19;305(3):567-80.
3. Garrow AG, Agnew A, Westhead DR. TMB-Hunt: a web server to screen sequence sets for transmembrane beta-barrel proteins. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W188-92.
4. Zhang F, Bruschweiler R. Contact model for the prediction of NMR N-H order parameters in globular proteins. J Am Chem Soc. 2002 Oct 30;124(43):12654-5.
Database Example Format File:
>Sequence name; SwissProt ID; PDB ID
1 A 0.83 2 N 0.34 3 S 0.05 4 D 0.02 5 E 0.16 6 H 0.67 7 M 0.56 8 G 0.34 9 G 0.12 10 A 0.43
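The three-field layout of the data line (residue number, one-letter code, value) can be parsed as sketched below in Python. The function name parse_triplets is hypothetical, and the sketch assumes the fields are separated by whitespace as in the example.

```python
def parse_triplets(line):
    """Parse '1 A 0.83 2 N 0.34 ...' into (number, residue, value) tuples."""
    tokens = line.split()
    if len(tokens) % 3 != 0:
        raise ValueError("expected groups of three fields per residue")
    return [
        (int(tokens[i]), tokens[i + 1], float(tokens[i + 2]))
        for i in range(0, len(tokens), 3)
    ]
```

Applied to the first three residues of the example, this returns [(1, 'A', 0.83), (2, 'N', 0.34), (3, 'S', 0.05)].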
26) Order Parameter (Integerized) DB
This database consists of a subset of non-redundant proteins in the PDB selected from the PISCES server [1] that exhibit less than 95% sequence identity and better than 3.0 Angstrom resolution (for X-ray structures). The data set includes both X-ray and NMR structures. The resulting set of proteins was further edited (manually) to remove proteins with transmembrane helices or transmembrane beta barrels. TMHMM [2], TMB-Hunt [3] and literature surveys were used in this culling process. The backbone (NH) order parameters (1-S2) were calculated using the method of Zhang and Bruschweiler [4] and integerized by multiplying the order parameters by 10 and rounding to the nearest integer.
References:
1. Wang G, Dunbrack RL Jr. PISCES: a protein sequence culling server. Bioinformatics. 2003 Aug 12;19(12):1589-91.
2. Krogh A, Larsson B, von Heijne G, Sonnhammer EL. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001 Jan 19;305(3):567-80.
3. Garrow AG, Agnew A, Westhead DR. TMB-Hunt: a web server to screen sequence sets for transmembrane beta-barrel proteins. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W188-92.
4. Zhang F, Bruschweiler R. Contact model for the prediction of NMR N-H order parameters in globular proteins. J Am Chem Soc. 2002 Oct 30;124(43):12654-5.
Database Example Format File:
>Sequence name to 25 characters; SwissProt ID; PDB ID
MEILPQCDFKLGGAPRALDNQAGTRKEALKHLIAVNMAQPGDSGG
984328231023213987543211219384787432349999223
ANGLKCNMCSVLRIAGSSTHQNELANGAILVTQNGCLDIPANNAL
328321039485023942129383249857293842929112345
INASSTGLK
123923845
where
0=0-0.10 Order Parameter
1=0.11-0.20 Order Parameter
2=0.21-0.30 Order Parameter
etc.
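The integerization step described for this database (multiply by 10, then round to the nearest integer) can be sketched in Python as follows. The function names are hypothetical. Two points are assumptions of this sketch: halves are rounded up, and a result of 10 is clamped to 9 so that every residue maps to a single character. Note that the interval legend above lists 0.1-wide bins whose edges do not coincide exactly with round-to-nearest; the sketch follows the prose description only.

```python
def integerize(value):
    """Multiply by 10 and round to the nearest integer (halves round up).

    ASSUMPTION: values that would give 10 are clamped to 9 so the
    result is always a single digit.
    """
    code = int(value * 10 + 0.5)
    return min(max(code, 0), 9)


def integerize_all(values):
    """Return the one-digit codes for a list of values as a string."""
    return "".join(str(integerize(v)) for v in values)
```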
27) Contact Order DB
This database contains relative and absolute contact orders calculated for 15426 structures obtained from the PDB, where the contact order has been calculated using the method of Plaxco et al. [1], and the parameters for calculation are based on Ivankov et al. [2].
References:
1. Plaxco KW, Simons KT, Baker D. Contact order, transition state placement and the refolding rates of single domain proteins. J Mol Biol. 1998 Apr 10;277(4):985-94.
2. Ivankov DN, Garbuzynskiy SO, Alm E, Plaxco KW, Baker D, Finkelstein AV. Contact order revisited: influence of protein size on the folding rate. Protein Sci. 2003 Sep;12(9):2057-62.
Database Example Format File:
>Sequence name to 25 characters; SwissProt ID; PDB ID; CO=12.1; Abs_CO=21.7
MEILPQCDFKLGGAPRALDNQAGTRKEALKHLIAVNMAQPGDSGG
ANGLKCNMCSVLRIAGSSTHQNELANGAILVTQNGCLDIPANNAL
INASSTGLK
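In this format the contact orders are carried in the header line as key=value fields. A minimal Python sketch for separating the identifier fields from the numeric fields is given below; the function name parse_header is hypothetical, and the sketch assumes that fields are separated by semicolons and that every field containing "=" holds a number.

```python
def parse_header(header):
    """Split a '>' header into identifier fields and key=value numbers."""
    fields = [f.strip() for f in header.lstrip(">").split(";")]
    ids, values = [], {}
    for field in fields:
        key, sep, val = field.partition("=")
        if sep:                       # field of the form key=value
            values[key.strip()] = float(val)
        elif field:                   # plain identifier field
            ids.append(field)
    return ids, values
```

For a header ending in "CO=12.1; Abs_CO=21.7", the second return value is {'CO': 12.1, 'Abs_CO': 21.7}.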
28) Folding Rate DB
This database consists of a set of proteins for which the folding rate (Kf) has been experimentally determined. The Kf values have been converted to natural logs. The folding rate values were obtained from several sources [1,2,3].
References:
1. Gromiha MM, Thangakani AM, Selvaraj S. FOLD-RATE: prediction of protein folding rates from amino acid sequence. Nucleic Acids Res. 2006 Jul 1;34(Web Server issue):W70-4.
2. Ivankov DN, Garbuzynskiy SO, Alm E, Plaxco KW, Baker D, Finkelstein AV. Contact order revisited: influence of protein size on the folding rate. Protein Sci. 2003 Sep;12(9):2057-62.
3. Fulton KF, Bate MA, Faux NG, Mahmood K, Betts C, Buckle AM. Protein Folding Database (PFD 2.0): an online environment for the International Foldeomics Consortium. Nucleic Acids Res. 2007 Jan;35(Database issue):D304-7.
Database Example Format File:
>Sequence name; SwissProt ID; PDB ID; ln(k) = 25.1
MEILPQCDFKLGGAPRALDNQAGTRKEALKHLIAVNMAQPGDSGG
ANGLKCNMCSVLRIAGSSTHQNELANGAILVTQNGCLDIPANNAL
INASSTGLK
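The conversion between a folding rate and its natural log, and the extraction of the ln(k) field from a header line, can be sketched in Python as below. The function names are hypothetical, and the units of the underlying rate are not stated here, so the sketch treats the rate as a plain positive number.

```python
import math
import re


def ln_rate(kf):
    """Natural log of a folding rate, as stored in the header."""
    return math.log(kf)


def rate_from_ln(ln_k):
    """Recover the folding rate from its stored natural log."""
    return math.exp(ln_k)


def header_ln_k(header):
    """Extract the ln(k) value from a header line, or None if absent."""
    match = re.search(r"ln\(k\)\s*=\s*(-?\d+(?:\.\d+)?)", header)
    return float(match.group(1)) if match else None
```

As a check, a rate of 0.001 corresponds to ln(k) of about -6.91.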
29) 3D Folding Decoys DB
This database contains 3D decoy models of selected protein targets generated using either Rosetta [1] or assembled from other sources [2-7]. Each data set contains between 50 and 2000 decoy structures in standard PDB file format as well as information about the RMSD between the actual structure and each decoy.
Rosetta decoys. Rosetta decoys were generated locally using a default Rosetta protocol for creating folding decoys [1]. Rosetta decoy sets contain 300 models on average. Each protein in the Rosetta decoy set has a corresponding directory containing a PDB file of the native protein, the decoy models, and a {PDB ID}_rmsd.txt file listing the RMSD of each decoy with respect to the native structure.
Decoys-S decoys. The DECOY-S (DECOYang-Shakhnovich) set includes decoys for 5 proteins (PDB IDs: 1ENH, 1GJS, 1E0G, 1IGD, 1CLB). Decoys were generated by ab initio simulations in Prof. Shakhnovich's group [2]. Each protein in the decoy set is represented by 301 PDB files (1 native and 300 decoys) with RMSD from the native structure ranging from 0 to 20 Angstroms. A text file (rmsd.txt) with information about RMSD is included in every decoy directory.
Decoys-R-Us decoys. Decoys-R-Us decoys [3] include three types of decoys, "4state_reduced" [4], "fisa" [5] and "lattice_ssfit" [6,7], with decoys for 7, 4 and 8 proteins, respectively, and an average of 665, 1432 and 200 decoys per set, respectively. A text file (rmsds) with information about RMSD of decoys with respect to the native structure is included in every decoy directory.
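Since the three decoy sources name their RMSD file differently, a small Python helper can pick the expected file for a given decoy directory. This is only a sketch: the function name, the source labels ("rosetta", "decoys-s", "decoys-r-us") and the idea of passing a directory path are assumptions made for illustration, and the internal layout of the RMSD files is not described here, so no parsing is attempted.

```python
from pathlib import Path


def rmsd_file(decoy_dir, source, pdb_id=None):
    """Return the expected RMSD file path for one decoy directory.

    File names follow the descriptions above; the `source` labels
    are hypothetical.
    """
    decoy_dir = Path(decoy_dir)
    if source == "rosetta":
        if not pdb_id:
            raise ValueError("Rosetta sets need the PDB ID for the file name")
        return decoy_dir / f"{pdb_id}_rmsd.txt"
    if source == "decoys-s":
        return decoy_dir / "rmsd.txt"
    if source == "decoys-r-us":
        return decoy_dir / "rmsds"
    raise ValueError(f"unknown decoy source: {source}")
```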
References:
1. Rohl CA, Strauss CE, Misura KM, Baker D. Protein structure prediction using Rosetta. Methods Enzymol. 2004;383:66-93.
2. Yang JS, Chen WW, Skolnick J, Shakhnovich EI. All-atom ab initio folding of a diverse set of proteins. Structure. 2007 Jan;15(1):53-63.
3. Samudrala R, Levitt M. Decoys 'R' Us: a database of incorrect conformations to improve protein structure prediction. Protein Sci. 2000 Jul;9(7):1399-401.
4. Park B, Levitt M. Energy functions that discriminate X-ray and near native folds from well-constructed decoys. J Mol Biol. 1996 May 3;258(2):367-92.
5. Simons KT, Kooperberg C, Huang E, Baker D. Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol. 1997 Apr 25;268(1):209-25.
6. Samudrala R, Xia Y, Levitt M, Huang ES. A combined approach for ab initio construction of low resolution protein tertiary structures from sequence. Pac Symp Biocomput. 1999:505-16.
7. Xia Y, Huang ES, Levitt M, Samudrala R. Ab initio construction of protein tertiary structures using a hierarchical approach. J Mol Biol. 2000 Jun 30;300(1):171-85.
30) Amyloidogenic Proteins DB
This database consists of a subset of non-redundant proteins that have been identified as having a propensity to form amyloid fibrils and for which amyloidogenic regions have been identified through either structural studies (CD, X-ray or NMR) of the whole protein or via studies of peptide fragments. Many of the proteins were identified through literature surveys [1,2,3].
References:
1. Trovato A, Seno F, Tosatto SC. The PASTA server for protein aggregation prediction. Protein Eng Des Sel. 2007 Oct;20(10):521-3.
2. Zhang Z, Chen H, Lai L. Identification of amyloid fibril-forming segments based on structure and residue-based statistical potential. Bioinformatics. 2007 Sep 1;23(17):2218-25.
3. Hamodrakas SJ, Liappa C, Iconomidou VA. Consensus prediction of amyloidogenic determinants in amyloid fibril-forming proteins. Int J Biol Macromol. 2007 Aug 1;41(3):295-300.
Database Example Format File:
>Sequence name to 25 characters; SwissProt ID; PDB ID
MLVLFVATWSDLGLCKKRPKPGGWNTGGSRYPGQSSPGGNRYPPQSGGWGQPHGGGWGQP
------------------------------------------------------------
HGGGWGQPHGGGWGQPHGGGWGQGGGTHNQWNKPSKPKTNMKHMAGAAAAGAVVGGLGGY
----------------------------------------*******************-
MLGSAMSRPLIHFGNDYEDRYYRENMYRYPNQVYYRPVDQYSNQNNFVHDCVNITIKQHT
-------------------------------------------------***********
VTTTTKGENFTETDVKIMERVVEQMCITQYEKESQAYYQRGSSMVLFSSPPVILLISFL
***--------------------------------------------------------
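The annotated regions in a record of this kind can be pulled out with a short Python sketch such as the one below. The function name is hypothetical. The sketch assumes that '*' marks residues in an amyloidogenic region and '-' marks all other residues; that reading is inferred from the example and is not stated explicitly. Runs of '*' that continue across a line break are joined into one region.

```python
import re


def amyloid_regions(text):
    """Return (start, end, segment) for each run of '*' in the mask.

    Positions are 1-based and inclusive. Header lines ('>') are skipped;
    the remaining lines are read as alternating sequence / mask lines.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    body = [ln for ln in lines if not ln.startswith(">")]
    sequence = "".join(body[0::2])
    mask = "".join(body[1::2])
    if len(sequence) != len(mask):
        raise ValueError("sequence and mask lines differ in length")
    return [
        (m.start() + 1, m.end(), sequence[m.start():m.end()])
        for m in re.finditer(r"\*+", mask)
    ]
```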
This project is supported by Genome Alberta & Genome Canada, a not-for-profit organization that is leading Canada's national genomics strategy with $600 million in funding from the federal government.