Secondary Structure Prediction

Solving protein structures experimentally is labor-intensive and expensive. Structure prediction is a useful alternative, even though it is less accurate. Secondary structure can be predicted quite accurately, but tertiary structure prediction remains less accurate and often requires templates to model proteins.

Secondary Structure Prediction

The goal of secondary structure prediction is to assign a secondary structure state (α, β, coil) to each residue of an amino acid sequence. Prediction is needed because experimental methods are very time-consuming. In addition, not all proteins can be solved experimentally, as explained in the X-ray crystallography and NMR sections. Secondary structure prediction is the first step towards structure determination and is usually followed by tertiary structure determination.

Protein structure prediction vs structure assignment

  • Secondary Structure Assignment: You know the structure and you deduce the secondary structure from this structure.
  • Secondary Structure Prediction: You don't know the structure and you deduce it from the sequence.

Software such as DSSP and STRIDE assigns secondary structures based on hydrogen bonding and backbone dihedral angles. Structure prediction, by contrast, applies a scoring system to the sequence.

Secondary Structure Assignment

DSSP - Dictionary of Secondary Structure of Proteins
DSSP assigns secondary structures based solely on backbone-backbone H-bonds. The method defines an H-bond when the bond energy, computed from a Coulomb approximation, is below –0.5 kcal/mol. Assignments are defined such that visually appealing and unbroken structures result. There are 8 secondary structure classes, which are commonly reduced to three states (H, E, L) as follows:

  • H (α-helix) --> H
  • G (3₁₀-helix) --> H
  • I (π-helix) --> H
  • E (extended strand) --> E
  • B (residue in isolated β-bridge) --> E
  • T (turn) --> L
  • S (bend) --> L
  • " " (blank = other) --> L
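
The two DSSP steps described above (the H-bond energy test and the reduction from 8 classes to 3 states) can be sketched as follows. This is a minimal illustration, not the DSSP program itself. The partial charges (0.42 and 0.20) and the factor 332 come from the published DSSP definition rather than from this text, so treat them as assumptions. The function names are hypothetical.

```python
# Constants of the DSSP electrostatic model (assumed from the published
# definition; they are not stated in this text).
Q1 = 0.42            # partial charge on C=O, in units of the electron charge
Q2 = 0.20            # partial charge on N-H, in units of the electron charge
F = 332.0            # dimensional factor: distances in angstroms, energy in kcal/mol
HBOND_CUTOFF = -0.5  # kcal/mol, the threshold quoted in the text


def dssp_hbond_energy(r_on, r_ch, r_oh, r_cn):
    """Coulomb approximation of the C=O ... H-N interaction energy (kcal/mol).

    Arguments are the O-N, C-H, O-H and C-N interatomic distances in angstroms.
    """
    return Q1 * Q2 * F * (1.0 / r_on + 1.0 / r_ch - 1.0 / r_oh - 1.0 / r_cn)


def is_hbond(r_on, r_ch, r_oh, r_cn):
    """True when the energy is below the -0.5 kcal/mol cutoff."""
    return dssp_hbond_energy(r_on, r_ch, r_oh, r_cn) < HBOND_CUTOFF


# Reduction of the 8 DSSP classes to 3 states, exactly as listed above.
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "L", "S": "L", " ": "L",
}


def reduce_to_three_states(dssp_string):
    """Map a string of 8-class DSSP codes to the 3-state alphabet H/E/L."""
    return "".join(EIGHT_TO_THREE[code] for code in dssp_string)


if __name__ == "__main__":
    # Illustrative distances only, not taken from a real structure.
    print(is_hbond(2.9, 3.0, 1.9, 4.1))
    print(reduce_to_three_states("HGIEBTS "))
```

Other reduction schemes exist (for example, mapping G or B to L), so the mapping used should always be stated when reporting results.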

STRIDE (secondary STRuctural IDEntification method)
STRIDE uses an empirically derived H-bond energy and phi-psi torsion angle criteria to assign secondary structures. Torsion angles are given alpha-helix and beta-sheet propensities according to how close they are to their regions in the Ramachandran plot. The parameters are optimized to mirror visual assignments made by crystallographers for a set of proteins.

Other methods
SECSTR: Same family of methods, developed specifically to improve the detection of π-helices
DEFINE, PSEA: Rely on Cα coordinates only
P-CURVE: Based on definition of helicoidal parameters
KAKSI: Based on Cα distances and torsion angles

There are several legitimate ways to define secondary structures. Different methods give different assignments, especially at the edges of secondary structure segments. The agreement between DSSP, P-CURVE and DEFINE is only 63%. The resolution of a structure appears to have a moderate effect on the assignments. The technique used (X-ray vs NMR) has a more pronounced effect.

Signals for alpha helices

  • characteristic hydrophobicity profiles
  • prolines disrupt the middles of helices
  • periodicity of 3.6 residues per turn
  • conserved hydrophobics at i, i+3, i+4, i+7
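
The last two signals are linked: with 3.6 residues per turn, each residue is rotated by about 100° around the helix axis, so positions i, i+3, i+4 and i+7 end up on the same face. The following sketch checks this on a helical wheel. The function names and the 70° tolerance are illustrative assumptions, not part of any standard tool.

```python
RESIDUES_PER_TURN = 3.6
DEGREES_PER_RESIDUE = 360.0 / RESIDUES_PER_TURN  # about 100 degrees


def wheel_angle(offset):
    """Angular position (degrees, 0-360) of residue i+offset on a helical wheel."""
    return (offset * DEGREES_PER_RESIDUE) % 360.0


def angular_distance(a, b):
    """Smallest angle (degrees) between two positions on the wheel."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def same_face(offsets, tolerance=70.0):
    """True if all offsets lie within `tolerance` degrees of the first one.

    The tolerance is an arbitrary choice made for this illustration.
    """
    reference = wheel_angle(offsets[0])
    return all(
        angular_distance(wheel_angle(o), reference) <= tolerance for o in offsets
    )


if __name__ == "__main__":
    for offset in (0, 3, 4, 7):
        print(offset, round(wheel_angle(offset), 1))
    print(same_face([0, 3, 4, 7]))
    print(same_face([0, 2]))
```

A residue at i+2, by contrast, sits about 160° away from residue i, on the opposite side of the helix.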

Signals for coils

  • gapped in multiple alignments
  • small or polar residues (Ala, Gly, Ser, Thr)
  • prolines rarer in other kinds of secondary structure

Structure Prediction

Prediction accuracy is measured by Q3 (per-residue accuracy) or by SOV, the segment overlap value (per-segment accuracy). Q3 gives the percentage of residues correctly predicted in the α, β and other states. SOV indicates how well the secondary structure elements as a whole have been predicted. It measures:

  • number of segments in proteins
  • average segment length
  • distribution of number of segments with length
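
A minimal sketch of these measures, assuming the observed and predicted structures are given as equal-length strings over the three-state alphabet (H, E, L). It computes Q3 and the simple segment statistics listed above. It does not implement the full SOV score, which has a more involved definition. The function names are hypothetical.

```python
from itertools import groupby


def q3(observed, predicted):
    """Percentage of residues whose predicted state equals the observed state."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("strings must be non-empty and of equal length")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


def segments(states):
    """List of (state, length) for each run of identical states."""
    return [(state, len(list(run))) for state, run in groupby(states)]


def segment_summary(states, state):
    """Number of segments of `state` and their average length."""
    lengths = [n for s, n in segments(states) if s == state]
    average = sum(lengths) / len(lengths) if lengths else 0.0
    return len(lengths), average


if __name__ == "__main__":
    observed = "HHHHLLEEEE"
    predicted = "HHHLLLEEEL"
    print(q3(observed, predicted))
    print(segments(observed))
    print(segment_summary(predicted, "H"))
```

Two predictions with the same Q3 can differ greatly in segment count and length, which is why a per-segment measure is reported alongside Q3.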

First-generation structure prediction tools were knowledge-based. They used single-residue statistics derived from databases of limited size, relying on the preference of particular residues for certain secondary structure elements. Overall, they have < 55% Q3 accuracy.
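
A single-residue statistic of this kind can be sketched as a propensity: how often a residue type occurs in a given state, relative to how often any residue occurs in that state. The code below is an illustration in the spirit of such methods, not a reimplementation of any particular first-generation tool. The function name and the toy data are assumptions.

```python
from collections import Counter


def propensities(sequences, structures, state):
    """Propensity of each residue type for `state`.

    propensity(aa) = (fraction of aa found in `state`)
                     / (fraction of all residues found in `state`)
    Values above 1 mean the residue favors the state.
    """
    total = Counter()
    in_state = Counter()
    for seq, ss in zip(sequences, structures):
        if len(seq) != len(ss):
            raise ValueError("sequence and structure lengths differ")
        for aa, s in zip(seq, ss):
            total[aa] += 1
            if s == state:
                in_state[aa] += 1
    n_all = sum(total.values())
    n_state = sum(in_state.values())
    if n_state == 0:
        return {}
    background = n_state / n_all
    return {aa: (in_state[aa] / total[aa]) / background for aa in total}


if __name__ == "__main__":
    # Toy data, far too small for real statistics.
    seqs = ["AAGG", "AEGP"]
    structs = ["HHLL", "HHLL"]
    print(propensities(seqs, structs, "H"))
```

Because each residue is scored independently of its neighbors, such statistics miss the local context that later generations of methods exploit.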

Second-generation structure prediction tools use machine learning. They use larger databases, produce segment-based statistics, and take neighboring residues into account. The algorithms use statistical information, sequence patterns, neural networks, etc. ALB, COMBINE, and GORIII are such methods, with < 65% Q3 accuracy.

Third-generation structure prediction tools also use evolutionary information. PHD and PSIPRED reach a Q3 accuracy of about 75% ± 11%.

These accuracies apply to globular, water-soluble proteins when the multiple sequence alignment (MSA) contains diverse sequences. There is occasional confusion between H and E. Most methods predict the central regions of secondary structure elements better than their caps.
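
Evolutionary information is typically supplied to a predictor as a per-column profile of the alignment. The sketch below builds a simple frequency profile and a per-column gap fraction from a toy alignment. It shows only the kind of input such methods use, not how PHD or PSIPRED actually work. The function names and the "-" gap character are assumptions.

```python
from collections import Counter


def column_profile(alignment, gap="-"):
    """Per-column residue frequencies of an MSA (gaps excluded).

    `alignment` is a list of equal-length aligned sequences.
    Returns one dict {residue: frequency} per column.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != gap)
        n = sum(counts.values())
        profile.append({aa: k / n for aa, k in counts.items()} if n else {})
    return profile


def gap_fraction(alignment, gap="-"):
    """Fraction of gap characters in each column."""
    return [column.count(gap) / len(column) for column in zip(*alignment)]


if __name__ == "__main__":
    msa = ["ACD", "A-D", "GCD"]  # toy alignment
    print(column_profile(msa))
    print(gap_fraction(msa))
```

Columns with a high gap fraction correspond to the "gapped in multiple alignments" signal for coils, while conserved hydrophobic columns at helical spacing support a helix.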

Secondary structure prediction tools are useful for

  • Chain tracing
  • Starting point for 3D structure modelling
  • Fold recognition
  • Homology modelling
  • Functional assignment

There are many secondary structure prediction tools, such as mPredict, PSIPRED, and PREDATOR.