Brought to you by molecularsciences.org.
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 License.
This publication may not be redistributed without this notice.

Structural Bioinformatics

Structural bioinformatics is a branch of science concerned with the analysis and prediction of the 3D structure of biological macromolecules, proteins in particular. Structural bioinformaticians use sequence data, sequence alignment data, NMR data and x-ray crystallographic data along with various visualization, modeling and prediction tools to analyse and predict the structure, function and behavior of their molecules of interest.

Importance of protein Structure

Proteins are the functional units of life. They are involved in everything from gene expression regulation to defense of an organism. The following is a list of their most important functions:

Function     Example
Catalysis    Enzymes such as kinases speed up reactions by lowering the activation energy (ΔG‡) of a reaction.
Structure    Collagen
Transport    Hemoglobin
Movement     Actin and myosin in muscles
Regulation   DNA polymerase
Defense      Antibodies

Protein structure leads to its function

Proteins evolved under selective evolutionary pressure to carry out specific tasks. All these functions are defined largely due to interactions with other molecules. The way a protein interacts with molecules in its environment depends on its three dimensional fold. This fold refers to the overall shape, the surface, active sites, and positioning of key amino acids.

The structure of a protein defines what a protein can or cannot do. The distinctive amino acid sequence of a protein allows particular chemical groups to be placed at specific positions in 3D space. Even a minor modification such as changing one amino acid can change the structure of a protein significantly, thus modifying its function. For example, sickle cell anemia results from a hemoglobin in which the sixth amino acid of the β chain is changed from glutamic acid to valine. Protein structures are highly diverse, and the functional diversity of these structures is expanded further through interactions with smaller molecules.

Some common uses of protein structures are:

Why use structure data

Both sequences and structures are suitable candidates for predicting protein functions. However, sequence data is much more readily available than structure data. Structure data has an advantage over protein sequence data; structure data is far better conserved than sequence data over evolutionary time. Often coupling sequence data with structure data results in better predictions.

Apparently unrelated sequences can have similar structures. This indicates that the total number of protein folds should be much smaller than the total number of sequences. In fact, nature seems to be quite conservative in its choice of structures, preferring to conserve or slightly modify existing structures rather than invent new ones.

Structure-function relationships

Protein structure greatly simplifies the task of identifying protein function. However, there is no 1-1 relationship between protein structure and function. There are many proteins with similar structure but very different functions.

Quite different sequences can adopt the same structure. This fact can be useful in identifying evolutionary relationships. It can, however, suggest false relationships as well. Analogous proteins are proteins that have the same function but do not share ancestry. Homologous proteins share ancestry.

Amino acids

Proteins are linear polymers built of amino acids. The sequence of the amino acids determines the three-dimensional fold of the protein. The amino acid sequence also defines the flexibility or rigidity of a protein. This structural property is crucial to protein function. There are 20 commonly occurring amino acids in nature. A protein can contain any combination and number of these 20 amino acids.

Proteins contain a wide range of functional groups such as thiols, alcohols, carboxamides and many more. The chemical activity of these functional groups is essential for the function of proteins.

Amino Acids

Amino acids are small molecules that contain:

The R group distinguishes one amino acid from another. Furthermore, the side chain is responsible for the specific chemical properties of the amino acid.

19 of the 20 common amino acids have a chiral α-carbon atom. Gly does not. Mirror image pairs of amino acids are designated by L (levo) and D (dextro). Proteins are assembled from L amino acids. Only a few D amino acids occur in nature. Almost all sugars, in contrast, have the D configuration. Threonine and isoleucine have 2 chiral carbons each, thus producing 4 possible stereoisomers each. Isomers depend on the arrangement of the 4 groups around the chiral center. Amino acids are L or D depending on the position of the amino group.

Properties of amino acids
1-letter  3-letter  Amino acid     Polarity  Acidity / Basicity
A         Ala       Alanine        nonpolar  neutral
C         Cys       Cysteine       polar     neutral
D         Asp       Aspartic acid  polar     acidic
E         Glu       Glutamic acid  polar     acidic
F         Phe       Phenylalanine  nonpolar  neutral
G         Gly       Glycine        nonpolar  neutral
H         His       Histidine      polar     weakly basic
I         Ile       Isoleucine     nonpolar  neutral
K         Lys       Lysine         polar     basic
L         Leu       Leucine        nonpolar  neutral
M         Met       Methionine     nonpolar  neutral
N         Asn       Asparagine     polar     neutral
P         Pro       Proline        nonpolar  neutral
Q         Gln       Glutamine      polar     neutral
R         Arg       Arginine       polar     strongly basic
S         Ser       Serine         polar     neutral
T         Thr       Threonine      polar     neutral
V         Val       Valine         nonpolar  neutral
W         Trp       Tryptophan     nonpolar  neutral
Y         Tyr       Tyrosine       polar     neutral

Classification of Amino acids

Based on the table above, the amino acids can be classified into several groups.

Charged Amino acids

Charged amino acids carry a net negative or positive charge on their side chain at physiological pH.

Acidic / negatively charged amino acids: ED
Basic / positively charged amino acids: KRH

Hydrophobic amino acids - FW MAIL PGV

Polar - SCN WYTH

Polar amino acids are uncharged overall but have an uneven charge distribution.
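The classification can also be written down as a small lookup, which is handy for summarizing a sequence. The following is a minimal sketch that uses the polarity and acidity/basicity columns of the properties table above (not the mnemonic groups); the function name `composition` and the example peptide are hypothetical and only serve as an illustration.

```python
# Side-chain classes taken from the "Properties of amino acids" table.
NONPOLAR = set("AFGILMPVW")
POLAR = set("CDEHKNQRSTY")
ACIDIC = set("DE")   # negatively charged side chains
BASIC = set("KRH")   # positively charged side chains

def composition(sequence):
    """Count nonpolar, polar, acidic and basic residues in a one-letter sequence."""
    counts = {"nonpolar": 0, "polar": 0, "acidic": 0, "basic": 0}
    for aa in sequence.upper():
        if aa in NONPOLAR:
            counts["nonpolar"] += 1
        elif aa in POLAR:
            counts["polar"] += 1
        else:
            raise ValueError(f"unknown residue code: {aa!r}")
        # Acidic and basic residues are subsets of the polar class.
        if aa in ACIDIC:
            counts["acidic"] += 1
        elif aa in BASIC:
            counts["basic"] += 1
    return counts

# Hypothetical peptide, used only as an example input.
print(composition("MKTAYIAKQR"))
```

Note that in this sketch the acidic and basic counts overlap with the polar count, because the table lists all charged residues as polar.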

Memorizing amino acids

Memorizing the 20 amino acids is among the least enjoyable things a life sciences student has to do. All biology, biochemistry and other life sciences students are required to 'know' the amino acids for one or more exams. This knowledge remains very useful even after you have passed your exam. Following is what you need to know about amino acids:

The best way to memorize is to associate something you would like to retain with something you already know. In addition, committing things to your memory gradually over time is a better way to memorize than trying to memorize everything all at once. The best approach to memorizing amino acids is to use logic, name recognition and similarities rather than differences. Learning amino acids in detail now would allow you to acquire a better understanding of proteins structure later on.

Memorizing abbreviations

The following tips were devised by my biochemistry 101 buddies and myself. They seem rather awkward but they help us avoid confusion to this day. It is amazing how often people confuse the amino acids.

Memorizing amino acid classifications
Memorizing amino acid structures

Our strategy is to memorize one structure and use it to infer another. So let's begin with the easiest one: glycine.

           COOH     Glycine
         α |        
  +H3N --- C --- H
           |
           H

When we replace the H with a methyl group, we get Alanine.

           COOH     Alanine
         α |        
  +H3N --- C --- H
           |
           CH3

When we add a phenyl group to alanine, we get phenylalanine.

           COOH     Phenylalanine
         α |        
  +H3N --- C --- H
           |
           CH2
           |
          / \
         | O |
          \ /

When we add a hydroxyl group to Phe, we get Tyrosine.

           COOH     Tyrosine
         α |        
  +H3N --- C --- H
           |
           CH2
           |
          / \
         | O |
          \ /
           |
           OH

There are only 2 acidic amino acids. To be acidic, they must have a negatively charged R side chain. This negative charge is provided by COO-. Aspartic acid is formed by adding a carboxylate group to the side chain of Alanine:

           COOH     Aspartic Acid
         α |        
  +H3N --- C --- H
           |
           CH2
           |
           COO-

Glutamic acid is formed by inserting another CH2 into Aspartic acid:

           COOH     Glutamic Acid
         α |        
  +H3N --- C --- H
           |
           CH2
           |
           CH2
           |
           COO-

In their ionized form, glutamic acid and aspartic acid are called glutamate and aspartate. Glutamine and Asparagine are the amide derivatives of Glu and Asp.

           COOH     Asparagine
         α |        
  +H3N --- C --- H
           |
           CH2
           |
           C=O
           |
           NH2

           COOH     Glutamine
         α |        
  +H3N --- C --- H
           |
           CH2
           |
           CH2
           |
           C=O
           |
           NH2

All three basic amino acids (KRH) have a positive charge on the nitrogen in the R side chain.

           COOH     Lysine
         α |        
  +H3N --- C --- H
           |
           CH2
           |
           CH2
           |
           CH2
           |
           CH2
           |
           NH3+

           COOH     Arginine
         α |        
  +H3N --- C --- H
           |
           CH2
           |
           CH2
           |
           CH2
           |
           NH
          /
    +H2N=C
          \
           NH2

           COOH     Histidine
         α |        
  +H3N --- C --- H
           |
           CH2
           |
           Imidazole

Serine is formed by adding a hydroxyl group to Alanine.

           COOH     Serine
         α |        
  +H3N --- C --- H
           |
           CH2
           |
           OH

Threonine is formed by adding a CH3 to Serine.

           COOH     Threonine
         α |        
  +H3N --- C --- H
           |
         H-C-OH
           |
           CH3

Cysteine is formed by replacing the O with S in Serine.

           COOH     Cysteine
         α |        
  +H3N --- C --- H
           |
           CH2
           |
           SH

           COOH     Methionine
         α |        
  +H3N --- C --- H
           |
           CH2
           |
           CH2
           |
           S
           |
           CH3

Valine has a V-shaped side chain.

           COOH     Valine
         α |        
  +H3N --- C --- H
           |
           CH
          / \
        CH3   CH3

Leucine has a Y shaped side chain.

           COOH     Leucine
         α |        
  +H3N --- C --- H
           |
           CH2
           |
           CH
          / \
        CH3   CH3

Isoleucine has an upside-down L-shaped side chain.

           COOH     Isoleucine
         α |        
  +H3N --- C --- H
           |
         H-C-CH3
           |
           CH2
           |
           CH3

Proline is shaped like a pentagon with the amino group incorporated in the ring.

          COOH     Proline
        α |        
   HN --- C --- H
    |     |
    CH2   CH2
     \   /
      CH2

Stereochemistry

19 of the 20 common amino acids have a chiral alpha-carbon atom. Gly does not. Mirror image pairs of amino acids are designated by L (levo) and D (dextro). Proteins are assembled from L amino acids. Only a few D amino acids occur in nature. Threonine and isoleucine have 2 chiral carbons each, thus producing 4 possible stereoisomers each.

Isomers depend on the arrangement of the 4 groups around the chiral center. Amino acids are L or D depending on the position of the amino group.

           COOH               COOH     
         α |                α |       
  +H3N --- C --- H      H --- C --- NH3+
           |                  |
           CH3                CH3
      L-Alanine            D-Alanine

pH & pKa

pKa values of amino acid side chains play an important role in defining the pH-dependent characteristics of a protein. The pH-dependence of the activity displayed by enzymes and the pH-dependence of protein stability, for example, are properties that are determined by the pKa values of amino acid side chains.

Amino acids have both weakly acidic and weakly basic character. The transition at low pH corresponds to titration of the carboxyl group. The transition at high pH corresponds to titration of the amino group. The isoelectric point (pI) is the pH at which the molecule has no net charge. The zwitterionic form is the most populated form at neutral pH.

When the pH equals the pKa of a group, half of the molecules are protonated and the other half are deprotonated.
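The relationship between pH, pKa and the protonation state follows from the Henderson-Hasselbalch equation. The sketch below computes the protonated fraction of a single titratable group; the pKa of 4.0 is an assumed, illustrative value (roughly that of a side-chain carboxyl group), not a number taken from this text.

```python
def fraction_protonated(pH, pKa):
    """Fraction of a titratable group that is protonated (Henderson-Hasselbalch)."""
    return 1.0 / (1.0 + 10.0 ** (pH - pKa))

# Illustrative pKa (an assumption, not a measured value for any specific protein).
PKA_EXAMPLE = 4.0

for pH in (2.0, 4.0, 6.0):
    print(pH, round(fraction_protonated(pH, PKA_EXAMPLE), 3))
```

Two pH units below the pKa the group is almost fully protonated, two units above it is almost fully deprotonated, and at pH = pKa the two forms are equally populated.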

Protein Structures

There are 4 levels of protein structures:

Primary Structure


A primary structure is a sequence of amino acids that are covalently bound together. The amino acid sequence is responsible for the unique characteristics of every protein. Peptides are read from the amino terminus to the carboxyl terminus.

Amino acids are joined by peptide bonds. A peptide bond is formed between the carboxyl group of one amino acid and the amino group of the next, with the release of a water molecule. Peptide bonds usually adopt the trans conformation to prevent crowding. Proline is an important exception to this rule: the peptide bond preceding a proline is found in the cis conformation far more often than for other residues. The six atoms of the peptide group are coplanar.

The peptide bond is rigid and planar. Therefore, the polypeptide chain can only rotate about the two bonds formed by each Cα. The rotation angles about these bonds are termed phi (φ, about the N-Cα bond) and psi (ψ, about the Cα-C bond). The rotational freedom about φ and ψ is limited by steric hindrance between the side chains of the residues and the peptide backbone. Consequently, the possible conformations of a given polypeptide chain are quite limited.

cis & trans isomers

Dihedral angles


Dihedral angles are defined by 3 vectors, 4 atoms:

Look along the appropriate bond, phi or psi "from the N terminal end".
A rotation of the bonds/atoms connected to the further end (C terminal end) in a "clockwise" sense is a positive rotation.
An anticlockwise rotation is negative.
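The sign convention above can be turned into a short calculation from four atom positions. The following is a minimal sketch using only the standard library; the helper names are hypothetical and the coordinates are made-up test points rather than real protein atoms. For φ the four atoms would be C(i-1), N(i), Cα(i), C(i), and for ψ they would be N(i), Cα(i), C(i), N(i+1).

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by 4 atoms (3 bond vectors).

    Looking along the central bond p1 -> p2, a clockwise rotation of the
    far atom p3 relative to the near atom p0 gives a positive angle.
    """
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b1, b2)   # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)   # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Made-up coordinates: the central bond lies along the z axis.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, -1, 1)))
```

The two test calls differ only in which side of the central bond the fourth atom sits on, so they return angles of equal magnitude and opposite sign.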

Peptide bond planes can rotate relative to each other:

Ramachandran Plot

A Ramachandran Plot is a plot of φ vs. ψ angles. It maps the entire conformational space of a polypeptide and illuminates the allowed and disallowed conformations. Different amino acids have different preferences of φ-ψ angles.

Some key exceptions to these conformational limitations can be attributed to glycine and proline. The single H side chain of Glycine greatly reduces steric hindrance and expands the possible conformational space. The cyclic bond present in proline reduces the conformational space.

The nature of protein sequence and composition reflects its function. Membrane proteins have more hydrophobic residues. Homologous proteins often have similar sequences. Sequence similarity often implies similar secondary and tertiary structures.

Secondary Structure

The secondary structure refers to certain repetitive conformations in short sections of the peptide backbone. It can be thought of as the local conformation of the polypeptide chain, independent of the rest of the protein. The limitations imposed by peptide bonds and hydrogen bonding considerations dictate the secondary structure.

Some of the more commonly occurring secondary structures are:

φ and ψ conformations are specific and repetitive in α helices and β sheets. Conformations can be random in coils and loops.

α Helix

A helix is created by the coiling of a polypeptide chain. The chain can coil to the right or to the left. Almost all helices coil to the right. An α helix has 3.6 residues per turn. The structure is stabilized by hydrogen bonds between the N-H and C=O groups of the peptide backbone. Other helices, such as the π helix with 4.4 residues per turn and the 3₁₀ helix with 3 residues per turn, have been observed in nature. However, such helices are rare.

R groups extend radially from the α helix core. The choice of residues extending from the helix make the helix polar, hydrophobic or amphipathic.
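With 3.6 residues per turn, each residue is rotated by 360 / 3.6 = 100 degrees relative to the previous one when the helix is viewed down its axis. The sketch below uses this to place residues on a helical wheel, which is one way to see which side chains end up on the same face of a helix. The function name `wheel_angles` and the example peptide are hypothetical.

```python
RESIDUES_PER_TURN = 3.6                          # value quoted in the text
DEGREES_PER_RESIDUE = 360.0 / RESIDUES_PER_TURN  # 100 degrees per residue

def wheel_angles(sequence):
    """Return (residue, angle in degrees) pairs for a helical-wheel projection."""
    return [(aa, (i * DEGREES_PER_RESIDUE) % 360.0)
            for i, aa in enumerate(sequence)]

# Hypothetical 8-residue peptide, used only to show the arithmetic.
for aa, angle in wheel_angles("LKELLKKL"):
    print(aa, round(angle))
```

Residues at positions i, i+3, i+4 and i+7 fall within roughly 100 degrees of arc of each other, so placing hydrophobic residues at that spacing produces one hydrophobic face and hence an amphipathic helix.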

Different amino-acid sequences have different propensities for forming α helical structure. Methionine, alanine, leucine, glutamate, and lysine "MALEK" all have especially high helix-forming propensities. Proline tends to break or kink helices because it cannot donate an amide hydrogen bond (having no amide hydrogen), and because its sidechain interferes sterically; its ring structure also restricts its backbone. However, proline is often seen as the first residue of a helix, presumably due to its structural rigidity. At the other extreme, glycine also tends to disrupt helices because its high conformational flexibility makes it entropically expensive to adopt the relatively constrained α helical structure.

β Sheets


Unlike helices, β sheets are formed by hydrogen bonds between adjacent polypeptide strands rather than within a single continuous stretch of the chain. Sections of the polypeptide chain participating in the sheet are called β strands. β strands are formed by rotating φ and ψ approximately 180 degrees with respect to each other. Thus the peptide chains are fully extended and pleated because the adjacent peptides cannot be coplanar. β sheets are stabilized by interchain hydrogen bonds between N-H and C=O groups.

β sheets can be either parallel or antiparallel. In a parallel configuration, both strands are lined up in the same direction e.g. C-terminal to N-terminal. In antiparallel configuration, one strand is lined N-terminal to C-terminal while the other is lined C-terminal to N-terminal. Parallel configuration does not have optimal hydrogen bond formation. Consequently, parallel configuration is less stable than antiparallel configuration. Mixed β sheets have both parallel and antiparallel configurations.

Secondary Structural Elements

Although α helices and β sheets are the predominant secondary structures, several irregular structures such as turns, loops, and coils are found in nature. Loops are usually present at the surface of the protein and often form the transitions between regular structures. Loops often form part of an active or binding site, e.g. on antibodies. Structurally speaking, turns and loops allow compaction of the protein.

Turns occur when there is a reversal in orientation of the main chain. They are stabilized by hydrogen bonds bridging across the interior of the turn. Loops are located on the surface of the protein. They connect two antiparallel strands.

Amino acids in α helices and β sheets show different geometries, as can be seen in the Ramachandran plot.

Right-handed α helix:

Left-handed α helix:

β strand

Tertiary Structure

Tertiary structure is the 3D fold of the polypeptide chain in space. Motifs are a limited number of secondary structure elements combined into simple folds. Domains are several motifs packed in a specific, compact arrangement that in many cases can fold as an independent unit. Large proteins consist of multiple domains connected by flexible segments of the peptide chain.

Some general tertiary structures include:

The tertiary structure is stabilized by weak noncovalent interactions such as van der Waals forces, hydrophobic interactions, hydrogen bonds and salt bridges. Stronger covalent bonds such as disulfide bonds, as well as metal coordination, also contribute to the overall stability.

Domains and Motifs

Domains are compact sections of the protein that represent structurally and functionally independent regions of a protein. This is to say that a domain is a subsection of the protein which would maintain its characteristic structure, even if separated from the overall protein. Motifs are substructures formed from a few secondary structures. Motifs are usually not structurally independent. They can be considered to be the minimal functional units of a protein. e.g.

Quaternary Structure

Quaternary structure is the association of multiple subunits (identical or different), each with a tertiary structure and each a unique gene product. Subunits are held together by many weak, noncovalent interactions (hydrophobic, electrostatic). Symmetry controls both structure and function of a quaternary structure.

X-ray crystallography

X-ray crystallography is a technique which is widely used to determine structures of proteins. X-ray crystallography exploits the fact that X-rays are scattered or diffracted in a predictable manner when they pass through a protein crystal. X-rays are diffracted when they encounter electrons, so the nature of the scattering depends on the number of electrons that are present in each atom and the organization of the atoms in space. Diffracted X-rays can interfere with each other constructively or destructively. Therefore, when protein molecules are regularly arranged in a crystal, the interaction between X-rays scattered in the same direction generates an interpretable pattern of spots. The crystal essentially amplifies the diffraction signal. The generated diffraction patterns are used to build a 3D image of the electron clouds of the molecule. This is known as an electron density map. The structural model of the protein is built within this electron density map. [2]

Why do we need x-ray crystallography

The function of a protein depends on its structure. Therefore determining protein structure accurately and determining reliable answers to structure related questions is crucial. Knowledge of accurate molecular structures is a prerequisite for drug design. X-ray crystallography is the oldest and most widely used technique to determine protein structure.

X-ray crystallography provides atomic level resolution and has no protein mass limit. However, it requires protein crystals, which can be very difficult to produce. X-ray crystallography provides a static average of the protein structure. Hydrogen atoms are hardly visible by x-ray crystallography. Accuracy depends heavily on the quality of the crystal.

Alternate Methods

NMR and electron microscopy can be considered alternatives to x-ray crystallography. X-ray crystallography and NMR produce atomic level resolution. Electron microscopy offers molecular resolution. NMR is more difficult to use and much more expensive.

The method's principle

According to the method's principle, the limit of resolution (LR) depends on the wavelength being used.

LR = λ/2

A light microscope uses visible light with a wavelength of 400 nm < λ < 800 nm. Thus, the best value for LR is 400/2 = 200 nm, which is suitable for viewing organelle structures but not protein structures.

X-rays have wavelengths of 0.1 Å < λ < 100 Å (0.01 nm < λ < 10 nm). With wavelengths of around 1 Å, the LR is less than the distance between two atoms, 1.2 Å. Therefore, x-rays can be used at the atomic level.
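The arithmetic behind LR = λ/2 is easy to check for both cases. The following sketch compares the 400 nm wavelength quoted above with an x-ray wavelength of 1 Å (0.1 nm); the 1 Å value is an assumed example from within the x-ray range, and the function name is hypothetical.

```python
ANGSTROM_PER_NM = 10.0

def limit_of_resolution_nm(wavelength_nm):
    """Limit of resolution LR = wavelength / 2, in nanometres."""
    return wavelength_nm / 2.0

# Wavelengths in nm; the x-ray value is an assumed example (1 angstrom).
examples = {"400 nm light": 400.0, "1 angstrom x-rays": 0.1}

for label, wavelength in examples.items():
    lr_nm = limit_of_resolution_nm(wavelength)
    print(f"{label}: LR = {lr_nm} nm = {lr_nm * ANGSTROM_PER_NM} angstrom")
```

The first case gives an LR far larger than any protein, while the second gives an LR below the interatomic distance of 1.2 Å quoted above.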

X-rays are formed by the collision of fast electrons with matter. The wavelength of the generated x-rays depends on the material with which the electrons collide. Monochromatic x-rays are used to solve smaller molecules.

Workflow

  1. protein expression
  2. protein purification
  3. crystal production
  4. x-ray diffraction & phasing
  5. data collection - analysis of diffraction patterns
  6. model construction - I = A² (intensity = square of amplitude)

Crystallization


A crystal is a solid formed by ordered atoms and ions. Ordered means that the same pattern is repeated along a regular lattice. Crystals are necessary since diffraction from individual molecules is too weak to measure. Crystals act as amplifiers by increasing the scattering signal, since they contain many copies of the same molecule ordered in the same fashion.

Accurate structure determination requires a well-ordered crystal that diffracts X-rays strongly. Hydrophobic proteins or proteins with hydrophobic domains are the most difficult to crystallize.

Crystal formation is a multiparametric process. It involves three steps:

  1. Nucleation
  2. Growth
  3. Cessation of growth

Crystallization is nothing more and nothing less than forcing a protein to precipitate into regularly ordered three dimensional arrays. These 3D arrays are the crystal.

Protein Solubility

Proteins are placed in solution with salts (precipitants). The solubility curve is a representation of protein solubility. Saturation occurs when the rate of loss and gain of both the solid and solution phases of the protein are equal, and the system is in equilibrium. Salting-out occurs when protein solubility decreases as the concentration of salt increases. Salting-in occurs when protein solubility increases as the concentration of salt increases. Nucleation occurs in the labile zone. Crystal growth occurs in the metastable zone. The goal is not to precipitate the protein, but to keep it in the labile and metastable zones until crystals are formed. The probability of nucleation increases with increasing supersaturation.

Crystallization energy barrier


There is a crystallization energy barrier which must be overcome by proteins before they can crystallize. The critical nucleus corresponds to the higher energy intermediate. The higher the energy barrier, the slower the rate of nucleation.

Saturation Zones

The probability of nucleation increases with increasing supersaturation. Supersaturation increases the likelihood that a critical nucleus will form. In addition, a smaller nucleus is needed to induce crystal formation in a supersaturated solution. See phase diagram above. Saturation increases as we go from left to right.

Crystallization experiment issues

A protein solution is mixed with precipitating reagents such as NH4PO3 to induce nucleation and subsequently crystal growth. The choice of the precipitating agent is important since no one reagent is compatible with all proteins.

Crystallization is affected by several parameters such as:

Purity and homogeneity of the macromolecules are very important. Purity refers to the lack of contaminants. Homogeneity refers to the lack of both conformational heterogeneity and sequence heterogeneity. Conformational heterogeneity refers to flexible domains and denaturation. Sequence heterogeneity refers to PTMs and proteolytic fragmentation. To reduce conformational heterogeneity, it is common to block a flexible domain with a ligand or chop it off.

There are several ways to detect contamination and heterogeneity such as gel electrophoresis, immunological titrations, etc.

In general, the more you know about your protein, the more likely you are to crystallize it. Homogeneous, compact, and globular proteins are more likely to be crystallized than heterogeneous and non-globular ones.

Solubility space
Crystallization experiments depend on large amounts of pure, soluble protein. However, it is difficult to obtain and purify large amounts of a rare protein. A sparse matrix allows rapid sampling of solubility space. It is a matrix of buffers which is used to make crude extracts that are rapidly assayed for the soluble protein using gel electrophoresis. A sparse matrix refers to a system which loosely couples different entities.

Sparse Matrix
A sparse matrix is a matrix of buffers. The matrix is used to quickly estimate the best buffer for a given protein. Based on this technology, several commercial vendors supply screens which automate crystal growth. Their products range from identical to similar.

Growing Crystals

There are several setups which allow crystal growth. Most use a variant of the vapor diffusion method. The most widely used method is the hanging drop method.

Vapor Diffusion
A few microliters of protein solution are mixed with an approximately equal amount of reservoir solution containing the precipitants. A drop of this mixture is put on a glass slide which covers the reservoir. As the protein/precipitant mixture in the drop is less concentrated than the reservoir solution (we mixed the protein solution with the reservoir solution about 1:1), water evaporates from the drop into the reservoir. As a result the concentration of both protein and precipitant in the drop slowly increases, and crystals may form. There is a variety of other techniques available such as sitting drops, dialysis buttons, and gel and microbatch techniques.
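The concentration change in the drop can be estimated with simple arithmetic. The sketch below assumes an idealized experiment: only water leaves the drop, the reservoir is large enough that its concentration does not change, and equilibrium is reached when the precipitant concentration in the drop equals that of the reservoir. The function name, units and example numbers are hypothetical.

```python
def hanging_drop(protein_stock, precipitant_reservoir,
                 v_protein=1.0, v_reservoir_added=1.0):
    """Idealized start and end concentrations in a vapor diffusion drop.

    Concentrations are in arbitrary consistent units, volumes in microliters.
    Assumes only water evaporates and the reservoir concentration is constant.
    """
    v_start = v_protein + v_reservoir_added
    protein_start = protein_stock * v_protein / v_start
    precipitant_start = precipitant_reservoir * v_reservoir_added / v_start
    # The drop shrinks until its precipitant concentration matches the reservoir.
    v_end = v_start * precipitant_start / precipitant_reservoir
    return {
        "start": (protein_start, precipitant_start, v_start),
        "end": (protein_start * v_start / v_end, precipitant_reservoir, v_end),
    }

# Hypothetical example: 10 mg/mL protein, 2.0 M precipitant, mixed 1 uL + 1 uL.
result = hanging_drop(10.0, 2.0)
print("start (protein, precipitant, volume):", result["start"])
print("end   (protein, precipitant, volume):", result["end"])
```

Under these assumptions a 1:1 drop loses half of its volume, so both the protein and the precipitant concentration double on the way to equilibrium.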

Cat whisker streaking is the preferred method of seeding. Touch a crystal with a whisker and seeds will be dislodged by friction.

Alternate methods for growing crystals

Is it a crystal?

So we have a crystal. Is it a salt crystal or a protein crystal? One way to find out is to set up a no-protein control drop: an identical experiment, but with a drop which doesn't contain protein. Another way is to run the crystal on a gel. With this method, you lose your crystal. The most definitive way to tell your crystals apart from salt is by testing the diffraction properties.

Don't judge the crystal by its looks. A nice crystal may not diffract and an ugly crystal may diffract at high resolution.

When experiment fails

If you fail to crystallize, try the following:

Solving the crystal

If an asymmetric unit contains 8 monomers, then the unit is composed of 8 copies of the protein chain.

Space Groups

Crystals and lattices can be classified into several space groups based on how they fill space in a crystal lattice.

The combination of the 14 Bravais lattices with the 32 point groups and additional translational components such as screw axes and glide planes gives a total of 230 space groups. Of these, only the 65 space groups without mirror planes and inversion centers are possible for protein crystals.

X-rays can interfere both constructively and destructively. Constructive interference occurs when the waves travel in unison, while destructive interference is the opposite. By unison, we mean having the same phase.

The pattern of diffraction allows direct determination of the unit cell and geometry (space group).

The resolution is calculated by: dmin = λ / (2 sin θ). A resolution of 2 angstroms or less is considered high resolution. Anything close to 6 angstroms is considered low resolution.
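As a worked example of the resolution formula, the sketch below computes dmin for a given wavelength and maximum diffraction angle θ. The wavelength of 1.5418 Å corresponds to a copper Kα source, which is an assumption made for the example; the angle is likewise an arbitrary illustrative value and the function name is hypothetical.

```python
import math

def d_min(wavelength_angstrom, theta_degrees):
    """Resolution d_min = wavelength / (2 * sin(theta)), in angstroms."""
    return wavelength_angstrom / (2.0 * math.sin(math.radians(theta_degrees)))

# Assumed example values: copper K-alpha radiation and a maximum theta of 22.7 degrees.
wavelength = 1.5418
theta_max = 22.7
print(f"d_min = {d_min(wavelength, theta_max):.2f} angstrom")
```

Reflections recorded at larger angles θ correspond to smaller dmin, which is why high resolution data are found at the outer edge of the diffraction pattern.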

Phase Problem

Once we have acquired the diffraction patterns, we need to calculate an electron density map from them. The process requires three pieces of information:

  1. wavelength λ of the incident x-rays - this is already known
  2. amplitude of the scattered x-rays - this can be determined by the intensity of the reflections
  3. phase of diffraction - this is not known and cannot be determined from the pattern of reflections.

Further experiments are usually necessary to determine diffraction phases. The standard approach is to produce heavy atom-containing isomorphous crystals. These crystals have the same structure but would produce alternative diffraction patterns. This is achieved by soaking the protein crystals in a heavy metal salt solution so that the heavy metal atoms diffuse into spaces originally occupied by the solvent. By comparing the reflections generated by several different isomorphous crystals (MIR - multiple isomorphous replacement) the positions of the heavy atoms can be worked out and this allows the diffraction phases in the unsubstituted crystal to be deduced. [2]

Using the MIR process we acquire:

The phase of the protein can then be estimated from these three amplitudes and one phase. The phase information is then used to construct an electron density map by means of a Fourier transform.
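Why the phases matter can be illustrated with a one-dimensional toy crystal. The sketch below is not a crystallographic program: it places two hypothetical point "atoms" in a 1D unit cell, computes their structure factors, and then rebuilds the density by Fourier synthesis, once with the correct phases and once with amplitudes only. All names, positions and weights are made up for the illustration.

```python
import cmath
import math

def structure_factors(atoms, h_max):
    """F(h) = sum over atoms of f * exp(2*pi*i*h*x), with fractional coordinate x."""
    return {h: sum(f * cmath.exp(2j * math.pi * h * x) for x, f in atoms)
            for h in range(-h_max, h_max + 1)}

def density(factors, n_points):
    """Fourier synthesis: rho(x) = sum over h of F(h) * exp(-2*pi*i*h*x)."""
    grid = [k / n_points for k in range(n_points)]
    return [sum(F * cmath.exp(-2j * math.pi * h * x)
                for h, F in factors.items()).real
            for x in grid]

def peak_position(rho):
    """Fractional coordinate of the highest density value on the grid."""
    return max(range(len(rho)), key=rho.__getitem__) / len(rho)

# Hypothetical toy structure: (fractional position, scattering weight).
atoms = [(0.20, 6.0), (0.65, 8.0)]
F = structure_factors(atoms, h_max=10)

# Amplitudes with the correct phases reproduce the atom positions.
peak = peak_position(density(F, 100))

# Amplitudes alone (all phases set to zero) give a misleading map.
F_amplitudes_only = {h: abs(Fh) for h, Fh in F.items()}
peak_wrong = peak_position(density(F_amplitudes_only, 100))

print("highest peak with phases:   ", peak)
print("highest peak without phases:", peak_wrong)
```

With the phases included, the highest peak sits on the heavier toy atom; with amplitudes only, the highest peak moves to the origin, where no atom was placed. Measured intensities provide only the amplitudes, which is why the phases have to be obtained separately.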

Finally, a structural model is built into the electron density map. This requires one more crucial piece of information - the amino acid sequence - because C, O, N atoms cannot be distinguished with certainty by x-ray diffraction so amino acid side chains are difficult to identify.

Assessing quality of an x-ray structure

To assess the quality of an x-ray structure, we evaluate:

A well refined crystal structure should have:

Source

[1] Lecture slides Dr. Leonardo Scapozza, University of Geneva
[2] Principles of Proteomics by R. M. Twyman
[3] Structural Bioinformatics by Bourne & Weissig

NMR - Nuclear Magnetic Resonance

Nuclear magnetic resonance is a phenomenon that occurs because some atomic nuclei have magnetic properties. In NMR, these properties are utilized to obtain chemical information. Subatomic particles can be thought of as spinning on their axes, and in many atoms these spins balance each other such that the nucleus itself has no overall spin. However, these spins do not balance out in 1H, 13C, 15N, 19F and 31P. Such nuclei can have one of two possible half spins both of which have the same energy.

Nuclei with half spin behave like small magnets. When placed in a constant magnetic field, they tend to align themselves with the field. The field splits the energy level in two: in the lower-energy orientation the nucleus is aligned with the magnetic field, and in the higher-energy orientation it is aligned against it. Where such energy separations exist, nuclei can be induced to jump from the lower-energy spin state to the less favorable higher-energy state when exposed to radio waves of a certain frequency. This absorption is called resonance because the frequency of the radio waves coincides with the frequency at which the nuclear spin precesses about the field (the Larmor frequency). When the nuclei flip back to their original orientations, they emit radio waves that can be measured. Protons (1H) give the strongest signals, and this is the basis of protein structural analysis by NMR spectroscopy.

To create this constant magnetic field, NMR spectrometers contain superconducting magnets. Superconducting magnets have no electrical resistance and therefore no current loss, but they require cooling to almost absolute zero. As a result, NMR is very expensive.

The energy input that makes the nuclei resonate is delivered as a radio frequency (RF) pulse. Different effects are measured depending on the length of the pulses and the delay between them, so an NMR spectrometer can generate a large variety of spectra. Depending on the frequency of the RF pulse, different nuclei can be detected. After the pulse, the nuclei return to their ground energy state. As they do so, they precess back to their starting position, and this precession induces a current which is detected by a coil in the NMR spectrometer. The signal recorded during this return to the ground state is called the free induction decay (FID).

A free induction decay (FID) is the observable NMR signal generated by non-equilibrium nuclear spin magnetisation precessing about the magnetic field (conventionally along z). This non-equilibrium magnetisation is generally created by applying a pulse of resonant radio-frequency close to the Larmor frequency of the nuclear spins. [5]

The measured FID is a superposition of decaying signals, one per resonance. A Fourier transform is used to convert the FID from the time domain to the frequency domain, which gives the NMR spectrum.
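
The time-domain to frequency-domain step can be illustrated with a small sketch. It builds a synthetic FID from two decaying oscillations and applies a naive discrete Fourier transform. The resonance frequencies (100 Hz and 250 Hz), the decay constant, and the sampling settings are invented for illustration and do not correspond to any real spectrometer or sample.

```python
import cmath
import math

def synthetic_fid(freqs_hz, t2_s, n_points, dwell_s):
    """Sum of exponentially decaying complex oscillations, one per resonance."""
    fid = []
    for k in range(n_points):
        t = k * dwell_s
        decay = math.exp(-t / t2_s)
        fid.append(sum(decay * cmath.exp(2j * math.pi * f * t) for f in freqs_hz))
    return fid

def dft_magnitude(signal):
    """Naive O(n^2) discrete Fourier transform; one magnitude per frequency bin."""
    n = len(signal)
    return [
        abs(sum(signal[k] * cmath.exp(-2j * math.pi * j * k / n) for k in range(n)))
        for j in range(n)
    ]

# Hypothetical acquisition settings: 200 points sampled every 1 ms.
n_points, dwell_s = 200, 0.001
fid = synthetic_fid([100.0, 250.0], t2_s=0.1, n_points=n_points, dwell_s=dwell_s)
spectrum = dft_magnitude(fid)

# Each bin j corresponds to the frequency j / (n_points * dwell_s).
bin_width_hz = 1.0 / (n_points * dwell_s)
peaks = sorted(sorted(range(n_points), key=lambda j: spectrum[j], reverse=True)[:2])
peak_freqs_hz = [round(j * bin_width_hz, 3) for j in peaks]
print(peak_freqs_hz)
```

With these settings the two strongest bins should fall at the two input frequencies. Real processing software uses a fast Fourier transform and additional steps such as apodization and phasing, which are left out here.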

Steps in NMR spectroscopy

  1. NMR experiments
  2. data collection
  3. spectrum assignment
  4. structure calculation

NMR spectroscopy is used to determine the structures of proteins in solution, and this requires proteins to be both highly soluble and stable.

One dimensional NMR experiments can detect chemical shifts and other shielding effects such as spin-spin coupling. One dimensional NMR is generally insufficient to characterize complex molecules like proteins. Therefore, instead of a single radio pulse, a sequence of pulses separated by different time intervals is used, which gives a two dimensional NMR spectrum with additional peaks indicating pairs of interacting nuclei. Three types of interactions can be measured by using different pulse sequences.

COSY - correlation spectroscopy
It detects sets of protons interacting through bonds, i.e. protons linked to adjacent bonded pairs of C and N atoms allowing us to trace a network of protons linked to bonded atoms.

TOCSY - total correlation spectroscopy
It detects groups of protons interacting through a coupled network, not just those joined to adjacent bonded pairs of C or N atoms. TOCSY can often identify all the protons associated with a particular amino acid, but cannot spread to adjacent residues because there are no protons in the carboxyl portion of the peptide bond.

NOESY - nuclear Overhauser effect spectroscopy
It takes advantage of the nuclear Overhauser effect, i.e. signals produced by magnetic interactions between nuclei that are close together in space but not associated by bonds. This is useful for determining protein structures because interactions can be identified between protons that are widely separated along the polypeptide backbone but close together in space due to the way in which the protein folds.

When these effects are taken into account, the result of NMR analysis is a set of distance constraints, which are estimated distances between particular pairs of atoms (either bonded or unbonded). If enough distance constraints are calculated, the number of protein structures that fit the data becomes finite. Thus NMR analysis produces 10-50 models instead of a unique structure. Good NMR resonance depends on the protein molecule tumbling rapidly in the solvent, which limits the size of proteins that can be analyzed to those with fewer than 300 residues.

Distance geometry, simulated annealing, and torsion angle dynamics are used to calculate structures.

X-ray crystallography tends to produce more accurate models than NMR, although where both methods have been applied to the same protein there appears to be excellent agreement in the structures. This is probably because protein crystals have large water contents and thus exist in a state similar to that of dissolved proteins. An important advantage of NMR is that it can measure the dynamics of each residue, and it can therefore distinguish between regions of the protein that vibrate and those that are disordered. NMR also provides positions of many hydrogen atoms, which is not possible with x-ray crystallography.

Advantages

Disadvantages

NMR spectroscopy is used routinely in high-throughput screens to determine protein:ligand interactions.[2][3] NMR is also a key tool in mechanistic enzymology and in studies of protein folding and stability.

Source

[1] Principles of Proteomics by R. M. Twyman
[2] Hajduk et al., 1999
[3] Shuker et al., 1996
[4] Structural Bioinformatics by Bourne & Weissig
[5] Wikipedia.org

Secondary Structure Prediction

Solving protein structures is labor-intensive and expensive. Protein structure prediction is a useful alternative even though it is less accurate. It is possible to predict secondary structures quite accurately, but tertiary structure prediction remains less accurate and often requires templates to model proteins.

Secondary Structure Prediction

The goal of secondary structure prediction is to assign or predict a secondary structure state (α, β, coil) for each residue of a given amino acid sequence. We need to predict protein structures since experimental methods are very time consuming. In addition, it is not possible to solve all protein structures experimentally, as explained in the x-ray crystallography and NMR sections. Secondary structure prediction is the first step towards structure determination. It is usually followed by tertiary structure determination.

Protein structure prediction vs structure assignment

Software such as DSSP and STRIDE assigns secondary structures to known 3D structures based on hydrogen bonding and backbone dihedral angles. Structure prediction uses a scoring system.

Secondary Structure Assignment

DSSP - Dictionary of Secondary Structure of Proteins
DSSP assigns secondary structures based solely on backbone-backbone H-bonds. The method defines an H-bond when the bond energy, computed from a Coulomb approximation, is below –0.5 kcal/mol. Assignments are defined such that visually appealing and unbroken structures result. There are 8 secondary structure classes:
- H: α-helix
- G: 3-10 helix
- I: π-helix
- E: extended β-strand
- B: isolated β-bridge
- T: hydrogen-bonded turn
- S: bend
- blank (often written C): loop or irregular
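
The Coulomb approximation behind the DSSP H-bond criterion can be sketched as follows. The partial charges (0.42e and 0.20e) and the factor 332 are the values usually quoted for the original DSSP method and should be checked against the DSSP documentation before being relied on. The atom coordinates in the example are invented.

```python
import math

Q1_TIMES_Q2 = 0.42 * 0.20  # partial charges on C=O and N-H, in units of e (quoted DSSP values)
F = 332.0                  # conversion factor: energies in kcal/mol for distances in angstroms
HBOND_CUTOFF = -0.5        # kcal/mol, the threshold given in the text

def dssp_hbond_energy(c, o, n, h):
    """Electrostatic energy between a backbone C=O group and a backbone N-H group."""
    return Q1_TIMES_Q2 * F * (
        1.0 / math.dist(o, n) + 1.0 / math.dist(c, h)
        - 1.0 / math.dist(o, h) - 1.0 / math.dist(c, n)
    )

def is_hbond(c, o, n, h):
    return dssp_hbond_energy(c, o, n, h) < HBOND_CUTOFF

# Invented, collinear C=O ... H-N geometry with an O...H distance of 1.9 angstroms.
c, o = (0.0, 0.0, 0.0), (1.23, 0.0, 0.0)
e_close = dssp_hbond_energy(c, o, n=(4.14, 0.0, 0.0), h=(3.13, 0.0, 0.0))

# The same groups moved 3 angstroms further apart.
e_far = dssp_hbond_energy(c, o, n=(7.14, 0.0, 0.0), h=(6.13, 0.0, 0.0))

print(round(e_close, 2), is_hbond(c, o, (4.14, 0.0, 0.0), (3.13, 0.0, 0.0)))
print(round(e_far, 2), is_hbond(c, o, (7.14, 0.0, 0.0), (6.13, 0.0, 0.0)))
```

The close pair falls well below the cutoff and counts as an H-bond, while the distant pair does not.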

STRIDE (secondary STRuctural IDEntification method)
STRIDE uses an empirically derived H-bond energy and phi-psi torsion angle criteria to assign secondary structures. Torsion angles are given alpha-helix and beta-sheet propensities according to how close they are to their regions in the Ramachandran plot. The parameters are optimized to mirror visual assignments made by crystallographers for a set of proteins.

Other methods
SECSTR: Same family of methods, developed specifically to improve the detection of π-helices
DEFINE, PSEA: Relies on Cα coordinates only
P-CURVE: Based on definition of helicoidal parameters
KAKSI: Based on Cα distances and torsion angles

There are several legitimate ways to define secondary structures. Different methods provide different assignments, especially at the edges of secondary structure segments. The percentage of agreement between DSSP, P-CURVE and DEFINE is only 63%. The resolution of a structure appears to have a moderate effect on assignments. The technique used (X-ray vs NMR) has a more pronounced effect.

Signals for alpha helices

Signals for coils

Structure Prediction

Accuracy of prediction is measured by Q3 (per-residue prediction accuracy) or SOV, the segment overlap value (per-segment prediction accuracy). Q3 gives the percentage of residues correctly predicted in the α, β and other states. SOV tells how well the secondary structural elements have been predicted. It measures:
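
As a minimal sketch of the per-residue measure, Q3 can be computed by comparing a predicted three-state string with the observed one. The two strings below are invented examples using H (helix), E (strand) and C (other).

```python
def q3(predicted, observed):
    """Percentage of residues whose predicted state (H, E or C) matches the observed state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

# Invented 17-residue example with two mispredicted residues.
observed = "CCHHHHHHCCEEEEECC"
predicted = "CCHHHHHCCCEEEEHCC"
score = q3(predicted, observed)
print(round(score, 1))
```

Both errors here sit at the ends of secondary structure elements. Q3 counts them like any other residue, which is one reason a segment-based measure such as SOV is reported alongside it.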

First generation structure prediction tools were knowledge-based. They used single-residue statistics from databases of limited size and relied on the preference of particular residues for certain secondary structure elements. Overall, they have < 55% Q3 accuracy.

The second generation structure prediction tools use machine learning. They use larger databases, produce segment-based statistics, and take neighboring residues into account. The algorithms use statistical information, sequence patterns, neural networks, etc. ALB, COMBINE, and GORIII are such methods, which have roughly 60% Q3 accuracy.

The third generation structure prediction tools use evolutionary information as well. PHD and PSIPRED have approximately 75% (±11%) Q3 accuracy.

This accuracy holds for globular, water-soluble proteins when the multiple sequence alignment (MSA) contains diverse sequences. There is occasional confusion between helix (H) and strand (E). Most methods predict the central regions of secondary structure elements better than their caps.

Secondary structure prediction tools are useful for

There are many secondary structure prediction tools such as mPredict, PSIPRED, PREDATOR, etc.

Homology Modeling

There are three principal methods for predicting the 3D structure of a protein:

Homology Modeling


The ultimate goal of protein modeling is to predict a structure from its sequence with an accuracy that is comparable to the best results achieved experimentally. [2] Homology modeling is also referred to as comparative protein modeling or knowledge-based modeling. The idea behind homology modeling is to use experimental 3D structures of related family members (templates) to calculate a model for a new sequence (target). Homology modeling is based on two observations:

How accurate or reliable is homology modeling?
Homology models are classified into three zones in terms of their accuracy and reliability.

How good can homology modeling be?

Homology modeling involves the following steps:

Template Selection

In the safe homology modeling zone, the percentage identity between the sequence of interest and a possible template is high enough to be detected with simple sequence alignment programs such as BLAST. To identify hits, the program compares the target sequence to all the sequences of known structures in the PDB. [2] This gives us a probable set of templates. We choose the final template after finding structurally conserved regions among templates. Once a suitable template is found, we look into the PubMed database for the relevant fold to determine its biological role. We evaluate whether the biological/biochemical function of the proteins match. We also pay attention to resolution, experimental methods used, experimental conditions such as pH, ligands, cofactors, and the protein's family.

Similar sequence does not always imply similar structure. Identical sequence does not always imply identical structure.

Alignment

There are three principal techniques for alignment:

The quality of the sequence alignment is of crucial importance. No current comparative modeling method can recover from an incorrect alignment. Misplaced gaps, representing insertions or deletions, will cause residues to be misplaced in space. Careful inspection and manual adjustment of an automatic alignment may improve the quality of the model.

Model Building

There are three ways to build a protein model:

Template based fragment assembly

SWISS-MODEL can be used for assembling fragments. This method involves assembling rigid fragments from homologous proteins of known structure. First we find structurally conserved core regions and construct an averaged backbone of all templates to build a model core. Then we model loops and side chains.

backbone generation
When the alignment is ready, the actual modeling can start. Creating the backbone is trivial for most of the model. Simply copy the coordinates of those residues that show up in the alignment with the model sequence. If two aligned residues differ, only the backbone coordinates can be copied. If they are the same, side chains can also be included. [2]
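
The copying rule described above can be sketched in a few lines. This is an illustration only, not how any particular modeling package does it: the function name, the data layout (one dictionary of atom coordinates per template residue), and the toy coordinates are all assumptions made for the example.

```python
BACKBONE = ("N", "CA", "C", "O")

def copy_from_template(target_aln, template_aln, template_residues):
    """Build a crude model from a pairwise alignment ('-' marks a gap).

    template_residues holds one {atom_name: (x, y, z)} dict per template residue,
    in sequence order. Returns a list of (residue_letter, atoms_or_None).
    """
    model = []
    t_idx = 0
    for tgt, tpl in zip(target_aln, template_aln):
        if tpl == "-":
            if tgt != "-":
                model.append((tgt, None))  # no template coordinates: left for loop modeling
            continue
        atoms = template_residues[t_idx]
        t_idx += 1
        if tgt == "-":
            continue                       # template residue with no counterpart in the target
        if tgt == tpl:
            model.append((tgt, dict(atoms)))  # identical residue: copy side chain too
        else:
            backbone_only = {a: xyz for a, xyz in atoms.items() if a in BACKBONE}
            model.append((tgt, backbone_only))
    return model

def fake_residue(offset):
    """Invented coordinates for backbone atoms plus CB, for demonstration only."""
    names = BACKBONE + ("CB",)
    return {name: (float(offset), float(i), 0.0) for i, name in enumerate(names)}

template_residues = [fake_residue(i) for i in range(4)]  # template sequence ATKL
model = copy_from_template("ASG-L", "AT-KL", template_residues)
for letter, atoms in model:
    print(letter, sorted(atoms) if atoms else "to be modeled")
```

In this toy alignment the first and last residues are copied completely, the S/T mismatch keeps only backbone atoms, and the G aligned to a gap has no coordinates yet.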

loop modeling
In the majority of cases, the alignment between model and template sequence contains gaps, either in the model sequence or in the template sequence. If there are gaps in the model sequence, we simply omit residues from the template, creating a hole in the model that must be closed. If there are gaps in the template sequence, we take the continuous backbone from the template, cut it, and insert the missing residues. Both cases imply a conformational change of the backbone. Conformational changes cannot happen within regular secondary structures, meaning that they must be in loops or turns. Predicting loops and turns is very difficult. [2]

There are two main methods to predict loops:

  1. Knowledge based: search for known loops with endpoints that match our residues from PDB and simply copy the loop conformation
  2. Energy based: use an energy function to judge the quality of the loop

Reliable loops can be built for up to 5-8 residues.

side-chain modeling
To model side chains we look for the most probable side chain conformation, using:

When we compare the side-chain conformations (rotamers) of residues that are conserved in structurally similar proteins, we find that they often have similar torsion angles about the Cα-Cβ bond. It is therefore possible to simply copy conserved residues from the template to the target. In practice, this is accurate only at high levels of identity. [2]

Only a small fraction of all possible side chain conformations is observed in experimental structures. This significantly reduces the complexity of the modeling problem. Rotamer libraries provide an ensemble of likely conformations. The propensity of rotamers depends on the backbone geometry. Side chain modeling depends heavily on rotamer libraries.

Energy minimization
Modeling often produces unfavorable bond lengths, bond angles, torsion angles and contacts. Therefore, it is important to minimize the energy to regularize local bond and angle geometry and to relax close contacts and geometric strain. It must however be noted that extensive energy minimization moves coordinates away from the real structure. It is therefore prudent to keep energy minimization steps to a minimum.

Satisfaction of spatial constraints

In this method, we

Restraints are distances, angles, dihedral angles, pairs of dihedral angles and some other spatial features defined by atoms.

Spatial restraints can be obtained from:

Models Evaluation

Errors in homology modeling arise from:

Source

[1] http://bmc.ub.uni-potsdam.de/1475-2859-4-20/
[2] Structural Bioinformatics by Bourne & Weissig
[3] Lecture notes of Dr. Lina Yip

Ab Initio Modeling

Ab initio structure prediction seeks to predict the native conformation of a protein from the amino acid sequence alone. Comparative modeling depends on finding a suitable template structure. In the absence of a suitable structure, ab initio prediction is the only method. A typical procedure would be to define a mathematical representation of a polypeptide chain and the surrounding solvent, define an energy function that accurately represents the physicochemical properties of proteins, and use an algorithm to search for the chain conformation that possesses the minimum free energy. The problem with ab initio methods is that even short polypeptide chains can fold into a potentially infinite number of structures. [1]

Background

There are three different views of proteins:

This strongly suggests that not all possible conformations have been tried in nature. In fact, nature tends to recycle what works.

We know that:

Based on the above, can we predict protein structures from protein sequences alone (ab initio)?

Factors affecting protein fold:

When proteins move from unfolded to folded conformations, they move from a higher-energy state to a lower-energy state.

Successful structure prediction requires a free energy function sufficiently close to the true potential for the native state to be at one of the lowest energy minima, as well as a method for searching conformational space for low energy minima. Ab initio structure prediction is challenging because current potential functions have limited accuracy, and the conformational space to be searched is vast. Many methods use reduced representations, simplified potentials, and coarse search strategies in recognition of this resolution limit. [2]

Representing a polypeptide chain

The most detailed representations include all atoms of the protein and the surrounding solvent molecules. However, representing this large number of atoms and the interaction between them is quite computationally expensive, and it is not clear that this level of detail is necessary during the phase of the search far from the native conformation. To streamline the calculations, representations can be simplified in a variety of ways such as reducing the size of the conformational space. [2]

Potential Functions

There are two categories of potentials that may be employed in evaluating the free energy of the peptide chain and the surrounding solvent.

Molecular mechanics describes the interactions of atoms or groups.

Search Methods

In searching, as in selecting the appropriate level of detail in the representation and in the potential, one must choose the granularity of the search based on the resolution desired from the method. Molecular dynamics is often used for this, as it models changes in conformation over time using a force field. A single search is most likely to end in a local minimum. Therefore, several runs are needed to have a chance of finding the global minimum.
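
The difference between a single search and repeated searches can be shown on a toy problem. The one-dimensional "energy" function below is invented for illustration and has nothing to do with a real force field; plain gradient descent stands in for the search method, which is a simplification of what molecular dynamics or Monte Carlo methods do.

```python
def energy(x):
    """Toy energy landscape with a local minimum near x = 1.13 and a global one near x = -1.30."""
    return x**4 - 3 * x**2 + x

def gradient(x):
    return 4 * x**3 - 6 * x + 1

def minimize(x, step=0.01, n_steps=2000):
    """Plain gradient descent: always moves downhill, so it stops in the nearest minimum."""
    for _ in range(n_steps):
        x -= step * gradient(x)
    return x

# A single search started at x = 2.0 gets trapped in the local minimum.
single_x = minimize(2.0)

# Several searches from different (hypothetical) starting points; keep the lowest energy found.
starts = [-2.0, -1.0, 0.0, 1.0, 2.0]
best_x = min((minimize(s) for s in starts), key=energy)

print(round(single_x, 3), round(energy(single_x), 3))
print(round(best_x, 3), round(energy(best_x), 3))
```

For a protein the landscape has a very large number of dimensions and minima, so multiple runs raise the chance of reaching the global minimum but do not guarantee it.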

Source

[1] Principles of Proteomics by R. M. Twyman
[2] Structural Bioinformatics by Bourne & Weissig

Fold Recognition - Threading

In the ab initio modeling section, we saw that nature uses a relatively small number of possible polypeptide chain conformations. Nature favors energetically favorable conformations. A protein chain folds either by condensation around a nucleation site or through intermediate stages rich in secondary structures. This means that looking at all possible conformations is wasteful.

Fold recognition methods detect folds that can be used for structural modeling even without detectable homology at the sequence level. The principle of fold recognition is the identification of folds that are compatible with a given query sequence, i.e. instead of sequences being used to predict folds, the folds are fitted to the sequence. This involves:

- searching for known folds
- scoring folds
- identifying candidates that best fit the sequence
- aligning the query and the best-scoring proteins

Once such a template has been identified, the remainder of the process is the same as comparative modeling. Fold recognition methods are based on both sequence similarity searches and structural information.

Summary of Structure Prediction

Method | Knowledge | Approach | Difficulty | Usefulness
Secondary structure prediction | sequence-structure statistics | cannot do 3D; suitable for predicting H/E | medium | very useful; if sequence identity is greater than 40%, suitable for drug design
Homology modeling | proteins of known structure | identify related structures with sequence methods, copy 3D coordinates and adjust | easy | useful
Ab initio | energy functions, statistics | simulate folding or generate many candidate structures | very difficult | not yet useful
Fold recognition | proteins of known structure | identify folds, compare sequence, copy 3D coordinates and adjust | medium | limited, depending on the models

Source

[1] Principles of Proteomics by R. M. Twyman
[2] Structural Bioinformatics by Bourne & Weissig
[3] Lecture slides of Dr. Lina Yip

Structural Classification

Structure classification methods use structure alignments to help in the assignment of fold classes. Structure prediction methods require that the predicted structure be evaluated against a variety of template structures. Since structures are more conserved in evolution than sequences, structural alignments reveal distant sequence relationships not available from sequence alignments alone. Structural similarity is a more sensitive method than sequence alignment to determine protein function.

Quantifying Similarity

One way to quantify similarity is to superpose protein structures and calculate the distances between equivalent atoms. The distances are used to calculate the root mean square deviation (RMSD). RMSD measures the overall deviation of the atoms. It also amplifies large deviations in local regions of a protein. RMSD calculation usually involves only a subset of aligned atoms. The problem is to define this subset. Small RMSD using many atoms indicates a good structural alignment.
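
A minimal sketch of the RMSD calculation follows. It assumes the two structures are already superposed and that the lists pair up equivalent atoms; the coordinates are invented. RMSD is the square root of the mean squared distance between equivalent atoms.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of (x, y, z) positions."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

# Invented example: two atoms match exactly, the third is displaced by 3 angstroms.
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 3.0, 0.0)]

value = rmsd(structure_a, structure_b)
mean_deviation = sum(math.dist(a, b) for a, b in zip(structure_a, structure_b)) / 3
print(round(value, 3), round(mean_deviation, 3))
```

The RMSD here (about 1.73) is larger than the mean deviation (1.0) because squaring weights the single displaced atom more heavily. This is the amplification of local deviations mentioned above.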

Structural superposition: We know at least some of the residues that match between the two proteins.
Structural alignment: We don't know any residues that match between the two proteins.

The structural superposition problem can be solved by minimizing the least-squares RMSD. This requires finding the right transformation (rotation and translation) and can be solved in O(n) time. Structural alignment is an NP-hard problem. It requires comparing different proteins with different lengths. You can either compare both proteins directly or compare features separately.

Currently, there are a number of methods. Most of them can align the obvious features correctly but fail otherwise. Good alignments are rare and the software is slow.

Structural Alignment is a three step process:

  1. structural description of proteins A and B for comparison: sequence, secondary structure, structural attributes of individual amino acids, distances between amino acids in the proteins
  2. optimise the alignment between A and B
    • point based methods
    • using vectors to represent secondary structures
    • computational methods - dynamic programming, heuristic, genetic algorithms, etc.
  3. measure the statistical significance of the alignment against some random set of structure comparisons

Functional annotation based on protein structure requires a rigorous and standardized system for the classification of different structures. Several different hierarchical classification schemes have been established, which divide proteins first into general classes based on the proportion of various secondary structures they contain, then into successively more specialized groups based on how those structures are arranged. These schemes are implemented in databases such as FSSP, CATH and SCOP. [1]

These databases classify differently:

Sometimes they classify the same protein differently. Further confusion is caused by structures which appear very often (superfolds). It is difficult to know whether a given superfold is homologous or analogous.

Source

[1] Principles of Proteomics by R. M. Twyman
[2] Structural Bioinformatics by Bourne & Weissig

Structure Quality Assurance

Why do we need structure quality assurance?
Everything we know about protein structures comes from the PDB. PDB structures are used as templates to predict other structures. If the template is wrong, then the model will also be wrong. Even though structures are determined experimentally, the result of the experiment is a model. Models can be accurate or wrong. In addition, experiments have associated errors.

An x-ray crystallography model can contain chain connectivity, frame-shift, or fitting errors. As a rule of thumb, an accurate x-ray structure has a resolution better than 2 angstroms and an R-factor below 0.20.

Good parameters for structure validation
- Testable on real-space coordinates and/or crystallographic data
- Strongly correlated to structure quality
- Independent of the refinement process
- Automated and not too time-consuming computationally

Parameters to look at:
Experimental data
- R-factor, free R factor
- B-values
Basic geometry
- Bond lengths and angles
- Planarity (peptide planes, rings in side chains (His, Phe, Tyr, Trp))
Dihedral angles
- φ, ψ (Ramachandran plot), ω
- χ angles for side chains (rotamer library)
- Other dihedral angles (Cα)
Assessing local environment
- VdW interactions
- Packing
- Hydrogen bonding

Experimental Data

R-factor, Rfree
Measure of the difference between the structure factors calculated from the model and those obtained from experimental data. i.e. a measure of the differences in the observed and computed diffraction patterns.
High value -> poorer agreement, low value -> better agreement
R-factor < 0.2 is desired
R-factor values in the range 0.4 to 0.6 can be obtained from a totally random structure.
Rfree tends to be higher than R
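
As a hedged illustration, the conventional R-factor is the sum of the absolute differences between observed and calculated structure factor amplitudes, divided by the sum of the observed amplitudes. The amplitudes below are invented. Rfree uses the same formula but only over a small set of reflections held out of refinement, which is not shown here.

```python
def r_factor(f_obs, f_calc):
    """R = sum(| |Fobs| - |Fcalc| |) / sum(|Fobs|) over all reflections."""
    if len(f_obs) != len(f_calc) or not f_obs:
        raise ValueError("need two non-empty amplitude lists of equal length")
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    denominator = sum(abs(o) for o in f_obs)
    return numerator / denominator

# Invented structure factor amplitudes for four reflections.
f_obs = [100.0, 80.0, 60.0, 40.0]
f_calc = [90.0, 85.0, 66.0, 35.0]

r = r_factor(f_obs, f_calc)
print(round(r, 3))
```

For this toy data R is about 0.093, which would fall inside the desired range of R < 0.2.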

B-factor
Closely related to the positional errors of the atoms
Larger B-factor -> larger positional uncertainty

Basic Geometry

Deviation from ideal bond lengths and angles.

Dihedral Angles

Assessment of φ and ψ values with reference to the Ramachandran plot. Good structures show tight clustering in the most favored regions. Measure the % of residues in favored regions, with the exception of glycine (G) and proline (P).
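
The φ and ψ values themselves are dihedral angles defined by four consecutive backbone atoms: C(i-1), N, Cα, C for φ and N, Cα, C, N(i+1) for ψ. The sketch below computes a signed dihedral angle from four points with a standard atan2 formula. The coordinates are invented test geometries, not real backbone atoms.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond, in the range (-180, 180]."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Invented geometries around a central bond from (0,0,0) to (1,0,0).
trans = dihedral_deg((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0))
cis = dihedral_deg((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 0))
quarter = dihedral_deg((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 0, 1))
print(trans, cis, quarter)
```

Computing φ and ψ for every residue and plotting the pairs gives the Ramachandran plot used for validation.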

Assessing Local Environment: packing, bad contacts

Packing: Proteins in their native states are well packed.
- DACA makes use of threading potentials to calculate how well the sequence fits its structural environment.
- The Z-score tells us how well a residue fits with respect to its neighbors.
- ANOLEA calculates a non local energy for atom-atom contacts based on an atomic mean force potential. ANOLEA detects local packing errors and errors in alignment.
- Hydrogen bonds are a major stabilizing force. They can be studied for validation.

Bad contacts: where the distance between a pair of non-bonded atoms is smaller than the sum of their VdW radii.
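
A bad-contact check along these lines can be sketched as below. The radii are commonly quoted Bondi van der Waals radii; the function name, the input layout, and the example atoms are assumptions for illustration. Real validation programs use more refined rules, for example tolerances for hydrogen-bonded pairs and exclusion of atoms separated by two or three bonds.

```python
import math

# Commonly quoted Bondi van der Waals radii in angstroms.
VDW_RADII = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}

def bad_contacts(atoms, bonded_pairs):
    """Return (i, j, distance) for non-bonded atom pairs closer than the sum of their radii.

    atoms: list of (element, (x, y, z)); bonded_pairs: set of frozenset({i, j}) to skip.
    """
    clashes = []
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            if frozenset((i, j)) in bonded_pairs:
                continue
            (elem_i, xyz_i), (elem_j, xyz_j) = atoms[i], atoms[j]
            distance = math.dist(xyz_i, xyz_j)
            if distance < VDW_RADII[elem_i] + VDW_RADII[elem_j]:
                clashes.append((i, j, distance))
    return clashes

# Invented example: atoms 0 and 1 are covalently bonded; atoms 2 and 3 are too close.
atoms = [
    ("C", (0.0, 0.0, 0.0)),
    ("O", (1.23, 0.0, 0.0)),
    ("N", (5.0, 0.0, 0.0)),
    ("C", (7.0, 0.0, 0.0)),
]
clashes = bad_contacts(atoms, bonded_pairs={frozenset((0, 1))})
print(clashes)
```

Only the non-bonded N and C pair, 2.0 angstroms apart against a radius sum of 3.25 angstroms, is reported.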

Structure Validation Servers

- WHAT IF / WHATCHECK
- PROCHECK
- Verify3D
- VADAR
- ANOLEA
- ERRAT

WHAT IF / WHATCHECK

In a WHAT-CHECK report, each reported fact has an assigned severity:
Error: Severe errors encountered during the analysis
Warning: Either less severe problems or uncommon features
Note: Statistical values, plots, or other verbose results of tests and analyses

Molecular Recognition

Molecular recognition refers to the specific interactions between two or more molecules through noncovalent bonding such as hydrogen bonding, metal coordination, hydrophobic forces, van der Waals forces, etc. Here we will be talking about molecular recognition in terms of drug design. Molecular recognition for drug design involves target selection, lead discovery and lead development.

Target selection involves choosing the site or molecule we would like to modify.

A drug is a small molecule (ligand) able to bind a therapeutic target (enzyme, receptor, etc.) and modulate its activity.

The most important molecules in drug design are enzymes. The basic mechanism by which enzymes catalyze chemical reactions begins with the binding of the substrate (or substrates) to the active site on the enzyme. The active site is the specific region of the enzyme which combines with the substrate. The binding of the substrate to the enzyme causes changes in the distribution of electrons in the chemical bonds of the substrate and ultimately causes the reactions that lead to the formation of products. The products are released from the enzyme surface to regenerate the enzyme for another reaction cycle. [1]

The active site has a unique geometric shape that is complementary to the geometric shape of a substrate molecule, similar to the fit of puzzle pieces. This means that enzymes specifically react with only one or a very few similar compounds.

Lock and Key Model

The specific action of an enzyme with a single substrate can be explained using a Lock and Key analogy first postulated in 1894 by Emil Fischer. In this analogy, the lock is the enzyme and the key is the substrate. Only the correctly sized key (substrate) fits into the key hole (active site) of the lock (enzyme).

Induced Fit Model

Not all experimental evidence can be adequately explained by using the so-called rigid enzyme model assumed by the lock and key theory. The induced-fit theory assumes that the substrate plays a role in determining the final shape of the enzyme and that the enzyme is partially flexible. This explains why certain compounds can bind to the enzyme but do not react because the enzyme has been distorted too much. Other molecules may be too small to induce the proper alignment and therefore cannot react. Only the proper substrate is capable of inducing the proper alignment of the active site.

Molecular recognition is the collection of interactions between molecules that allow their binding. This depends on the nature of the interactions and the intensity of molecular recognition. The interactions include electrostatic interactions, van der Waals interactions, hydrophobic interactions and other noncovalent interactions. Other factors such as the solvation effect, π interactions, intramolecular changes upon binding, and entropy changes upon binding also affect the recognition strength.

Theoretical approaches for estimating binding affinities

Without the 3D structure of the complex we can use:

With the 3D structure of the complex, we can use:

QSAR - Quantitative Structure Activity Relationships

Drug design is an iterative process which begins with a compound that displays an interesting biological profile and ends with optimizing both the activity profile for the molecule and its chemical synthesis. The process is initiated when the chemist conceives a hypothesis which relates the chemical features of the molecule (or series of molecules) to the biological activity. Without a detailed understanding of the biochemical process(es) responsible for activity, the hypothesis generally is refined by examining structural similarities and differences for active and inactive molecules. Compounds are selected for synthesis which maximize the presence of functional groups or features believed to be responsible for activity. [2]

The combinatorial possibilities of this strategy for even simple systems can be explosive. As an example, the number of compounds required for synthesis in order to place 10 substituents on the four open positions of an asymmetrically disubstituted benzene ring system is approximately 10,000. The alternative to this labor intensive approach to compound optimization is to develop a theory that quantitatively relates variations in biological activity to changes in molecular descriptors which can easily be obtained for each compound. A Quantitative Structure Activity Relationship (QSAR) can then be utilized to help guide chemical synthesis. [2]

QSAR assumes that chemically similar ligands produce biologically similar responses. It also assumes that affinity is a function of the ligand's physico-chemical properties. The advantage of this technique is that we do not need any structural information about the target. This technique, however, requires knowledge of affinities for a series of ligands and knowledge of structurally related ligands or similar binding modes. A QSAR attempts to find consistent relationships between the variations in the values of molecular properties and the biological activity for a series of compounds so that these "rules" can be used to evaluate new chemical entities.

We take n structurally related molecules and make a table of their quantitative descriptors vs. measured activities. Descriptors can be volume, electrostatics, hydrophobicity, etc. We then use this to calculate ΔG for binding. These molecules serve as the training set.
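
The simplest form of such a relationship is a straight line fitted to one descriptor. The sketch below fits activity against a single descriptor by ordinary least squares and uses the fit to estimate the activity of a new compound. The descriptor values and activities are invented, and real QSAR models use several descriptors together with validation against held-out compounds.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x

# Invented training set: one hydrophobicity-like descriptor and a measured activity.
descriptor = [1.0, 2.0, 3.0, 4.0]
activity = [5.1, 5.9, 7.1, 7.9]

slope, intercept = fit_line(descriptor, activity)

def predict(x):
    return slope * x + intercept

print(round(slope, 2), round(intercept, 2), round(predict(5.0), 2))
```

The prediction for a descriptor value of 5.0 lies outside the range of the training set. Extrapolation of this kind is where such a model is least reliable.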

2D QSAR is limited to structurally related molecules. It requires experimental activity of a series of ligands, not ab initio studies. Like many methods based on training algorithms, it can suffer from overfitting. The use of a particular QSAR is limited to the descriptors used in the training set.

3D QSAR is the same as 2D QSAR with the addition of x,y,z coordinates of the atoms. It also needs experimental activity of a series of ligands. It is NOT limited to structurally related molecules.

Free Energy Simulation

Sources

[1] http://www.elmhurst.edu/~chm/vchembook/571lockkey.html
[2] http://www.netsci.org/Science/Compchem/feature19.html

Questions

If you have to take an exam on structural bioinformatics, the following are some questions and answers to help you.

Protein Structure

Draw Glycine and Proline

           H     Glycine
         α |        
  +H3N --- C --- COOH
           |
           H

          H     Proline
        α |        
   HN --- C ---COOH
    |     |
    CH2   CH2
     \   /
      CH2

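Draw amino acids at pH 7 and pH 3
At pH 3: NH3+, with the carboxyl group partly protonated (COOH) because the pH is close to its pKa of about 2
At pI and at pH 7: NH3+ and COO- (the zwitterion)
At high pH (above the amino pKa of about 9-10): NH2 and COO-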

List acidic amino acids
Acidic = negatively charged = ED

List basic amino acids
KRH

List polar amino acids
SCN WYTH

List hydrophobic amino acids
FW MAIL PGV

Draw Peptide Bond and describe its characteristics

The peptide bond is a covalent bond. Peptide bonds are usually formed in the trans conformation to prevent crowding. The peptide bond is rigid and planar. Therefore, the polypeptide chain can only rotate about the bonds formed by Cα. The rotation angles about these bonds have been termed the phi (φ) and psi (ψ) angles. The rotational freedom about the φ and ψ angles is limited by steric hindrance between the side chains of the residues and the peptide backbone. Consequently, the possible conformations of a given polypeptide chain are quite limited.

φ - Cα-N
ψ - Cα-C'

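Why does it have a double bond character?
Due to resonance: the lone pair of electrons on the amide nitrogen is delocalized into the carbonyl group, which gives the C-N bond partial double bond character.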

What does omega = 0 or omega = 180 do?
These are the two planar positions of the peptide bond, which is usually not free to rotate. Omega = 180 is the trans form (the normal position) and omega = 0 is the cis form.

What would happen if both phi and psi are 0
This is the hindered position. When both are 180, it is the fully extended position.

Explain Ramachandran plot. What is it used for?
A Ramachandran plot is a plot of φ vs. ψ angles. It maps the entire conformational space of a polypeptide and shows the allowed and disallowed conformations. Different amino acids have different preferences of φ-ψ angles.
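
Each point on such a plot comes from two dihedral angles computed from backbone coordinates: φ from the atoms C(i-1), N, Cα, C and ψ from N, Cα, C, N(i+1). The following is a minimal sketch of a dihedral calculation with NumPy. The function name and the test points are made up for illustration; real coordinates would come from a structure file.

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) about the p1-p2 bond."""
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # Project the two outer bonds onto the plane perpendicular to b1.
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return float(np.degrees(np.arctan2(y, x)))

# Hypothetical points: the central bond runs along the z axis.
p0 = np.array([1.0, 0.0, 0.0])
p1 = np.array([0.0, 0.0, 0.0])
p2 = np.array([0.0, 0.0, 1.0])
print(dihedral(p0, p1, p2, np.array([1.0, 0.0, 1.0])))   # eclipsed (cis)
print(dihedral(p0, p1, p2, np.array([0.0, 1.0, 1.0])))   # quarter turn
print(dihedral(p0, p1, p2, np.array([-1.0, 0.0, 1.0])))  # fully extended (trans)
```

Computing φ and ψ for every residue of a structure and scattering ψ against φ gives the Ramachandran plot.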

Some key exceptions to these conformational limitations can be attributed to glycine and proline. The single H side chain of Glycine greatly reduces steric hindrance and expands the possible conformational space. The cyclic bond present in proline reduces the conformational space.

The nature of protein sequence and composition reflects its function. Membrane proteins have more hydrophobic residues. Homologous proteins often have similar sequences. Sequence similarity often implies similar secondary and tertiary structures.

Why don’t we calculate the Ramachandran plot for proline?
Its conformational space is so limited that it cannot be accurately shown in a Ramachandran plot.

What is pKa
pKa is the negative log of the acid ionization constant (Ka). It measures the tendency of an ionizable group of an organic compound to donate a proton (H+) in an aqueous medium: the lower the pKa, the stronger the acid. pKa values of amino acid side chains play an important role in defining the pH-dependent characteristics of a protein, e.g. enzyme activity, protein stability, etc. Enzymes become active only under certain conditions.
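
The link between pKa and protonation state at a given pH is the Henderson-Hasselbalch relation. The sketch below computes the deprotonated fraction of a group. The two pKa values are approximate textbook numbers for the α-carboxyl and α-amino groups of a free amino acid and are assumptions here; side chains and groups inside a folded protein can differ considerably.

```python
def fraction_deprotonated(ph, pka):
    """Henderson-Hasselbalch: fraction of a group that has lost its proton."""
    return 1.0 / (1.0 + 10.0 ** (pka - ph))

# Approximate pKa values for a free amino acid (assumed for illustration).
PKA_CARBOXYL = 2.3
PKA_AMINO = 9.6

for ph in (1.0, 3.0, 7.0, 11.0):
    coo = fraction_deprotonated(ph, PKA_CARBOXYL)  # fraction present as COO-
    nh2 = fraction_deprotonated(ph, PKA_AMINO)     # fraction present as NH2
    print(f"pH {ph:4.1f}: COO- {coo:.3f}, NH2 {nh2:.3f}")
```

When pH equals pKa the group is exactly half deprotonated, which is why small pH changes near a pKa have the largest effect on charge.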

What are the 4 levels of protein structures
Primary: the amino acid sequence
Secondary: Local conformation of main-chain atoms (φ and ψ angles), i.e. how the amino acids in sequence fold up locally. Examples: helix and strand.
Tertiary: 3-D folding or arrangement of the secondary structural elements and connecting loops in space. Stabilized by vdW, hydrophobic effect, H-bonds, salt bridges, metal coordination, disulfide bonds
Quaternary: 3-D arrangement of multiple subunits, each with a tertiary structure and each a unique gene product. Often symmetrical.

What are domains and motifs
Motifs: limited number of secondary structure elements combined into simple folds.
Domains: several motifs packed in a specific, compact arrangement that in many cases can fold as an independent unit.

What are the different noncovalent forces
Van der Waals forces (0.01-0.2 kcal/mol):
Weak forces between molecules that are brought about by localized charge fluctuations. Can be attractive or repulsive. Major contributor to protein stability.

Hydrophobic effect:
The most powerful force stabilizing protein structure. Basis of force is entropy gain realized by burying hydrophobic residues. Residues are very tightly packed against one another in the protein core.

Hydrogen bonds (2-10 kcal/mol):
Involve the sharing of a hydrogen atom between two electronegative atoms (e.g., O, N). Directional.

Salt bridges (1-5 kcal/mol):
Involve the interaction of (+) and (-) charged side groups (i.e. basic and acidic residues). Strength is influenced by pH, ionic strength, and the local electrostatic environment. Long-range forces.

X-ray Crystallography

What are the steps in X-ray crystallography?

1. protein expression
2. protein purification
3. crystal production
4. x-ray diffraction & phasing
5. data collection - analysis of diffraction patterns
6. model construction - I = A² (intensity = square of amplitude)

How do we make crystals?
Crystal formation involves three steps:

1. Nucleation
2. Growth
3. Cessation of Growth

First we supersaturate a protein solution. The goal is to keep the solution in the metastable or labile zone for as long as possible. Nucleation occurs in the labile zone. Crystal growth occurs in the metastable zone. Vapor diffusion or a similar method is used to grow crystals. Cat whiskers are used for seeding crystal growth.

Why do we need to cool the crystal during a diffraction experiment?
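During the experiment, the x-ray beam damages the crystal (radiation damage) and heats it, which degrades the diffraction pattern while data are being collected. Cooling the crystal, typically to about 100 K, slows this damage and reduces the thermal motion of the atoms.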

What information do we obtain from a diffraction experiment? What do we not obtain?
We get a diffraction pattern which can then be used to solve a crystal, which requires calculating an electron density map. To calculate an electron density map from the diffraction patterns, we require three pieces of information:

1. wavelength λ of the incident x-rays - this is already known
2. amplitude of the scattered x-rays - this can be determined by the intensity of the reflections
3. phase of diffraction - this is not known and cannot be determined from the pattern of reflections.

What do we need to do to obtain the missing information?
Further experiments are usually necessary to determine diffraction phases. The standard approach is to produce heavy atom-containing isomorphous crystals. These crystals have the same structure but produce altered diffraction patterns. This is achieved by soaking the protein crystals in a heavy metal salt solution so that the heavy metal atoms diffuse into spaces originally occupied by the solvent. By comparing the reflections generated by several different isomorphous crystals (MIR - multiple isomorphous replacement), the positions of the heavy atoms can be worked out, and this allows the diffraction phases of the unsubstituted crystal to be deduced.

Using the MIR process we acquire:

- amplitude and phase of heavy atoms
- amplitude of protein
- amplitudes of protein and heavy metal

The phase of the protein can then be estimated from these three amplitudes and one phase. The phase information is then used to construct an electron density map by means of a Fourier transform.
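
Why the phases matter can be seen in a one-dimensional toy model. In the sketch below a made-up "electron density" is Fourier transformed into amplitudes and phases. The inverse transform recovers the density only when both are used. The density values are invented and a real map is three-dimensional, so this only illustrates the principle.

```python
import numpy as np

# Hypothetical 1D "electron density" sampled along one unit cell edge.
density = np.array([0.0, 0.0, 5.0, 1.0, 0.0, 0.0, 3.0, 0.0])

structure_factors = np.fft.fft(density)
amplitudes = np.abs(structure_factors)    # measurable: intensity = amplitude squared
phases = np.angle(structure_factors)      # lost in the diffraction experiment

# Inverse Fourier transform with and without the phase information.
with_phases = np.fft.ifft(amplitudes * np.exp(1j * phases)).real
without_phases = np.fft.ifft(amplitudes).real

print(np.round(with_phases, 3))     # reproduces the density
print(np.round(without_phases, 3))  # a different, symmetric function
```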

Finally, a structural model is built into the electron density map. This requires one more crucial piece of information - the amino acid sequence because C, O, N atoms cannot be distinguished with certainty by x-ray diffraction so amino acid side chains are difficult to identify.

Explain unit cells and space groups

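The unit cell is the smallest repeating unit from which the whole crystal can be built by translation alone. The asymmetric unit is the smallest part of the unit cell from which the complete cell can be generated by applying the symmetry operations of the space group. If an asymmetric unit contains 8 monomers, it contains 8 copies of the protein chain.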

Space Groups

Crystals and lattices can be classified into space groups based on the symmetry with which they fill space in a crystal lattice.

The combination of the 14 Bravais lattices with the 32 point groups and additional translational components such as screw axes and glide planes gives a total of 230 space groups. Of these, only the 65 space groups without mirror planes and inversion centers are possible for protein crystals, because proteins are chiral.

X-rays can diffract with both constructive and destructive interference. Constructive interference occurs when the waves travel in unison, while destructive interference is the opposite. By unison, we mean that the waves are in phase, so that their amplitudes add up.

The pattern of diffraction allows direct determination of the unit cell and geometry (space group).

The resolution is calculated by: dmin = λ / (2 sin θmax), where θmax is the largest angle at which reflections are still observed. A resolution of 2 angstroms or less is considered high resolution. Anything close to 6 angstroms is considered low resolution.
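
As a worked example of this formula, the sketch below evaluates the resolution limit for a given wavelength and maximum diffraction angle. The function name is made up. The wavelength used is that of a copper Kα source (about 1.5418 angstroms), and θ is half of the scattering angle 2θ measured at the detector.

```python
import math

def resolution_limit(wavelength_angstrom, theta_max_deg):
    """Bragg's law solved for d: d_min = wavelength / (2 sin(theta_max))."""
    return wavelength_angstrom / (2.0 * math.sin(math.radians(theta_max_deg)))

CU_K_ALPHA = 1.5418  # angstroms

for theta in (7.0, 15.0, 30.0):
    d = resolution_limit(CU_K_ALPHA, theta)
    print(f"theta_max = {theta:4.1f} deg -> d_min = {d:.2f} angstroms")
```

Reflections recorded at larger angles therefore carry the finer detail, and the theoretical limit at θ = 90 degrees is half the wavelength.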

What is the phase problem?
1. We know the wavelength of incident x-rays
2. We can determine the amplitude of scattered x-rays using the intensity of reflections
3. Phase is not known. MIR process is used to solve the phase problem.

How do we assess the quality of an x-ray crystallography structure?
From its resolution. Less than 2 angstroms is high resolution. Above 6 is low resolution.

Why can't we use electron microscope to determine crystal structures? Explain method's principle.
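The wavelength of the electrons in an electron microscope is much smaller than the distance between 2 atoms, so the wavelength is not the limiting factor. In practice, the achievable resolution has been limited by radiation damage to the sample and by lens aberrations. The method's principle defines the best resolution which can be achieved from a method. It is the wavelength divided by 2. If this is less than the distance between 2 atoms, atomic level resolution is possible in principle.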

What is R-factor and B-factor?
R-factor, Rfree
Measure of the difference between the structure factors calculated from the model and those obtained from experimental data. i.e. a measure of the differences in the observed and computed diffraction patterns.
High value -> poorer agreement, low value -> better agreement
R-factor < 0.2 is desired
R-factor values in the range 0.4 to 0.6 can be obtained from a totally random structure.
Rfree tends to be higher than R
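
The usual definition is R = Σ | |Fobs| - |Fcalc| | / Σ |Fobs|, summed over all reflections. A minimal sketch follows, with invented structure factor amplitudes and with the assumption that observed and calculated values are already on the same scale. Rfree uses the same formula, applied only to a small set of reflections that was held out of refinement.

```python
import numpy as np

def r_factor(f_obs, f_calc):
    """R = sum(| |Fobs| - |Fcalc| |) / sum(|Fobs|)."""
    f_obs = np.abs(np.asarray(f_obs, dtype=float))
    f_calc = np.abs(np.asarray(f_calc, dtype=float))
    return float(np.sum(np.abs(f_obs - f_calc)) / np.sum(f_obs))

# Invented amplitudes for three reflections.
observed = [10.0, 20.0, 30.0]
calculated = [12.0, 18.0, 33.0]
print(round(r_factor(observed, calculated), 3))
```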

B-factor
Closely related to the positional errors of the atoms
Larger B-factor -> larger positional uncertainty

X-ray crystallography - advantages and disadvantages
Advantages
high resolution
no protein mass limit

Disadvantages
crystals needed
structure is static average
H are usually not seen
possible artifacts due to crystal contacts and precipitation

NMR - advantages and disadvantages?
Advantages

* no chemical modification necessary
* protein in solution: no crystal packing artifacts,
* allows direct binding experiments, hydrodynamic and folding studies
* assignment of labile regions possible: no gaps in structure

Disadvantages

* protein in solution: protein has to be soluble
* insensitive method: requires high concentrations of proteins
* overlap: direct determination of 3D structures for small proteins only (150-200 residues)

Structure Classification

What are the protein structure classification databases? How do they classify proteins? Are the classifications conflicting? Why?
Functional annotation based on protein structure requires a rigorous and standardized system for the classification of different structures. Several different hierarchical classification schemes have been established, which divide proteins first into general classes based on the proportion of various secondary structures they contain, then into successively more specialized groups based on how those structures are arranged. These schemes are implemented in databases such as FSSP, CATH and SCOP. [1]

These databases classify differently:

* FSSP is implemented automatically using DALI
* CATH is semi-automatic, automated with SSAP but the results are curated
* SCOP is fully manual classification

Sometimes they classify the same protein differently. Further confusion is caused by structures which appear very often (superfolds). It is difficult to know whether a given superfold is homologous or analogous.

How is information classified in CATH?
1. Class: the overall secondary-structure content of the domain
2. Architecture: a large-scale grouping of topologies which share particular structural features
3. Topology: high structural similarity but no evidence of homology. Equivalent to a fold in SCOP
4. Homologous superfamily: indicative of a demonstrable evolutionary relationship. Equivalent to the superfamily level of SCOP.

You found a new protein? What should you do next?
Learn about your protein. Start by searching for homologs. BLAST -> PSI-BLAST -> structural alignment.

What is structure classification? Major steps? Motivation?
Protein classification refers to clustering proteins into protein families. It involves breaking a protein chain or complex into its constituent domains and assigning folds to domains. The motivation behind protein classification is the analysis of evolutionary mechanisms and providing data for protein structure prediction methods.

What are folds, domains, and motifs?
Folds refers to the arrangement and connectivity of secondary structure elements. Folds contain information on protein function and distant evolutionary relationships.

Domains are independent folding elements with their own hydrophobic core. Globular units. They are regions with distinct functions. They may be connected to each other rigidly, or loosely.

Motifs do not describe an overall structure. They are parts of a protein that can be found in many other proteins, sometimes with different folds, e.g. the ATP binding motif.

What are the problems in fold classification
1. structure space has a continuous aspect. Important to decide how to divide.
2. Russian doll effect. A continuous range of slight size differences will lead to clustering proteins of very different size.
3. motif overlap. A continuous range of overlapping common cores AB > BC > CD will lead to grouping proteins that have no common core.

Compare SCOP and CATH
The SCOP database describes structural and evolutionary relationships between proteins of known structure. The unit of classification is the protein domain. Classification is done by manual visual inspection and various automatic tools. Many levels exist in the hierarchy; the principal levels are family, superfamily and fold.

CATH more directed toward structural classification,
SCOP pays more attention to evolutionary relationships

In CATH, there is one class to represent mixed alpha-beta.
In SCOP, there are two:
α/β: beta structure is largely parallel, made of β-α-β motifs
α+β: alpha and beta structure segregated to different parts of structure

Identifying motifs - PROSITE.

Why can't we identify large proteins with NMR
Too many signals, which overlap in the spectrum

Can we detect lipids with NMR?
No. The molecules are too big.

What kind of atoms can you detect with NMR?
1H, 31P, 13C, 15N

Why are the atoms with half spin the best atoms for NMR?
They align themselves with the field when placed in a constant magnetic field

What is FID
Free induction decay

What is Fourier transform used for
To convert the measured time-domain signal (the FID) into a frequency-domain spectrum
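
In NMR processing, the Fourier transform takes the FID, a decaying signal recorded over time, and turns it into a spectrum with one peak per resonance frequency. The sketch below builds a synthetic FID from two made-up frequencies and recovers them with NumPy's FFT. The sampling rate, frequencies and decay time are arbitrary illustration values.

```python
import numpy as np

sampling_rate = 1000.0          # samples per second (assumed)
n_points = 1000
t = np.arange(n_points) / sampling_rate

# Synthetic FID: two resonances that decay exponentially.
resonances_hz = (50.0, 120.0)   # invented frequencies
decay_time = 0.2                # seconds
fid = sum(np.cos(2 * np.pi * f * t) for f in resonances_hz) * np.exp(-t / decay_time)

# Fourier transform: time domain -> frequency domain.
spectrum = np.abs(np.fft.rfft(fid))
frequency_axis = np.fft.rfftfreq(n_points, d=1.0 / sampling_rate)

# Pick the two strongest local maxima of the spectrum.
is_peak = (spectrum[1:-1] > spectrum[:-2]) & (spectrum[1:-1] > spectrum[2:])
peak_indices = np.where(is_peak)[0] + 1
strongest = peak_indices[np.argsort(spectrum[peak_indices])[-2:]]
peak_frequencies = sorted(float(f) for f in frequency_axis[strongest])
print(peak_frequencies)
```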

Summarize NMR
Purified proteins in a solution are placed inside a superconducting magnet. Atoms with half spins align themselves with the magnetic field. An RF signal induces these atoms to jump to an unfavorable state. When they jump back to a favorable state, they emit radio waves which can be measured. This decaying signal is called the FID.

NMR involves:

1. NMR experiment
2. Data collection
3. Spectrum assignment
4. Structure calculation

What is the purpose of secondary structure prediction?
The goal of secondary structure prediction is to assign or predict a secondary structure state (α, β, coil) given an amino acid sequence. We need to predict protein structures since experimental methods are very time consuming. In addition, it is not possible to solve all protein structures experimentally, as explained in the x-ray crystallography and NMR sections. Secondary structure prediction is the first step towards structure determination. It is usually followed by tertiary structure determination.

What is the difference between protein structure prediction and structure assignment
* Secondary Structure Assignment: You know the structure and you deduce the secondary structure from this structure.
* Secondary Structure Prediction: You don't know the structure and you deduce it from the sequence.

List structure assignment tools and explain how they assign secondary structures
DSSP: Uses backbone-backbone H-bonds.
Stride: Uses empirically derived H-bond energy and phi-psi torsion angles to assign structures. Optimized to mirror experimental structures
DEFINE: relies on Cα angles
P-CURVE: based on definition of helicoidal parameters
KAKSI: based on Cα and torsion angles

Characteristics of H and E
Signals for alpha helices
* characteristic hydrophobicity profiles
* prolines disrupt the middles of helices
* period of 3.6
* conserved hydrophobics at i, i+3, i+4, i+7

Signals for coils
* gapped in multiple alignments
* small polar residues (Ala, Gly, Ser, Thr)
* prolines rarer in other kinds of secondary structure

How do you measure accuracy of structure prediction
Accuracy of prediction is measured by the Q3 measure (per residue prediction accuracy) or SOV, segment overlap value (per segment prediction accuracy). Q3 gives the percentage of correctly predicted residues in α, β and other states. SOV tells how well the secondary structural elements have been predicted. It measures:

* number of segments in proteins
* average segment length
* distribution of number of segments with length
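
Q3 is simple enough to compute by hand. The sketch below compares a predicted and an observed three-state string (H = helix, E = strand, C = other). Both strings are invented. SOV is not implemented here because its segment-based definition is considerably more involved.

```python
def q3(predicted, observed):
    """Percentage of residues whose three-state assignment is predicted correctly."""
    if len(predicted) != len(observed):
        raise ValueError("strings must have the same length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

# Invented example: two of 17 residues are predicted wrongly.
observed_states = "CCHHHHHHCCEEEEECC"
predicted_states = "CCHHHHHCCCEEEECCC"
print(round(q3(predicted_states, observed_states), 1))
```

Note that a high Q3 can hide badly predicted segments, for example a helix broken into two pieces, which is what SOV is designed to catch.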

First generation structure prediction tools were knowledge-based. They used single residue statistics, databases of limited size, and preferred particular residues for certain secondary structure elements. Overall, they have < 55% Q3 accuracy.

The second generation structure prediction tools use machine learning. They use larger databases, produce segment-based statistics and take neighbors into account. The algorithms use statistical information, sequence patterns, neural networks, etc. ALB, COMBINE, and GORIII are such methods which have < 55% Q3 accuracy.

The third generation structure prediction tools use evolutionary information as well. PHD and PSIPRED have ~ 75% +/- 11 Q3 accuracy.

What are secondary structure prediction tools useful for?
* Chain tracing
* Starting point for 3D structure modelling
* Fold recognition
* Homology modelling
* Functional assignment

What is homology modeling?
The ultimate goal of protein modeling is to predict a structure from its sequence with an accuracy that is comparable to the best results achieved experimentally. The idea behind homology modeling is to use experimental 3D-structures of related family members (templates) to calculate a model for a new sequence (target). Homology modeling is based on two observations:

* Protein structure is entirely determined by its amino acid sequence
* Structure is more conserved than sequence over evolutionary periods, so similar sequences usually fold into similar structures

Homology modeling involves the following steps:
* target
* template selection
* alignment
* model building
* model evaluation

How accurate or reliable is homology modeling?
Homology models are classified into 3 zones in terms of their accuracy and reliability.

* Midnight Zone Less than 20% sequence identity. The structure cannot reliably be used as a template.
* Twilight Zone 20% - 40% sequence identity. Sequence identity may imply structural identity.
* Safe Zone 40% or more sequence identity. It is very likely that sequence identity implies structural identity.
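
These thresholds can be applied directly to a target-template alignment. The sketch below computes percent identity and maps it to a zone. Conventions for percent identity differ. Counting identical residues over the columns where neither sequence has a gap is an assumption made here, and the example sequences are invented.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Identical residues as a percentage of ungapped alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)

def homology_zone(identity_percent):
    """Map sequence identity to the zones listed above."""
    if identity_percent < 20.0:
        return "midnight"
    if identity_percent < 40.0:
        return "twilight"
    return "safe"

identity = percent_identity("ACDE-FG", "ACDQHFG")  # invented aligned pair
print(round(identity, 1), homology_zone(identity))
```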

How good can homology modeling be?
* 60 - 100% Comparable to medium resolution NMR. Supports studies of substrate specificity
* 30 - 60% Comparable to molecular replacement in crystallography. Support site-directed mutagenesis through visualization
* < 30% Serious errors

What is ab initio modeling?
Ab initio structure prediction seeks to predict the native conformation of a protein from amino acid sequence alone. Comparative modeling depends on finding a suitable template structure. In the absence of a suitable structure, ab initio prediction is the only method. A typical procedure would be to define a mathematical representation of a polypeptide chain and the surrounding solvent, define an energy function that accurately represents the physicochemical properties of proteins and use an algorithm to search for a chain conformation which possesses the minimum free energy. The problem with ab initio methods is that even short polypeptide chains can fold into a potentially infinite number of structures.

How do we quantify similarity between molecules?
RMSD
Small RMSD, many atoms = good alignment.

Structural alignment vs. structural superposition
Superposition assumes that one knows of at least some residues that match between protein structures A and B. Easy problem with exact solution. RMSD with translation and rotation. O(n) complexity.
Structural alignment: the matching residues are not known in advance. We must determine which atoms to align. NP-hard problem.
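
For the superposition case, where the matching atoms are known, the exact solution is commonly computed with the Kabsch algorithm: remove the translation by centering both point sets, then obtain the optimal rotation from a singular value decomposition. The following is a minimal NumPy sketch. The function name and the coordinates are made up, and the rows of both arrays are assumed to be matched atom by atom.

```python
import numpy as np

def kabsch(p, q):
    """Superpose p onto q; return the rotation matrix and the minimal RMSD."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p_centered = p - p.mean(axis=0)   # remove the translation
    q_centered = q - q.mean(axis=0)
    covariance = p_centered.T @ q_centered
    u, _, vt = np.linalg.svd(covariance)
    # Guard against an improper rotation (a reflection).
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    diff = p_centered @ rotation.T - q_centered
    rmsd = float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))
    return rotation, rmsd

# Invented coordinates; q is p rotated by 90 degrees about z and shifted.
p = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [5.0, 3.5, 0.0], [6.0, 4.0, 3.6]])
rot_z = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
q = p @ rot_z.T + np.array([10.0, -2.0, 4.0])
rotation, rmsd = kabsch(p, q)
print(np.round(rotation, 3), round(rmsd, 6))
```

The cost grows linearly with the number of matched atoms, since the decomposition is always performed on a 3 x 3 matrix.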

How can we solve structural alignment problems?
Two approaches:
1. compare two proteins directly
2. compare structural features of each protein separately
Most methods are able to identify obvious similarities easily.

Steps in structural alignment
1. Structural description of protein A and B
2. optimize the alignment between A and B
3. Measure the statistical significance of alignment against some random set of comparisons

Alignment algorithms
1. Point-based methods use points (distances) to establish correspondences
2. Secondary structure based methods use vectors representing secondary structures to establish correspondences

What are the advantages and disadvantages of the distance matrix approach
Advantages
- invariant with respect to rotation and translation
Disadvantages
- the distance matrix is O(n^2) in size for a protein with n residues
- comparing distance matrix is a hard problem
- insensitive to chirality
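
Both the main advantage and the chirality problem can be checked numerically. The sketch below builds the distance matrix of a few invented points and compares it with the matrices of a rotated and shifted copy and of a mirror image.

```python
import numpy as np

def distance_matrix(coords):
    """All pairwise distances between the rows of coords (n x n matrix)."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

# Invented coordinates for four points.
points = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [5.0, 3.5, 0.0], [6.0, 4.0, 3.6]])

rot_z = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
moved = points @ rot_z.T + np.array([10.0, -2.0, 4.0])   # rotation + translation
mirrored = points * np.array([-1.0, 1.0, 1.0])           # mirror image

same_after_move = np.allclose(distance_matrix(points), distance_matrix(moved))
same_after_mirror = np.allclose(distance_matrix(points), distance_matrix(mirrored))
print(same_after_move, same_after_mirror)
```

Both comparisons come out equal: the first is the desired invariance, the second shows that a structure and its mirror image cannot be told apart from distances alone.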

Structal
Uses dynamic programming (DP) iteratively to refine an arbitrary starting alignment.
STEPS:
1. Start with any set of correspondences between two structures (sequence alignment, secondary structure alignment…).
2. Compute a score matrix by computing a score between all pairs of points based on their distance.
3. Trace back through the score matrix to find a new set of correspondences that maximizes the score (standard DP)
4. Iterate 2 and 3 until score doesn’t change.
Note: The method is heuristic: no guarantees of success, depends on quality of starting structure.
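
The loop above can be sketched in a few dozen lines. This is a Structal-like illustration, not the original program. The score function M / (1 + (d / d0)^2), the values M = 20, d0 = 2.24 angstroms and a gap penalty of 10, the least-squares superposition before scoring, and the toy coordinates are all assumptions made for this example. The sketch also assumes that every iteration keeps at least three matched pairs.

```python
import numpy as np

# Assumed scoring parameters (not given in the text above).
M, D0, GAP = 20.0, 2.24, 10.0

def superposer(p, q):
    """Return a function that applies the least-squares fit of p onto q."""
    pc, qc = p.mean(axis=0), q.mean(axis=0)
    u, _, vt = np.linalg.svd((p - pc).T @ (q - qc))
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    return lambda x: (x - pc) @ rot.T + qc

def dp_align(score, gap=GAP):
    """Global dynamic programming over a score matrix; returns pairs and total."""
    n, m = score.shape
    best = np.zeros((n + 1, m + 1))
    best[:, 0] = -gap * np.arange(n + 1)
    best[0, :] = -gap * np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i, j] = max(best[i - 1, j - 1] + score[i - 1, j - 1],
                             best[i - 1, j] - gap,
                             best[i, j - 1] - gap)
    pairs, i, j = [], n, m
    while i > 0 and j > 0:  # trace back from the bottom-right corner
        if best[i, j] == best[i - 1, j - 1] + score[i - 1, j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i, j] == best[i - 1, j] - gap:
            i -= 1
        else:
            j -= 1
    return pairs[::-1], float(best[n, m])

def structal_like(a, b, start_pairs, max_iter=20):
    pairs, last = list(start_pairs), None
    for _ in range(max_iter):
        ia, ib = (list(idx) for idx in zip(*pairs))
        moved = superposer(a[ia], b[ib])(a)        # fit on the current pairs
        dist = np.linalg.norm(moved[:, None] - b[None, :], axis=-1)
        score = M / (1.0 + (dist / D0) ** 2)       # step 2: score matrix
        pairs, total = dp_align(score)             # step 3: new correspondences
        if last is not None and abs(total - last) < 1e-9:
            break                                  # step 4: score stopped changing
        last = total
    return pairs, total

# Toy data: b is a rotated, shifted copy of a with its first two points removed.
k = np.arange(12)
a = np.column_stack([3.8 * k, 5 * np.sin(0.5 * k), 5 * np.cos(0.3 * k)])
rot_z = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
b = a[2:] @ rot_z.T + np.array([5.0, -3.0, 2.0])
pairs, total = structal_like(a, b, start_pairs=[(2, 0), (3, 1), (4, 2), (5, 3)])
print(pairs, round(total, 3))
```

Starting from only four correct correspondences, the iteration extends the alignment to all ten shared points. A poor starting alignment can converge to a different answer, which is the heuristic caveat noted above.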

Dali - Distance Alignment
Uses distance matrix to find similar patterns of distances, indicating correspondences.
Find all pairs of matching hexapeptides in the two proteins to be compared
Concatenate matching hexapeptides using simulated annealing.
Aim: Maximize the number of atoms, minimize RMSD
The assembly step is done in a random fashion because the search space is too large

SARF2
Uses vectors associated with secondary structures to do quick screen for similar structures, followed by refinement of distances.
1. Identify secondary structure elements (SSEs) and represent them as vectors.
2. Compare between pairs of SSEs in different proteins
Pairs of SSEs having the same orientation are marked
Orientation between 2 SSEs is defined by 5 parameters
3. Search for the largest ensemble of the compatible pairs of SSEs
NP-complete problem, done by graph theory algorithms (maximum clique)
4. Extension and refinement of the match

CE - dynamic programming
COMPARER - secondary structure and hydrophobic clusters
VAST - secondary structure, monte carlo

How can we evaluate structural alignments?
1. Number of amino acid correspondences created.
2. RMSD of corresponding amino acids
3. Percent identity in aligned residues
4. Number of gaps introduced
5. Size of the two proteins
6. Conservation of known active site environments …

Most statistical significance scores are based on geometric properties. Some measure protein similarity. If the z-score is greater than 3, then the two structures are significantly similar (assuming normal distribution).
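
A z-score expresses how many standard deviations an alignment score lies above the mean score of a background set of comparisons. The following is a minimal sketch with invented scores. How the background set is chosen is method-specific and is not modeled here.

```python
import statistics

def z_score(score, background_scores):
    """Number of standard deviations by which score exceeds the background mean."""
    mean = statistics.mean(background_scores)
    sd = statistics.stdev(background_scores)
    return (score - mean) / sd

# Invented scores from comparisons against unrelated structures.
background = [10.0, 12.0, 14.0, 16.0, 18.0]
z = z_score(30.0, background)
print(round(z, 2), "significant" if z > 3 else "not significant")
```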

Multiple structure alignment servers
MASS, multiprot, SSM, Mammoth