Questions



If you have to take an exam on structural bioinformatics, the following questions and answers may help.

Protein Structure

Draw Glycine and Proline

           H     Glycine
         α |        
  +H3N --- C --- COOH
           |
           H

          H     Proline
        α |        
   HN --- C ---COOH
    |     |
    CH2   CH2
     \   /
      CH2

Draw amino acids at pH 7 and pH 3
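At low pH (below about 2, the α-carboxyl pKa): NH3+ and COOH. At pH 3 the carboxyl group is already mostly COO-.
At pI and at pH 7: NH3+ and COO- (zwitterion)
At high pH (above about 10, the α-amino pKa): NH2 and COO-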

List acidic amino acids
Acidic = negatively charged = ED

List basic amino acids
KRH

List polar amino acids
SCN WYTH (plus Q)

List hydrophobic amino acids
FW MAIL PGV

Draw Peptide Bond and describe its characteristics

The peptide bond is a covalent bond. Peptide bonds usually adopt the trans conformation to prevent crowding. The peptide bond is rigid and planar. Therefore, the polypeptide chain can only rotate about the bonds formed by Cα. The rotations about these bonds are termed the phi (φ) and psi (ψ) angles. The rotational freedom about the φ and ψ angles is limited by steric hindrance between the side chains of the residues and the peptide backbone. Consequently, the possible conformations of a given polypeptide chain are quite limited.

φ - rotation about the N-Cα bond
ψ - rotation about the Cα-C' bond
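
As a rough illustration, a torsion angle can be computed from the coordinates of four consecutive atoms. The sketch below assumes the usual backbone definitions (φ from C'(i-1), N(i), Cα(i), C'(i); ψ from N(i), Cα(i), C'(i), N(i+1)) and the sign convention in which a clockwise rotation, viewed along the central bond, is positive. The helper names and the test coordinates are made up for the example.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees about the p1-p2 bond."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Hypothetical usage with backbone atom coordinates:
#   phi = dihedral(C_prev, N, CA, C)
#   psi = dihedral(N, CA, C, N_next)
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

The same function gives ω when applied to Cα(i), C'(i), N(i+1), Cα(i+1).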

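Why does it have a double bond character?
Because of resonance: the lone pair on the amide nitrogen is delocalized into the carbonyl group, so the C-N bond has partial double bond character and cannot rotate freely.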

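What does omega = 0 or omega = 180 do?
Omega (ω) is the torsion angle about the peptide bond itself. Because the bond is usually not free to rotate, ω takes one of two planar values: 180 (trans, the normal position) or 0 (cis, rare and found mostly before proline).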

What would happen if both phi and psi are 0?
This is the sterically hindered position. When both are 180, it is the fully extended position.

Explain Ramachandran plot. What is it used for?
A Ramachandran plot is a plot of φ vs. ψ angles. It maps the entire conformational space of a polypeptide and shows the allowed and disallowed conformations. Different amino acids have different preferences of φ-ψ angles.

Some key exceptions to these conformational limitations can be attributed to glycine and proline. The single H side chain of Glycine greatly reduces steric hindrance and expands the possible conformational space. The cyclic bond present in proline reduces the conformational space.

The nature of protein sequence and composition reflects its function. Membrane proteins have more hydrophobic residues. Homologous proteins often have similar sequences. Sequence similarity often implies similar secondary and tertiary structures.

Why don’t we calculate the Ramachandran plot for proline?
Its conformational space is so limited by the ring (φ is held near -60) that it cannot be usefully shown in a general Ramachandran plot.

What is pKa
pKa is the negative log of the acid ionization constant (pKa = -log10 Ka). It measures the tendency of an ionizable group to donate a proton (H+) in an aqueous medium: the lower the pKa, the stronger the acid, and at pH = pKa half of the groups are deprotonated. pKa values of amino acid side chains play an important role in defining the pH-dependent characteristics of a protein, e.g., enzyme activity and protein stability. Many enzymes are active only within a certain pH range.
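
To make the link between pH and pKa concrete, here is a minimal sketch based on the Henderson-Hasselbalch relation. The two pKa values are approximate textbook values for the free amino acid glycine and are used only as an illustration; the function name is an assumption of this example.

```python
def fraction_deprotonated(pH, pKa):
    """Fraction of an ionizable group that has lost its proton.

    Follows from Henderson-Hasselbalch: pH = pKa + log10([A-]/[HA]).
    """
    return 1.0 / (1.0 + 10.0 ** (pKa - pH))

# Approximate values for glycine (assumed for illustration).
PKA_CARBOXYL = 2.3
PKA_AMINO = 9.6

for pH in (1.0, 3.0, 7.0, 11.0):
    coo = fraction_deprotonated(pH, PKA_CARBOXYL)
    nh2 = fraction_deprotonated(pH, PKA_AMINO)
    print(f"pH {pH:4.1f}: COO- fraction {coo:.2f}, NH2 fraction {nh2:.2f}")
```

At pH 7 the carboxyl group is almost fully deprotonated while the amino group is still almost fully protonated, which is the zwitterion.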

What are the 4 levels of protein structures
Primary: the amino acid sequence
Secondary: Local conformation of main-chain atoms (φ and ψ angles), how the amino acids in sequence fold up locally. Helix and strand.
Tertiary: 3-D folding or arrangement of the secondary structural elements and connecting loops in space. Stabilized by vdW, hydrophobic effect, H-bonds, salt bridges, metal coordination, disulfide bonds
Quaternary: 3-D arrangement of multiple subunits, each with a tertiary structure and each a unique gene product. Often symmetrical.

What are domains and motifs
Motifs: limited number of secondary structure elements combined into simple folds.
Domains: several motifs packed in a specific, compact arrangement that in many cases can fold as an independent unit.

What are the different noncovalent forces
Van der Waals forces (0.01-0.2 kcal/mol):
Weak forces between molecules that are brought about by localized charge fluctuations. Can be attractive or repulsive. Major contributor of protein stability.

Hydrophobic effect:
The most powerful force stabilizing protein structure. Basis of force is entropy gain realized by burying hydrophobic residues. Residues are very tightly packed against one another in the protein core.

Hydrogen bonds (2-10 kcal/mol):
Involve the sharing of a hydrogen atom between two electronegative atoms (e.g., O, N). Directional.

Salt bridges (1-5 kcal/mol):
Involve the interaction of (+) and (-) charged side groups (i.e. basic and acidic residues). Strength is influenced by pH, ionic strength, and the local electrostatic environment. Long-range forces.

X-ray Crystallography

What are the steps in X-ray crystallography?

1. protein expression
2. protein purification
3. crystal production
4. x-ray diffraction & phasing
5. data collection - analysis of diffraction patterns
6. model construction - I = A^2 (intensity = square of amplitude)

How do we make crystals?
Crystal formation involves three steps:

1. Nucleation
2. Growth
3. Cessation of Growth

First we supersaturate a protein solution. The goal is to keep the solution in the labile or metastable zone for as long as possible. Nucleation occurs in the labile zone, at higher supersaturation. Crystal growth continues in the metastable zone, where existing nuclei grow but new ones do not form. Vapor diffusion or a similar method is used to grow crystals. Cat whiskers are used for seeding crystal growth.

Why do we need to cool the crystal during a diffraction experiment?
During the experiment, the X-ray beam damages the crystal (largely through free radicals) and heats it, so the diffraction pattern degrades over time. Cooling the crystal, typically to about 100 K, slows radiation damage and reduces thermal motion of the atoms.

What information do we obtain from a diffraction experiment? What do we not obtain?
We get a diffraction pattern, which can then be used to solve the crystal structure. This requires calculating an electron density map. To calculate an electron density map from the diffraction patterns, we require three pieces of information:

1. wavelength λ of the incident x-rays - this is already known
2. amplitude of the scattered x-rays - this can be determined by the intensity of the reflections
3. phase of diffraction - this is not known and cannot be determined from the pattern of reflections.

What do we need to do to obtain the missing information?
Further experiments are usually necessary to determine diffraction phases. The standard approach is to produce heavy atom-containing isomorphous crystals. These crystals have the same structure but produce altered diffraction patterns. This is achieved by soaking the protein crystals in a heavy metal salt solution so that the heavy metal atoms diffuse into spaces originally occupied by the solvent. By comparing the reflections generated by several different isomorphous crystals (MIR - multiple isomorphous replacement), the positions of the heavy atoms can be worked out, and this allows the diffraction phases of the unsubstituted crystal to be deduced.

Using the MIR process we acquire:

- amplitude and phase of heavy atoms
- amplitude of protein
- amplitudes of protein and heavy metal

The phase of the protein can then be estimated from these three amplitudes and one phase. The phase information is then used to construct an electron density map by means of a Fourier transform.

Finally, a structural model is built into the electron density map. This requires one more crucial piece of information - the amino acid sequence because C, O, N atoms cannot be distinguished with certainty by x-ray diffraction so amino acid side chains are difficult to identify.

Explain unit cells and space groups

  • The asymmetric unit is the smallest part of the crystal that has no internal crystallographic symmetry; the rest of the unit cell is generated from it.
  • The unit cell is built by applying symmetry operators to the asymmetric unit, and the crystal by translating the unit cell along the 3 axes (X, Y, Z).
  • The edges of the unit cell define the axes of the crystal (lengths a, b, c and angles α, β, γ).

If an asymmetric unit contains 8 monomers, then it is composed of 8 copies of the protein chain.

Space Groups

Crystals and lattices can be classified into space groups based on the symmetry with which they fill space in a crystal lattice.

The combination of 14 Bravais lattices with 32 point groups and additional translational components such as screw axes and glide planes gives a total of 230 space groups. Of these, only the 65 space groups without mirror planes and inversion centers are possible for protein crystals, because proteins are chiral.

X-rays can diffract with both constructive and destructive interference. Constructive interference occurs when the waves travel in unison, while destructive interference is the opposite. By unison, we mean that the waves are in phase, so their amplitudes add.

The pattern of diffraction allows direct determination of the unit cell and geometry (space group).

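The resolution is calculated from Bragg's law: d_min = λ / (2 sin θ_max), where θ_max is the largest diffraction angle at which reflections are measured. A resolution of 2 angstroms or less is considered high resolution. Anything close to 6 angstroms is considered low resolution.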
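
The resolution formula is easy to evaluate numerically. The sketch below applies Bragg's law with n = 1; the function name and the sample angles are assumptions of this example, and the wavelength is that of a copper K-alpha laboratory source.

```python
import math

def d_spacing(wavelength_angstrom, theta_deg):
    """Bragg's law with n = 1: d = lambda / (2 sin theta)."""
    return wavelength_angstrom / (2.0 * math.sin(math.radians(theta_deg)))

CU_K_ALPHA = 1.5418  # angstroms

# Larger diffraction angles correspond to finer detail (smaller d).
for theta in (5.0, 15.0, 25.0):
    print(f"theta = {theta:4.1f} deg -> d = {d_spacing(CU_K_ALPHA, theta):.2f} angstroms")
```

This shows why reflections far from the beam center carry the high resolution information.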

What is the phase problem?
1. We know the wavelength of incident x-rays
2. We can determine the amplitude of scattered x-rays using the intensity of reflections
3. Phase is not known. MIR process is used to solve the phase problem.

How do we assess the quality of an x-ray crystallography structure?
From its resolution. Less than 2 angstroms is high resolution. Above 6 is low resolution.

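Why can't we use electron microscope to determine crystal structures? Explain method's principle.
The method's principle defines the best resolution that can be achieved with a method: about the wavelength divided by 2. If this is less than the distance between 2 atoms, atomic level resolution is possible in principle. The wavelength of visible light is far larger than the distance between 2 atoms, so a light microscope cannot provide data at the atomic level. The wavelength of electrons is smaller than the distance between 2 atoms, so the electron microscope is limited in practice not by wavelength but by radiation damage, low contrast, and lens aberrations.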

What is R-factor and B-factor?
R-factor, Rfree
Measure of the difference between the structure factors calculated from the model and those obtained from experimental data. i.e. a measure of the differences in the observed and computed diffraction patterns.
High value -> poorer agreement, low value -> better agreement
R-factor < 0.2 is desired
R-factor values in the range 0.4 to 0.6 can be obtained from a totally random structure.
R-free tends to be higher than R
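
As a sketch of how the R-factor is computed, the function below uses the standard definition R = sum(| |Fobs| - |Fcalc| |) / sum(|Fobs|). The amplitudes are invented for the example. R-free uses the same formula, but only over a small set of reflections that was held out of refinement.

```python
def r_factor(f_obs, f_calc):
    """Crystallographic R-factor from observed and calculated amplitudes."""
    if len(f_obs) != len(f_calc) or not f_obs:
        raise ValueError("need two non-empty amplitude lists of equal length")
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    denominator = sum(abs(o) for o in f_obs)
    return numerator / denominator

# Made-up structure factor amplitudes for three reflections.
observed = [10.0, 20.0, 30.0]
calculated = [12.0, 18.0, 33.0]
print(f"R = {r_factor(observed, calculated):.3f}")
```

A perfect model gives R = 0; the value grows as the calculated amplitudes drift away from the observed ones.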

B-factor
Closely related to the positional errors of the atoms
Larger B-factor -> larger positional uncertainty

X-ray crystallography - advantages and disadvantages
Advantages
high resolution
no protein mass limit

Disadvantages
crystals needed
structure is static average
H are usually not seen
possible artifacts due to crystal contacts and precipitation

NMR - advantages and disadvantages?
Advantages

* no chemical modification necessary
* protein in solution: no crystal packing artifacts,
* allows direct binding experiments, hydrodynamic and folding studies
* assignment of labile regions possible: no gaps in structure

Disadvantages

* protein in solution: protein has to be soluble
* insensitive method: requires high concentrations of proteins
* overlap: direct determination of 3D structures for small proteins only (150-200 residues)

Structure Classification

What are the protein structure classification databases? How do they classify proteins? Are the classifications conflicting? Why?
Functional annotation based on protein structure requires a rigorous and standardized system for the classification of different structures. Several different hierarchical classification schemes have been established, which divide proteins first into general classes based on the proportion of various secondary structures they contain, then into successively more specialized groups based on how those structures are arranged. These schemes are implemented in databases such as FSSP, CATH and SCOP. [1]

These databases classify differently:

* FSSP is implemented automatically using DALI
* CATH is semi-automatic, automated with SSAP but the results are curated
* SCOP is fully manual classification

Sometimes they classify the same protein differently. Further confusion is caused by structures which appear very often (superfolds). It is difficult to know whether a given superfold is homologous or analogous.

How is information classified in CATH?
1. Class: the overall secondary-structure content of the domain
2. Architecture: a large-scale grouping of topologies which share particular structural features
3. Topology: high structural similarity but no evidence of homology. Equivalent to a fold in SCOP
4. Homologous superfamily: indicative of a demonstrable evolutionary relationship. Equivalent to the superfamily level of SCOP.

You found a new protein? What should you do next?
Learn about your protein. Start by searching for homologs: BLAST -> PSI-BLAST -> structural alignment.

What is structure classification? Major steps? Motivation?
Protein classification refers to clustering proteins into protein families. It involves breaking a protein chain or complex into its constituent domains and assigning folds to domains. The motivation behind protein classification is the analysis of evolutionary mechanisms and providing data for protein structure prediction methods.

What are folds, domains, and motifs?
A fold refers to the arrangement and connectivity of secondary structure elements. Folds contain information on protein function and distant evolutionary relationships.

Domains are independent folding elements with their own hydrophobic core. Globular units. They are regions with distinct functions. They may be connected to each other rigidly, or loosely.

Motifs do not describe an overall structure. They are parts of a protein that can be found in many other proteins, sometimes with different folds, e.g., the ATP binding motif.

What are the problems in fold classification
1. structure space has a continuous aspect. Important to decide how to divide.
2. russian doll effect. A continuous range of slight size differences will lead to clustering proteins of very different size.
3. motif overlap. A continuous range of overlapping common cores AB > BC > CD will lead to grouping proteins that have no common core.

Compare SCOP and CATH
SCOP describes structural and evolutionary relationships between proteins of known structure. The unit of classification is the protein domain. Classification is done by manual visual inspection with the help of various automatic tools. Many levels exist in the hierarchy; the principal levels are family, superfamily and fold.

CATH more directed toward structural classification,
SCOP pays more attention to evolutionary relationships

In CATH, there is one class to represent mixed alpha-beta.
In SCOP, there are two:
a/b: beta structure is largely parallel, made of beta-alpha-beta (bab) motifs
a+b: alpha and beta structure segregated to different parts of structure

Identifying motifs: PROSITE.

Why can't we identify large proteins with NMR?
Too much signal: the many resonances overlap, and large molecules tumble slowly, which broadens the peaks.

Can we detect lipids with NMR?
Not easily in solution. Lipid assemblies such as membranes are too big and tumble too slowly.

What kind of atoms can you detect with NMR?
1H, 31P, 13C, 15N

Why are the atoms with half spin the best atoms for NMR?
In a constant magnetic field they have just two states, aligned with or against the field, which gives simple spectra with sharp lines.

What is FID
Free induction decay

What is Fourier transform used for?
To convert the FID, a signal recorded as a function of time, into a spectrum of intensity as a function of frequency.
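
A toy example may help here. The sketch below builds a synthetic FID as a decaying cosine and applies a naive discrete Fourier transform to recover its frequency. The signal parameters (12.5 Hz, decay constant 0.3 s, 128 points sampled every 10 ms) are arbitrary assumptions, and a real spectrometer would use a fast Fourier transform on complex data.

```python
import cmath
import math

def synthetic_fid(freq_hz, t2_s, n_points, dt_s):
    """Decaying cosine: a single resonance that relaxes with time constant t2_s."""
    return [math.exp(-k * dt_s / t2_s) * math.cos(2 * math.pi * freq_hz * k * dt_s)
            for k in range(n_points)]

def dft(signal):
    """Naive O(n^2) discrete Fourier transform, written out for clarity."""
    n = len(signal)
    return [sum(signal[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n))
            for m in range(n)]

N_POINTS, DT = 128, 0.01
fid = synthetic_fid(freq_hz=12.5, t2_s=0.3, n_points=N_POINTS, dt_s=DT)

# Keep the first half of the spectrum (frequencies below the Nyquist limit).
magnitudes = [abs(c) for c in dft(fid)][: N_POINTS // 2]
peak_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)
peak_hz = peak_bin / (N_POINTS * DT)
print(f"peak near {peak_hz:.2f} Hz")
```

The time-domain decay shows up in the spectrum as the width of the peak: faster relaxation gives broader lines.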

Summarize NMR
Purified proteins in a solution are placed inside a superconducting magnet. Atoms with half spins align themselves with the magnetic field. An RF signal induces these atoms to jump to an unfavorable state. When they jump back to a favorable state, they emit radio waves which can be measured. The decaying signal is called the FID.

NMR involves:

1. NMR experiment
2. Data collection
3. Spectrum assignment
4. Structure calculation

What is the purpose of secondary structure prediction?
The goal of secondary structure prediction is to assign or predict a secondary structure state (α, β, coil) given an amino acid sequence. We need to predict protein structures since experimental methods are very time consuming. In addition, it is not possible to solve all proteins experimentally, as explained in the x-ray crystallography and NMR sections. Secondary structure prediction is the first step towards structure determination. It is usually followed by tertiary structure determination.

What is the difference between protein structure prediction and structure assignment
* Secondary Structure Assignment: You know the structure and you deduce the secondary structure from this structure.
* Secondary Structure Prediction: You don't know the structure and you deduce it from the sequence.

List structure assignment tools and explain how they assign secondary structures
DSSP: Uses backbone-backbone H-bonds.
Stride: Uses empirically derived H-bond energy and phi-psi torsion angles to assign structures. Optimized to mirror experimental structures
DEFINE: relies on Cα angles
P-CURVE: based on definition of helicoidal parameters
KAKSI: based on Cα and torsion angles

Characteristics of H and E
Signals for alpha helices
* characteristic hydrophobicity profiles
* prolines disrupt the middles of helices
* period of 3.6
* conserved hydrophobics at i, i+3, i+4, i+7

Signals for coils
* gapped in multiple alignments
* small polar residues (Ala, Gly, Ser, Thr)
* prolines rarer in other kinds of secondary structure

How do you measure accuracy of structure prediction
Accuracy of prediction is measured by Q3 (per residue prediction accuracy) or SOV, segment overlap value (per segment prediction accuracy). Q3 gives the percentage of residues correctly predicted in the α, β and other states. SOV tells how well the secondary structural elements have been predicted. It measures:

* number of segments in proteins
* average segment length
* distribution of number of segments with length
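
Q3 is simple enough to compute by hand. The sketch below assumes both the prediction and the reference assignment are given as equal-length strings over three states (H for helix, E for strand, C for everything else); the two example strings are invented.

```python
def q3(predicted, observed):
    """Percentage of residues whose three-state label is predicted correctly."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty state strings of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

observed = "CCHHHHHHCCEEEEECC"   # e.g. from an assignment tool run on the known structure
predicted = "CCHHHHCCCCEEEECCC"  # e.g. from a prediction method
print(f"Q3 = {q3(predicted, observed):.1f}%")
```

Note that this prediction shortens both elements yet still scores well per residue, which is why a segment-based measure such as SOV is reported alongside Q3.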

First generation structure prediction tools were knowledge-based. They used single residue statistics, databases of limited size, and preferred particular residues for certain secondary structure elements. Overall, they have < 55% Q3 accuracy.

The second generation structure prediction tools use machine learning. They use larger databases, produce segment-based statistics, and take neighbors into account. The algorithms use statistical information, sequence patterns, neural networks, etc. ALB, COMBINE, and GORIII are such methods, with roughly 60% Q3 accuracy.

The third generation structure prediction tools use evolutionary information as well. PHD and PSIPRED have ~ 75% +/- 11 Q3 accuracy.

What are secondary structure prediction tools useful for?
* Chain tracing
* Starting point for 3D structure modelling
* Fold recognition
* Homology modelling
* Functional assignment

What is homology modeling?
The ultimate goal of protein modeling is to predict a structure from its sequence with an accuracy that is comparable to the best results achieved experimentally. The idea behind homology modeling is to use experimental 3D-structures of related family members (templates) to calculate a model for a new sequence (target). Homology modeling is based on two observations:

* Protein structure is entirely determined by its amino acid sequence
* Structure is more stable than sequence over evolutionary periods, so similar sequences usually fold into similar structures

Homology modeling involves the following steps:
* target
* template selection
* alignment
* model building
* model evaluation

How accurate or reliable is homology modeling?
Homology models are classified into 3 zones in terms of their accuracy and reliability.

* Midnight Zone: less than 20% sequence identity. The structure cannot reliably be used as a template.
* Twilight Zone: 20% - 40% sequence identity. Sequence identity may imply structural identity.
* Safe Zone: 40% or more sequence identity. It is very likely that sequence identity implies structural identity.
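
The zones above translate directly into a small lookup. In this sketch, percent identity is computed over alignment columns where neither sequence has a gap, which is one of several conventions in use and is an assumption of this example; the function names and the aligned sequences are invented.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Identical residues as a percentage of gap-free alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not columns:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * sum(x == y for x, y in columns) / len(columns)

def homology_zone(identity_percent):
    """Map target-template sequence identity to the zones listed above."""
    if not 0.0 <= identity_percent <= 100.0:
        raise ValueError("identity must be between 0 and 100")
    if identity_percent < 20.0:
        return "midnight zone"
    if identity_percent < 40.0:
        return "twilight zone"
    return "safe zone"

identity = percent_identity("ACDEFG", "ACE-FG")
print(f"{identity:.0f}% identity -> {homology_zone(identity)}")
```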

How good can homology modeling be?
* 60 - 100% sequence identity: comparable to medium resolution NMR. Can be used to study substrate specificity.
* 30 - 60%: comparable to molecular replacement in crystallography. Supports site-directed mutagenesis through visualization.
* < 30%: serious errors.

What is ab initio modeling?
Ab initio structure prediction seeks to predict the native conformation of a protein from amino acid sequence alone. Comparative modeling depends on finding a suitable template structure. In the absence of a suitable structure, ab initio prediction is the only method. A typical procedure would be to define a mathematical representation of a polypeptide chain and the surrounding solvent, define an energy function that accurately represents the physicochemical properties of proteins, and use an algorithm to search for a chain conformation which possesses the minimum free energy. The problem with ab initio methods is that even short polypeptide chains can fold into an astronomically large number of structures.

How do we quantify similarity between molecules?
RMSD
Small RMSD, many atoms = good alignment.
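
For reference, RMSD is the square root of the mean squared distance between matched atoms. The sketch below is deliberately simplified: it removes the translation by moving both structures to a common centroid but does not search for the best rotation, which a full superposition would also do. The function names and coordinates are assumptions of this example.

```python
import math

def centroid(coords):
    n = len(coords)
    return tuple(sum(p[k] for p in coords) / n for k in range(3))

def rmsd_after_centering(a, b):
    """RMSD between matched atoms after removing translation (no rotation fit)."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    ca, cb = centroid(a), centroid(b)
    total = 0.0
    for p, q in zip(a, b):
        total += sum(((p[k] - ca[k]) - (q[k] - cb[k])) ** 2 for k in range(3))
    return math.sqrt(total / len(a))

structure_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
shifted_copy = [(x + 5.0, y - 2.0, z + 1.0) for x, y, z in structure_a]
print(rmsd_after_centering(structure_a, shifted_copy))
```

Because RMSD is an average, it should always be quoted together with the number of atoms that were matched.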

Structural alignment vs. structural superposition
Superposition assumes that one knows of at least some residues that match between protein structures A and B. This is an easy problem with an exact solution: find the translation and rotation that minimize RMSD. O(n) complexity.
Structural alignment: the matching residues are not known in advance. We must determine which atoms to align. NP-hard problem.

How can we solve structural alignment problems?
Two approaches:
1. compare two proteins directly
2. compare structural features of each protein separately
Most methods are able to identify obvious similarities easily.

Steps in structural alignment
1. Structural description of protein A and B
2. optimize the alignment between A and B
3. Measure the statistical significance of alignment against some random set of comparison

Alignment algorithms
1. Point-based methods use points (distances) to establish correspondences
2. Secondary structure based methods use vectors representing secondary structures to establish correspondences

What are the advantages and disadvantages of the distance matrix approach
Advantages
- invariant with respect to rotation and translation
Disadvantages
- the distance matrix is O(n^2) for a protein with n residues
- comparing distance matrix is a hard problem
- insensitive to chirality
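
Two of these points, the invariance and the insensitivity to chirality, can be checked on a toy example. The four points below are invented and stand in for Cα positions; `math.dist` requires Python 3.8 or later.

```python
import math

def distance_matrix(coords):
    """All pairwise distances: n * n entries for n points."""
    return [[math.dist(p, q) for q in coords] for p in coords]

def rotate_z_and_shift(coords, angle_deg, shift):
    """Rigid-body move: rotation about the z axis followed by a translation."""
    c, s = math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg))
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
            for x, y, z in coords]

def mirror(coords):
    """Reflection through the yz plane, which flips handedness."""
    return [(-x, y, z) for x, y, z in coords]

def same_matrix(m1, m2, tol=1e-9):
    return all(abs(a - b) <= tol for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2))

points = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0), (1.5, 1.5, 1.5)]
reference = distance_matrix(points)
moved = distance_matrix(rotate_z_and_shift(points, 40.0, (3.0, -2.0, 7.0)))
mirrored = distance_matrix(mirror(points))
print(same_matrix(reference, moved), same_matrix(reference, mirrored))
```

The moved copy and the mirror image both give the same matrix as the original, so a method based only on distances cannot tell a structure from its mirror image.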

Structal
Uses dynamic programming (DP) iteratively to refine an arbitrary starting alignment.
STEPS:
1. Start with any set of correspondences between two structures (sequence alignment, secondary structure alignment…).
2. Compute a score matrix by computing a score between all pairs of points based on their distance.
3. Trace back through the score matrix to find a new set of correspondences that maximizes the score (standard DP)
4. Iterate 2 and 3 until score doesn’t change.
Note: The method is heuristic: no guarantees of success, depends on quality of starting structure.
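
Steps 2 and 3 can be sketched as a single pass of global dynamic programming. This is an illustration rather than the actual program: the score function M / (1 + (d/d0)^2), the parameter values, and the linear gap penalty are assumptions, the two coordinate sets are taken to be already superposed, and the iteration of step 4 is left out.

```python
import math

M, D0, GAP = 20.0, 2.24, 10.0  # assumed, illustrative parameters

def score_matrix(a, b):
    """Step 2: score every pair of points; close pairs score near M."""
    return [[M / (1.0 + (math.dist(p, q) / D0) ** 2) for q in b] for p in a]

def dp_align(scores, gap=GAP):
    """Step 3: best-scoring set of in-order correspondences, with traceback."""
    n, m = len(scores), len(scores[0])
    # best[i][j]: best score for the first i points of A and first j points of B
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = -gap * i
    for j in range(1, m + 1):
        best[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j - 1] + scores[i - 1][j - 1],
                             best[i - 1][j] - gap,
                             best[i][j - 1] - gap)
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j - 1] + scores[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs, best[n][m]

chain_a = [(3.8 * k, 0.0, 0.0) for k in range(5)]
chain_b = [chain_a[0], chain_a[1], chain_a[3], chain_a[4]]  # third point missing
pairs, total = dp_align(score_matrix(chain_a, chain_b))
print(pairs, total)
```

In the full method the new correspondences would be used to superpose the structures again before recomputing the score matrix.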

Dali - Distance Alignment
Uses distance matrix to find similar patterns of distances, indicating correspondences.
Find all pairs of matching hexapeptides in the two proteins to be compared
Concatenate matching hexapeptides using simulated annealing.
Aim: Maximize the number of atoms, minimize RMSD
The assembly step is done in a random fashion because the search space is too large to explore exhaustively.

SARF2
Uses vectors associated with secondary structures to do quick screen for similar structures, followed by refinement of distances.
1. Identify secondary structure elements (SSEs) and represent them as vectors.
2. Compare between pairs of SSEs in different proteins
Pairs of SSEs having the same orientation are marked
Orientation between 2 SSEs is defined by 5 parameters
3. Search for the largest ensemble of the compatible pairs of SSEs
NP-complete problem, done by graph theory algorithms (maximum clique)
4. Extension and refinement of the match

CE - dynamic programming
COMPARER - secondary structure and hydrophobic clusters
VAST - secondary structure, monte carlo

How can we evaluate structural alignments?
1. Number of amino acid correspondences created.
2. RMSD of corresponding amino acids
3. Percent identity in aligned residues
4. Number of gaps introduced
5. Size of the two proteins
6. Conservation of known active site environments …

Most statistical significance scores are based on geometric properties. Some measure protein similarity. If the z-score is greater than 3, then the two structures are significantly similar (assuming a normal distribution).
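
A z-score expresses how far an alignment score lies from the scores of random comparisons, in units of their standard deviation. The sketch below uses the sample standard deviation, which is an assumption of this example, and invented background scores.

```python
import statistics

def z_score(score, background_scores):
    """(score - mean) / standard deviation of the background distribution."""
    mean = statistics.mean(background_scores)
    sd = statistics.stdev(background_scores)
    if sd == 0:
        raise ValueError("background scores have zero spread")
    return (score - mean) / sd

# Made-up alignment scores from comparisons against unrelated structures.
background = [10.0, 12.0, 14.0, 16.0, 18.0]
z = z_score(30.0, background)
print(f"z = {z:.2f}, significant: {z > 3}")
```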

Multiple structure alignment servers
MASS, multiprot, SSM, Mammoth