Analysis of PDB entries
Purpose of this exercise. The purpose of this exercise is to analyze the
statistical distribution of selected geometrical parameters of the
protein structures deposited in the Protein Data Bank (PDB), to
compare the results with the geometrical targets of Engh and Huber
[Acta Cryst. A47, 392 (1991)] (EH), which are used as restraints
in crystallographic refinement, and to draw conclusions about the
accuracy and precision of protein structures determined by
crystallography at different levels of resolution.
Step 1
Selection of PDB entries
- Open a new Excel spreadsheet.
- Open the RCSB PDB site in a browser window.
- Use the “Advanced Search” tool to select about 10 protein structures
with the highest resolution (e.g. 0.5-0.8 Å). In another run,
you will repeat this exercise for medium-resolution structures (e.g.
2.00-2.01 Å), and later for low-resolution structures (e.g. 3.0-3.1 Å).
- Tabulate a Custom Report, including Resolution, R-factor, and Rfree.
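For readers who prefer to script the selection, the following is a minimal sketch of picking about 10 entries from a resolution window. It assumes that the Custom Report has been saved as a CSV file; the column names (`id`, `resolution`, `r_factor`, `r_free`) and the rows shown are hypothetical placeholders, not real PDB entries or the actual export format of the RCSB site.

```python
import csv
import io


def select_by_resolution(rows, lo, hi, limit=10):
    """Return up to `limit` rows whose resolution lies in [lo, hi] (in Å),
    ordered from the highest resolution (smallest number) downwards."""
    picked = [r for r in rows if lo <= float(r["resolution"]) <= hi]
    picked.sort(key=lambda r: float(r["resolution"]))
    return picked[:limit]


# Placeholder report text; identifiers and numbers are made up for illustration.
REPORT = """id,resolution,r_factor,r_free
ENTRY_A,0.75,0.10,0.12
ENTRY_B,2.00,0.19,0.23
ENTRY_C,0.60,0.09,0.11
ENTRY_D,3.05,0.24,0.29
"""

rows = list(csv.DictReader(io.StringIO(REPORT)))
high_res = select_by_resolution(rows, 0.5, 0.8)
print([r["id"] for r in high_res])
```

The same function can be called again with the medium-resolution (2.00-2.01 Å) and low-resolution (3.0-3.1 Å) windows.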
Step 2
Getting the geometrical parameters
- For each of the selected structures, click on its PDB code, and then
select the “Geometry” tab to produce a summary of various geometrical
parameters calculated for this structure (Bond Lengths, Bond Angles,
Torsion Angles).
- For each parameter listed by “Geometry”, you will be able to see its
minimum, maximum, average value, and sample standard deviation, as well
as the EH standard with its standard deviation (uncertainty).
- By clicking on the number of instances (Tot Num) of a selected
parameter, for example the C-N bond, you will generate a listing of all
the observations of this specific parameter in that structure. (Note
that some parameters are listed separately for different residues; for
instance, the length of the peptide bond is listed separately for
Aaa-Pro peptides, as C-N(P), and separately for all other peptides, as
C-N.)
- Copy the listed parameters and paste them into a selected column of
the Excel spreadsheet.
- Repeat this operation with all selected structures, appending in each
case the parameter-value list at the end of the current spreadsheet
column.
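The pooling of the per-structure listings into one column, and the comparison of an observed value with its EH target, can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the numbers in the usage lines are placeholders chosen for easy arithmetic, not EH targets or values taken from any PDB entry.

```python
def pool_values(per_structure):
    """Concatenate the value lists of all structures into one list,
    mirroring the appending of each listing to the same spreadsheet column.

    `per_structure` maps a structure label to its list of observed values.
    """
    pooled = []
    for values in per_structure.values():
        pooled.extend(values)
    return pooled


def deviation_in_sigmas(value, eh_target, eh_sigma):
    """Express the difference between an observed value and the EH target
    in units of the EH standard deviation."""
    return (value - eh_target) / eh_sigma


# Placeholder data: labels and numbers are made up for illustration only.
listings = {
    "structure_1": [1.33, 1.34],
    "structure_2": [1.32],
}
column = pool_values(listings)
print(len(column), "values pooled")
print(deviation_in_sigmas(1.35, eh_target=1.33, eh_sigma=0.01))
```

In a real run, `eh_target` and `eh_sigma` would be read from the EH standard shown on the “Geometry” page for the parameter being analyzed.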
Step 3
Calculating the statistics
- Note the number of entries in your Excel listing, and the column
where the values are stored.
- Using Excel statistical functions, calculate the average value and
the sample standard deviation for your list of numbers.
- Sort the values of the listed parameters in ascending order and note
the minimum and maximum value; then undo the sorting operation.
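The same summary can be cross-checked outside the spreadsheet. The sketch below uses only the Python standard library; the function name `summarize` is a hypothetical helper, and the input list stands for the pooled column of values.

```python
import statistics


def summarize(values):
    """Return the number of entries, mean, sample standard deviation,
    minimum and maximum of a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        # statistics.stdev uses the n-1 denominator (sample standard deviation).
        "stdev": statistics.stdev(values),
        # min() and max() make the sort-and-undo step unnecessary.
        "min": min(values),
        "max": max(values),
    }


print(summarize([1.0, 2.0, 3.0]))
```

Note that `statistics.stdev` requires at least two values, which is not a limitation for the pooled listings used in this exercise.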
Step 4
Generating a histogram
- Divide the range of the values (from the minimum to the maximum) into
nine intervals, and store the numbers defining the interval boundaries
in ascending order in one column of your spreadsheet.
- Generate a histogram for the distribution of the geometrical
parameter that you have been analyzing.
- Repeat this exercise for medium- and then for low-resolution
structures.
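The binning step can be illustrated with a short sketch that divides the range between the minimum and the maximum into nine equal intervals and counts the values falling into each. Equal-width intervals and the treatment of the maximum value (counted in the last interval) are assumptions of this sketch; a spreadsheet histogram tool may assign boundary values differently.

```python
def histogram(values, n_bins=9):
    """Split [min, max] into `n_bins` equal intervals and count the values
    in each. Returns (edges, counts), with len(edges) == n_bins + 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; the range has zero width")
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land in a non-existent tenth interval,
        # so it is clamped into the last one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return edges, counts


edges, counts = histogram([float(i) for i in range(10)])
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.2f} to {right:6.2f}: {'*' * count}")
```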
Step 5
Special case of torsion angles
- Conduct the same analysis for the ω torsion angle defined around the
peptide bond (Cα-C-N-Cα).
- Note that the peptide bonds have to be analyzed separately for trans
peptides (ω close to 180°) and for cis peptides (ω close to 0°).
- Note that while the convention requires torsion angles to take values
only from -180° to +180°, the PDB listings sometimes show
non-conventional values (larger than +180° or smaller than -180°). They
have to be converted to the conventional range.
- Note that simple statistics fail for planar torsion angles with values
close to ±180°. For instance, if you have two torsion angles nearly
equal to 180°, e.g. -179° and +179°, their mean value is incorrectly
calculated as 0° instead of 180°. In order to be able to use standard
statistical functions, you have to transform the ω torsion angles to a
contiguous range, by converting the negative values to their positive
equivalents (ω') within the 0-360° range. A simple formula to do this
is as follows: ω' = [1 - sign(ω)]∙180° + ω.
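The two conversions described in this step can be sketched as follows. The function names are hypothetical. The first function brings any value back to the conventional range, and the second applies the formula for ω' given above. The sketch assumes that the transformation is applied to the trans group, where the values cluster around ±180°.

```python
import statistics


def wrap_to_conventional(omega):
    """Map any angle in degrees to the conventional range (-180, +180]."""
    wrapped = (omega + 180.0) % 360.0 - 180.0
    # The modulo returns -180 for the boundary; report it as +180.
    return 180.0 if wrapped == -180.0 else wrapped


def sign(x):
    """Return -1, 0 or +1, like the spreadsheet SIGN function."""
    return (x > 0) - (x < 0)


def to_contiguous(omega):
    """omega' = [1 - sign(omega)] * 180 + omega, for omega in (-180, +180].

    Negative angles are shifted by +360; positive angles are unchanged.
    Caveat: sign(0) is 0, so an angle of exactly 0 would be mapped to 180.
    """
    return (1 - sign(omega)) * 180.0 + omega


omegas = [-179.0, 179.0]
print(statistics.mean(omegas))                              # naive mean: 0.0
print(statistics.mean([to_contiguous(w) for w in omegas]))  # 180.0
print(wrap_to_conventional(181.0))                          # -179.0
```

For cis peptides the values cluster around 0°, so the conventional -180° to +180° range is already contiguous there and the ω' transformation is not needed.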
Mariusz Jaskolski,
02.09.2006