Analysis of PDB entries

**Purpose of this exercise.** The purpose of this exercise is to analyze the
statistical distribution of selected geometrical parameters of the
protein structures deposited in the Protein Data Bank (**PDB**),
compare the results with the geometrical targets of **Engh and Huber**
[Acta Cryst. **A47**, 392 (1991)] (**EH**), which are used as restraints
in crystallographic refinement, and draw conclusions about the
**accuracy and precision** of protein structures determined by
crystallography at different levels of **resolution**.

**Step 1**

**Selection of PDB entries**

- Open a new Excel spreadsheet.
- Open the RCSB PDB site in a browser window.
- Use the “Advanced Search” tool to select about 10 **protein** structures with the highest **resolution** (e.g. 0.5-0.8 Å). In another run, you will repeat this exercise for medium-resolution structures (e.g. 2.00-2.01 Å), and later for low-resolution structures (e.g. 3.0-3.1 Å).
- Tabulate a Custom Report, including Resolution, R-factor, and Rfree.

**Step 2**

**Getting the geometrical parameters**

- For each of the selected structures, click on its PDB code, and then select the “Geometry” tab to produce a summary of various geometrical parameters calculated for this structure (Bond Lengths, Bond Angles, Torsion Angles).
- For each parameter listed by “Geometry”, you will be able to see its minimum, maximum, average value, and sample standard deviation, as well as the EH standard with its standard deviation (uncertainty).
- By clicking on the number of instances (**Tot Num**) of a selected parameter, for example the C-N bond, you will generate a listing of all the observations of this specific parameter in that structure. (Note that some parameters are listed separately for different residues; for instance, the length of the peptide bond is listed separately for Aaa-Pro peptides, as C-N(P), and separately for all other peptides, as C-N.)
- Copy the listed parameters and paste them into a selected column of the Excel spreadsheet.
- Repeat this operation with all selected structures, appending in each case the parameter-value list at the end of the current spreadsheet column.

**Step 3**

**Calculating the statistics**

- Note the number of entries in your Excel listing, and the column where the values are stored.
- Using Excel statistical functions, calculate the average value and the sample standard deviation for your list of numbers.
- Sort the values of the listed parameters in ascending order and note the minimum and maximum value; then undo the sorting operation.
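
For readers who want to cross-check the spreadsheet results, the same statistics can be sketched in Python with the standard library only. The bond lengths below are made-up placeholder values, not data from any PDB entry; in practice they would be the column of values copied from the “Geometry” listing.

```python
import statistics

# Hypothetical C-N bond lengths in Å (placeholders, not real PDB data).
bond_lengths = [1.329, 1.336, 1.331, 1.341, 1.325, 1.334, 1.338, 1.330]

n = len(bond_lengths)                        # number of entries
mean = statistics.mean(bond_lengths)         # average value
sample_sd = statistics.stdev(bond_lengths)   # sample standard deviation (n - 1 denominator)
minimum = min(bond_lengths)                  # no sorting needed to find the extremes
maximum = max(bond_lengths)

print(f"N = {n}, mean = {mean:.4f} Å, s = {sample_sd:.4f} Å")
print(f"min = {minimum:.3f} Å, max = {maximum:.3f} Å")
```

`statistics.stdev` uses the n - 1 denominator, so it should agree with a sample (not population) standard deviation computed in the spreadsheet.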

**Step 4**

**Generating a histogram**

- Divide the range of the values into nine intervals, and store the numbers defining the ranges in ascending order in one column of your spreadsheet.
- Generate a histogram for the distribution of the geometrical parameter that you have been analyzing.
- Repeat this exercise for medium- and then for low-resolution structures.
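
The binning step can also be sketched in Python. This is a minimal illustration, assuming nine equal-width intervals between the minimum and maximum value; the helper name `histogram` and the bond lengths are hypothetical, and the treatment of values that fall exactly on an interval boundary may differ from that of the spreadsheet tool you use.

```python
def histogram(values, n_bins=9):
    """Count values in n_bins equal-width intervals spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        raise ValueError("all values are identical; cannot define intervals")
    counts = [0] * n_bins
    for v in values:
        # The maximum would land one past the last interval, so clamp it.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    # n_bins + 1 boundaries define the n_bins intervals.
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


# Hypothetical C-N bond lengths in Å (placeholders, not real PDB data).
bond_lengths = [1.329, 1.336, 1.331, 1.341, 1.325, 1.334, 1.338, 1.330]

edges, counts = histogram(bond_lengths)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:.4f}-{right:.4f} Å  {'#' * count}")
```

With only a handful of values most intervals are empty; the shape of the distribution becomes meaningful once the listings from all selected structures have been appended.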

**Step 5**

**Special case of torsion angles**

- Conduct the same analysis for the ω torsion angle defined around the peptide bond (Cα-N-C-Cα).
- Note that the peptide bonds have to be analyzed separately for trans peptides (ω close to 180°) and for cis peptides (ω close to 0°).
- Note that while the convention requires torsion angles to take values only from -180° to +180°, the PDB listings sometimes show non-conventional values (larger than +180° or smaller than -180°). They have to be converted to the conventional range by adding or subtracting 360°.
- Note that simple statistics fails for planar torsion angles. For instance, if you have two torsion angles nearly equal to 180°, e.g. -179° and +179°, their mean value is incorrectly calculated as 0°. In order to be able to use standard statistical functions, you have to transform the ω torsion angles to a contiguous range, by converting the negative values to their positive equivalents (ω') within the 0-360° range. A simple formula to do this is as follows: ω' = [1 - sign(ω)]∙180 + ω. This transformation is meant for the trans peptides; the cis values, which cluster around 0°, are already contiguous and can be averaged directly.
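
The two conversions described above can be sketched in Python as follows. The function names and the ω values are hypothetical. `to_conventional` wraps any angle into the range from -180° (inclusive) to +180° (exclusive), and `to_contiguous` applies the formula ω' = [1 - sign(ω)]∙180 + ω literally, with sign(0) taken as 0, as Excel's SIGN function does.

```python
import statistics


def to_conventional(omega):
    """Wrap an angle in degrees into the range [-180, 180)."""
    return (omega + 180.0) % 360.0 - 180.0


def to_contiguous(omega):
    """Apply ω' = [1 - sign(ω)]·180 + ω (intended for trans peptides)."""
    sign = (omega > 0) - (omega < 0)  # -1, 0 or +1
    return (1 - sign) * 180.0 + omega


# Hypothetical trans-peptide ω values in degrees; 181.3 is non-conventional.
omegas = [-179.0, 179.0, 178.5, -177.2, 181.3]

conventional = [to_conventional(w) for w in omegas]
naive_mean = statistics.mean(conventional)       # misleading value far from 180°

contiguous = [to_contiguous(w) for w in conventional]
mean = statistics.mean(contiguous)               # meaningful value near 180°
sample_sd = statistics.stdev(contiguous)

print(f"naive mean = {naive_mean:.2f}°")
print(f"mean = {mean:.2f}°, s = {sample_sd:.2f}°")
```

Note that with sign(0) = 0 the formula maps ω = 0° to 180°, which is one more reason to apply it only to the trans subset. In Python, `omega % 360.0` yields the same 0-360° values for nonzero angles and leaves 0° unchanged, so it may be a convenient alternative.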

Mariusz Jaskolski, 02.09.2006