Practical 1

Bioinformatics tools for protein analysis


The purpose of this exercise is to:

Practical remarks

Like all our other computer-based practicals, this exercise is quite complicated in places and requires you to organize your work well. You will need to work in several windows simultaneously, so arrange your screen accordingly. You will need the following windows:

Your report

As you go through this and all other exercises, you will be asked to complete tasks and assignments. Write your answers in a Report Document, which should be completed and presented on request. In addition to your answers to the assignments and problems, you are welcome to include any other observations and results that you find interesting, as well as comments on both the positive and negative aspects of the exercise. The questions that you are explicitly asked to answer are shown in bold. If there are terms that you do not understand, it is a good idea to compile your own glossary at the end of the Report Document, with the help of the tutor, handbooks, or Internet resources such as Wikipedia. If you include any graphics in your Report Document, please use moderate or low resolution so that the final document does not exceed 1 MB.


This write-up is designed to guide you step by step through the exercise. Should you have any difficulties or problems, a tutor will be present in the computer lab to help you. You may also use the Consultation Hours (Monday, 11:00-12:00, Department of Crystallography) to discuss your problems.

Getting started

The website of the Center for Biocrystallographic Research (CBB) in Poznan is a convenient launchpad for a number of biocrystallography-related databases and servers. Familiarize yourself with the CBB. When was the Center created? What are the research projects pursued in the Center? In your opinion, which one is the most interesting? Why?

The global bioinformatics servers and databases used in this exercise

You will need to access the following global bioinformatics resources via the Internet:

HIV Protease

In the next part of the exercise you will be working with the amino acid sequence of a protein from the HIV retrovirus. This protein is an enzyme called protease. Proteases are hydrolytic enzymes that digest (cleave) other proteins. The cleavage takes place at a peptide bond, which is why proteases belong to a larger family of enzymes called peptidases. Proteases that cleave peptide bonds far from the termini of the substrate are called endopeptidases. Those that remove one amino acid from the end of the chain are exopeptidases. Exopeptidases are divided into aminopeptidases (acting at the N-terminus) and carboxypeptidases (acting at the C-terminus). The HIV protease is an endopeptidase.

From the point of view of the catalytic mechanism, proteases are divided into four classes: aspartic proteases (the active site contains two aspartic acid residues), also known as aspartyl or acid proteases because they usually work at very low pH (for instance, pepsin works best at about pH 2, in the hydrochloric acid of your stomach); serine proteases (the active-site nucleophile is a serine residue); cysteine proteases (the active-site nucleophile is a cysteine residue); and metalloproteases (they have a metal-cation-assisted mechanism). The HIV protease is an unusual aspartic protease.

First, learn more about retroviruses and their proteases by reading the short article "HIV Protease - function and structure". Please answer the questions posed in this article.
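The distinction between endopeptidases and exopeptidases can be illustrated with a toy Python sketch that treats a peptide as a string of one-letter codes written from the N-terminus to the C-terminus. The function names and the example peptide are hypothetical and serve only as an illustration; real proteases recognize specific sequence and structural contexts, which this sketch ignores.

```python
def aminopeptidase_cut(peptide):
    """Exopeptidase: release one residue from the N-terminal end."""
    return peptide[0], peptide[1:]


def carboxypeptidase_cut(peptide):
    """Exopeptidase: release one residue from the C-terminal end."""
    return peptide[:-1], peptide[-1]


def endopeptidase_cut(peptide, bond):
    """Endopeptidase: cleave the internal bond after residue number `bond` (1-based)."""
    # Reject bonds that would release a single terminal residue,
    # since that is what an exopeptidase does.
    if not 1 < bond < len(peptide) - 1:
        raise ValueError("bond must lie away from both termini")
    return peptide[:bond], peptide[bond:]


toy_peptide = "ACDEFGHIK"  # hypothetical sequence, not a real substrate

print(aminopeptidase_cut(toy_peptide))    # ('A', 'CDEFGHIK')
print(carboxypeptidase_cut(toy_peptide))  # ('ACDEFGHI', 'K')
print(endopeptidase_cut(toy_peptide, 4))  # ('ACDE', 'FGHIK')
```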

The exercise step-by-step

Step 1: Reference entry from PDB

In this exercise you will be analyzing the sequences (primary structure) of retroviral proteases. However, since retroviruses encode their proteins in large genes, each coding for many proteins, it is not straightforward to find the isolated amino acid sequences of retroviral proteases in protein (or DNA) sequence databases. To circumvent this difficulty, we will first consult the PDB, looking for entries that report the structure of retroviral proteases. In those studies, the sequences of the investigated proteins have usually been precisely tailored to correspond strictly to the mature protease. Thus, although the use of full structural information is not strictly the purpose of this exercise, we will start by searching the PDB for the first determined structure of a retroviral protease.

Step 2: Sequence editing

Step 3: Sequence database text search

Step 4: Sequence database Blast search

Using your Sequence as the query, you will now search all the available genetic and protein sequence information, looking for similar sequences. This is a formidable task because the global sequence databases store an enormous volume of sequence data, counted in billions. Yet this search can be performed very efficiently using the program Blast.
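One reason such a search can be fast is that Blast does not compare the query with every database sequence residue by residue. It first looks for short matching "words" and only then examines the surrounding regions. The Python sketch below illustrates this seeding idea in a deliberately simplified form: it finds exact three-letter word matches only, whereas the real program also accepts similar words and then extends and scores the hits. The function names and both sequences are hypothetical.

```python
from collections import defaultdict


def index_words(sequence, k=3):
    """Map every k-letter word in `sequence` to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def find_seeds(query, subject, k=3):
    """Return sorted (query_pos, subject_pos) pairs of identical k-letter words."""
    query_index = index_words(query, k)
    seeds = []
    for j in range(len(subject) - k + 1):
        # Look up the subject word in the query index instead of scanning the query.
        for i in query_index.get(subject[j:j + k], []):
            seeds.append((i, j))
    return sorted(seeds)


query = "MKTLDTGADDAV"    # hypothetical query sequence
subject = "GGSLDTGADKTV"  # hypothetical database sequence

print(find_seeds(query, subject))  # [(3, 3), (4, 4), (5, 5), (6, 6)]
```

The four consecutive seeds lie on one diagonal, which is the kind of region a real search would go on to extend into a local alignment.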

Step 5: Sequence alignment using ClustalW

In your notepad, you now have many sequences in Fasta format. They should look something like this:
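As a hedged illustration of the Fasta layout (a header line beginning with ">" followed by one or more lines of sequence), the Python sketch below parses a small block of text into (header, sequence) pairs. The two records are invented placeholders, not real protease sequences, and the function name `parse_fasta` is an assumption made for this example.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from Fasta-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Invented placeholder records, for illustration only.
example = """>toy_protease_1 hypothetical record
ACDEFGHIKL
MNPQRS
>toy_protease_2 hypothetical record
ACDEYGHIKL
"""

for header, sequence in parse_fasta(example):
    print(header, len(sequence), sequence)
```

A check like this is a quick way to confirm that every record in your notepad has a header and that no sequence was truncated while copying.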


Step 6: Analysis of results

In the output, the sequences will be aligned according to the conservation of their corresponding residues, with the most similar sequences at the top and the least similar at the bottom. The degree of similarity is reflected by the darkness of the background: dark gray indicates identical residues; light gray indicates residues that are similar, or identical in only some of the sequences; white indicates no similarity.

While it is easy to detect identical residues, the assignment of "similarity" is more problematic. Evidently, aspartic acid (D) and glutamic acid (E) are similar, as are aspartic acid (D) and asparagine (N). Residues with aliphatic side chains are also similar, for instance leucine (L), isoleucine (I), valine (V) and alanine (A), but of course the degree of similarity decreases from L to A as the chain gets shorter. Analogously, residues with aromatic side chains are certainly similar. There is now a lot of knowledge and experience in this field, and it is usually expressed in the form of a matrix in which the pairwise similarities are described numerically.

In addition, below the aligned sequences you will find characters such as * (identity) and : or . (decreasing degrees of similarity), which summarize the conservation of all the compared sequences at each position. If there is no mark under a given column, no reliable homology could be detected at this position; even though the residues are printed in one column in such a place, this may simply be a consequence of better agreement upstream and downstream in the sequence.

Dashes within a sequence indicate a gap. Sequence alignment programs are reluctant to introduce gaps and usually impose a heavy penalty for each such break. This is because gaps correspond to insertions or deletions of whole stretches of sequence, which are much less likely than a simple point mutation changing the identity of a given residue. (Sometimes such point mutations in the genomic sequence do not even show up in the protein sequence because of the degeneracy of the genetic code.) You can easily explore the Amino acid properties in the ClustalW results.
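The way the marks under the alignment are assigned can be sketched in Python. The residue groups below are the "strong" and "weak" conservation groups commonly quoted for Clustal; treat them as an assumption here rather than as the exact lists used by the server. The function names and the three aligned sequences are hypothetical.

```python
# Assumed Clustal-style conservation groups (not taken from this write-up).
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = set(column)
    if "-" in residues:
        return " "  # a gap in any sequence leaves the column unmarked
    if len(residues) == 1:
        return "*"  # identical residue in every sequence
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"  # all residues fall within one strongly similar group
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."  # all residues fall within one weakly similar group
    return " "


def conservation_line(aligned):
    """Build the line of marks printed below a set of aligned sequences."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must have equal length")
    return "".join(column_symbol(col) for col in zip(*aligned))


alignment = ["LDTGADEC", "IDTGS-DS", "VDSGAKNA"]  # hypothetical alignment
for seq in alignment:
    print(seq)
print(conservation_line(alignment))  # :*:*: :.
```

Here the first column (L, I, V) earns ":" because all three residues are aliphatic, the column containing the dash is left blank, and the last column (C, S, A) earns only ".".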


In this exercise you have learned:

Can you spell out and explain the abbreviations in the above list?



If you have finished your assignments quickly, take a few moments and test your knowledge of molecular biology by taking Swiss-Quiz from the ExPASy website. Alternatively, or in addition, you may try my PX (Protein Crystallography) Quiz. Unfortunately, it is a mixture of English and Polish questions.
Last modification: December 11, 2010 (MJ)