Practical 1

Bioinformatics tools for protein analysis

Aims

The purpose of this exercise is to:

Give you an idea of the existing global databases storing information about protein sequences

Give you some preliminary experience with the use of computer tools for protein sequence analysis

Familiarize you with the basic biochemical facts about the HIV retrovirus

Analyze the sequence of the protease from HIV-1 and other retroviruses

Prepare you for the next Practicals, which will concentrate on structural aspects of proteins

Practical remarks

This exercise (like all our other computer-based practicals) is in places quite complicated and requires good organization of your work. You will need to work in several computer windows simultaneously, so you should arrange your screen accordingly. You will need the following windows:

A browser window with the text of the exercise (this window)

A browser window, where you will be able to access recommended reading (an introduction to HIV protease, or recommended papers)

Another browser window for access to URL locations recommended in the exercise (for convenience, it may be sometimes necessary to open more than one such window)

A notepad for jotting notes to be used later in the exercise (e.g. pasting amino acid sequences for later use)

A text editor window, where you will be writing your report from this exercise

Your report

As you go through this and all other exercises, there will be tasks and assignments that you will be asked to complete. Your answers should be written in a Report Document that should be completed and presented on request. In addition to your answers to the assignments and problems, you are welcome to include in your report any other observations and results that you think are interesting, as well as comments and remarks regarding both positive and negative aspects of the exercise. The questions that you are explicitly asked to answer are shown in bold. If there are terms that you do not understand, it is a good idea to compile your own glossary at the end of the Report Document, with the help of the tutor, handbooks, or Internet resources such as Wikipedia. If you include any graphics in your Report Document, please use moderate or low resolution to limit the size of the final document to 1 MB.

Help

The exercise is prepared in such a way that you will be guided and instructed step-by-step what to do by this write-up. Should you have any difficulties or problems, there will be a tutor present in the computer lab ready to help you. You may also want to use the Consultation Hours (Monday, 11:00-12:00, Department of Crystallography) to discuss your problems.

Getting started

The website of the Center for Biocrystallographic Research (CBB) in Poznan (www.man.poznan.pl/CBB) is a convenient launchpad for a number of biocrystallography-related databases and servers. Familiarize yourself with the CBB. When was the Center created? What are the research projects pursued in the Center? In your opinion, which one is the most interesting? Why?

The global bioinformatics servers and databases used in this exercise

You will need to access via the Internet the following global bioinformatics resources:

The Entrez server at NCBI (National Center for Biotechnology Information) of the NIH (National Institutes of Health) in the USA. It allows you to launch searches within a number of databases.

One of them is PubMed, a database of biomedical literature citations and abstracts. You will be able to search for articles of interest using, for example, the author [au] and date (year) of publication [dp] field tags.

The ExPASy (Expert Protein Analysis System) server of the Swiss Institute of Bioinformatics (SIB). From ExPASy you will be able to search the UniProt Knowledgebase, which contains the Swiss-Prot (curated) and TrEMBL (computer annotated) databases of protein sequences developed by SIB and EBI (European Bioinformatics Institute) of the EMBL (European Molecular Biology Laboratory)

The RCSB Protein Data Bank (PDB) containing experimentally determined atomic coordinates of all publicly accessible macromolecular structures

A server (3-to-1) for translating amino acid sequences from the 3-letter to the 1-letter code
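The PubMed field tags mentioned above can be combined into a single query. The following is only a minimal Python sketch: the function names are made up for this illustration, and the address used is the NCBI E-utilities "esearch" service, assumed here merely as a convenient way to turn the query into a link. Nothing is sent over the network.

```python
from urllib.parse import urlencode

# NCBI E-utilities search service (used here only to build a link).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_term(author, year):
    """Combine an author [au] tag and a publication date [dp] tag."""
    return f"{author}[au] AND {year}[dp]"


def pubmed_search_url(author, year):
    """Return a search URL for the query; no request is actually made."""
    query = {"db": "pubmed", "term": pubmed_term(author, year)}
    return ESEARCH_URL + "?" + urlencode(query)


print(pubmed_term("Wlodawer", 1989))
print(pubmed_search_url("Wlodawer", 1989))
```

The same query text can simply be typed into the PubMed search box.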

HIV Protease

In the next part of the exercise you will be working with the amino acid sequence of a protein from the HIV retrovirus. This protein is an enzyme called protease. Proteases are hydrolytic enzymes that digest (cleave) other proteins. The cleavage takes place at a peptide bond, which is why proteases belong to a larger family of enzymes called peptidases. Proteases that cleave far from the terminal peptide bonds of the substrate are called endopeptidases. If they remove one amino acid from the end, they are exopeptidases. Exopeptidases are divided into aminopeptidases (acting at the N-terminus) and carboxypeptidases (acting at the C-terminus). The HIV protease is an endopeptidase. From the point of view of the catalytic mechanism, proteases are divided into aspartic proteases (the active site contains two aspartic acid residues), also known as acid proteases because they usually work at very low pH (for instance, pepsin works best at about pH 2 in the hydrochloric acid of your stomach), serine proteases (the active-site nucleophile is a serine residue), cysteine proteases (the active-site nucleophile is a cysteine residue) and metalloproteases (they have a metal-cation-assisted mechanism). The HIV protease is an unusual aspartic protease. First, learn more about retroviruses and their proteases by reading the short article HIV Protease - function and structure. Please answer the questions posed in this article.

The exercise step-by-step

Step 1: Reference entry from PDB

In this exercise you will be analyzing the sequences (primary structure) of retroviral proteases. However, since retroviruses encode their proteins in large genes, each coding for many proteins, it is not straightforward to find the isolated amino acid sequences corresponding to retroviral proteases in protein (or DNA) sequence databases. To circumvent this difficulty, we will first consult the PDB, looking for entries reporting the structure of retroviral proteases. In those studies, the sequences of the investigated proteins have usually been precisely tailored to correspond strictly to the mature protease. Thus, although the use of full structural information is not strictly the purpose of this exercise, we will start by searching the PDB for the first determined structure of a retroviral protease.

Using the Advanced Search of the PDB website, search for the first RSV protease structure. You may use Author=Wlodawer and Deposit Date in 1989.
You may Sort the Results by Release Date
Note the PDB accession code of the structure of interest
Click on the "document" icon to get the original file with the deposit. At the end, it contains a large block of data with the atomic coordinates corresponding to this structure
However, you are interested in the records marked SEQRES (sequence)
Copy the amino acid sequence using the right mouse button; please note that the sequence may be repeated in the SEQRES records several times (attributed to different chains identified as A, B, etc.) if there are several protein chains in the crystallographic Asymmetric Unit (ASU), for instance because the protein forms oligomers (has quaternary structure)
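If you prefer not to pick out the SEQRES records by hand, the same extraction can be sketched in a few lines of Python. This is only an illustration: the function name is made up, the toy file below is not a real PDB entry (it contains just the first ten residues quoted later in this exercise, repeated for two chains), and the parsing assumes that every SEQRES record carries a non-blank chain identifier.

```python
def read_seqres(pdb_text):
    """Collect the 3-letter residue names of each chain from SEQRES records."""
    chains = {}
    for line in pdb_text.splitlines():
        if not line.startswith("SEQRES"):
            continue  # skip coordinates and all other record types
        fields = line.split()
        # fields: "SEQRES", record number, chain ID, chain length, residues...
        chain_id = fields[2]
        chains.setdefault(chain_id, []).extend(fields[4:])
    return chains


# Toy input for illustration only (NOT a real PDB file).
toy_file = """\
HEADER    TOY EXAMPLE
SEQRES   1 A   10  LEU ALA MET THR MET GLU HIS LYS ASP ARG
SEQRES   1 B   10  LEU ALA MET THR MET GLU HIS LYS ASP ARG
END
"""

for chain_id, residues in read_seqres(toy_file).items():
    print(chain_id, len(residues), " ".join(residues))
```

Since identical chains simply repeat the same sequence, one chain is enough for the following steps.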

Step 2: Sequence editing

In a new browser window open the 3-to-1 amino acid code translator. Paste your 3-letter coded amino acid sequence in the program's window. Additionally you may also want to paste it in your notepad.
Edit the pasted sequence to remove any unwanted text, e.g. the "SEQRES" labels. Only the sequence and blank spaces can remain.
Click the Three...->One... button.
Copy the 1-letter sequence (do NOT copy any asterisks, etc.)
This will be your reference sequence in case of any doubts in the subsequent search of the UniProt database
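What the 3-to-1 translator does can also be sketched in Python, which may help you check its output. The function name is made up for this illustration, and only the 20 standard amino acids are handled; anything else (including leftover "SEQRES" labels) raises an error instead of being silently skipped.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def three_to_one(text):
    """Translate whitespace-separated 3-letter codes into a 1-letter string."""
    letters = []
    for code in text.split():
        code = code.upper()
        if code not in THREE_TO_ONE:
            raise ValueError(f"not a standard amino acid code: {code!r}")
        letters.append(THREE_TO_ONE[code])
    return "".join(letters)


print(three_to_one("LEU ALA MET THR MET GLU HIS LYS ASP ARG"))  # LAMTMEHKDR
```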

Step 3: Sequence database text search

Go to ExPASy.
In the Swiss-Prot/TrEMBL search window type Rous sarcoma virus protease.
You can narrow the number of hits by specifying virus as the organism and protease as the protein name, but this may not be necessary
Click the Accession number of the first entry that appears to contain the protease sequence
Read carefully the contents of this entry; if you find in the Sequence annotation section a subentry for a protease Chain, that's great! You've found a retroviral protease amino acid sequence in the sequence databases that contain the information about many millions of genes, and store the information about many billions of amino acids in their sequences
The identified protease sequence should be highlighted. If the search has been really successful, this sequence should be about 100 residues long. Remember this criterion: if at any point you find a "retroviral protease" with a number of residues that is significantly different - something is amiss!
Copy to your notepad the 1-letter code from the Sequence window, together with the important ">xxx..." line, which is a simple annotation. This layout is the so-called Fasta format. You can put any other/additional comment on the ">xxx..." line, for instance "RSV PR".
In the Hits pull-down menu, ask for 500 Hits.
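The Fasta layout described above is simple enough to produce yourself. Below is a minimal Python sketch; the function name and the line width of 60 characters are choices made for this illustration, and the example sequence is only the ten-residue fragment quoted in this exercise, not a complete protease.

```python
def to_fasta(label, sequence, width=60):
    """Return one Fasta record: a '>' annotation line, then the sequence."""
    cleaned = "".join(sequence.split()).upper()  # drop spaces and line breaks
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return f">{label}\n" + "\n".join(lines) + "\n"


print(to_fasta("RSV PR (fragment)", "lamtm ehkdr"))
```

Keeping every sequence in this form in your notepad will save you editing later, when several sequences have to be pasted together.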

Step 4: Sequence database Blast search

Using your Sequence as a template, you will now search all the available genetic and protein sequence information, looking for similar sequences. This is a formidable task because the global sequence databases store many billions of residues of sequence data. Yet this search can be performed very efficiently using the program Blast.

Make sure there are 500 Hits selected.
Press Blast
You can Customize display asking to display 100 hits on one page.
You will now have a difficult (tedious) task of selecting retroviral protease sequences from among the hits. Many entries will repeat essentially the same sequence (for instance different HIV-1 strains isolated from different AIDS patients). You want to select only one HIV-1 hit, but one that has the complete protease sequence included. Likewise with other retroviruses. The annotated sequences (gold star) are to be preferred.
As a suggestion, look for Avian leukosis virus (ALV), Avian myeloblastosis associated virus (MAV), Lymphoproliferative disease virus (LDV), Simian retrovirus (SRV), Mason-Pfizer monkey virus (M-PMV), Feline immunodeficiency virus (FIV), Mouse mammary tumor virus (MMTV), Simian immunodeficiency virus (SIV), HIV-1, Human T-cell leukemia virus (HTLV-1)
Most of the hits correspond to (slightly) different genetic sequences of the HIV-1 virus. You can have an idea what a tremendous amount of sequence data is being fed to the database on an almost hourly basis!
Some of our hits look strange (too long!), for instance from SRV or M-PMV. We will see what's going on later.
In all cases that look suspicious (e.g. SIV, FIV) try to verify the protease sequence through reference to PDB (see Step 1).
Unfortunately, among the 500 Blast hits there will be important sequences still missing. As a suggestion, you should run UniProt text searches as in Step 3 for Equine infectious anemia virus protease (EIAV PR).
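The selection rules of this step (one hit per virus, annotated entries preferred, suspicious lengths set aside for checking) can be summarized in a short Python sketch. Everything here is hypothetical: the hit records are invented toy data, the field names are made up, and the length window of 90-130 residues is only an assumed reading of "about 100 residues".

```python
def pick_representatives(hits, min_len=90, max_len=130):
    """Keep one plausible hit per virus; set aside hits of suspicious length."""
    kept = {}
    suspicious = []
    for hit in hits:
        if not (min_len <= hit["length"] <= max_len):
            suspicious.append(hit)  # verify these against the PDB (Step 1)
            continue
        current = kept.get(hit["virus"])
        # Take the first hit for a virus, or upgrade to an annotated one.
        if current is None or (hit["annotated"] and not current["annotated"]):
            kept[hit["virus"]] = hit
    return kept, suspicious


# Invented toy hits, for illustration only.
toy_hits = [
    {"id": "hit1", "virus": "HIV-1", "length": 99, "annotated": False},
    {"id": "hit2", "virus": "HIV-1", "length": 99, "annotated": True},
    {"id": "hit3", "virus": "RSV", "length": 124, "annotated": True},
    {"id": "hit4", "virus": "SRV", "length": 300, "annotated": True},
]

kept, suspicious = pick_representatives(toy_hits)
print(sorted(kept), [hit["id"] for hit in suspicious])
```

A hit that lands in the suspicious list is not necessarily wrong; it only means that its sequence should be checked before it is used.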

Step 5: Sequence alignment using ClustalW

In your notepad, you have now many sequences in Fasta format. They should look something like this:

>RSV PR
LAMTMEHKDR.....
>ALV PR
LAMTMEHKDR.....
>MLV PR
LAMTMEHKDR.....
>HIV-1 PR
PQITLWQRPL.....
>EIAV PR
VTYNLEKRPT.....
>SRV PR
SRKSLTTPSG.....
>SIV PR
PQFHLWKRPV.....
.....

You will now align those sequences using the program ClustalW available from the ExPASy site.
Select the Align tab.
Copy the sequences to the UniProt window and press Align
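Before pasting, it is worth checking that your notepad really contains well-formed Fasta records. A minimal Python sketch of such a check follows; the function name is made up, and the two example sequences are only the ten-residue fragments shown above, not complete proteases.

```python
def parse_fasta(text):
    """Split multi-record Fasta text into a {label: sequence} dictionary."""
    records = {}
    label = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            label = line[1:].strip()
            if label in records:
                raise ValueError(f"duplicate label: {label}")
            records[label] = ""
        elif label is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            records[label] += line.upper()
    return records


notepad = """
>RSV PR
LAMTMEHKDR
>HIV-1 PR
PQITLWQRPL
"""

for name, seq in parse_fasta(notepad).items():
    print(f"{name}: {len(seq)} residues")
```

With your complete sequences, the printed lengths give a quick check of the "about 100 residues" criterion from Step 3.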

Step 6: Analysis of results

In the output, the sequences will be aligned according to the conservation of their corresponding residues, with the most similar sequences at the top and the least similar at the bottom. The degree of similarity is reflected by the darkness of the background: dark gray indicates identical residues; light gray indicates residues that are similar, or identical but only in some sequences; white indicates lack of any similarity. While it is easy to detect identical residues, the assignment of "similarity" is more problematic. Evidently, aspartic acid (D) and glutamic acid (E) are similar, and aspartic acid (D) and asparagine (N) are also similar. Residues with aliphatic side chains are similar as well, for instance leucine (L), isoleucine (I), valine (V), alanine (A), but of course the degree of similarity decreases from L to A as the chain gets shorter and shorter. Analogously, residues with aromatic side chains are certainly similar. There is now a lot of knowledge and experience in this field, and this knowledge is usually expressed in the form of a matrix, where the pairwise similarities can be described numerically. In addition, below the aligned sequences, you will find characters like * (identity) and : or . highlighting the degree of similarity of all the compared sequences at each position. If there is no mark under a given column, no reliable homology could be detected at this position; even though the residues are typed in one column in such a place, this may be simply a consequence of a better agreement down- and up-stream in the sequence. Dashes within a sequence indicate a gap. Sequence alignment programs are not very happy about gaps and usually count a huge penalty for the necessity of such a break. This is because gaps must be connected with insertions or deletions of long chunks of sequence, which are much less likely than a simple point mutation changing the identity of a given residue. (Sometimes such point mutations in the genomic sequence do not even show up in the protein sequence because of the degeneracy of the genetic code.) You can easily explore the Amino acid properties in the ClustalW results.
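How the marks under each column could be assigned is sketched below in Python. This is a simplified illustration, not the actual ClustalW code: the "strong" and "weak" residue groups are quoted as I recall them from the Clustal documentation and should be treated as an assumption, the function names are made up, and the three aligned sequences are invented toy strings.

```python
# Residue groups (an assumption, recalled from the Clustal documentation).
STRONG = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
        "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column):
    """Return '*', ':', '.' or ' ' for one column of an alignment."""
    residues = set(column)
    if "-" in residues:
        return " "  # a gap in any sequence: no mark
    if len(residues) == 1:
        return "*"  # identical in all sequences
    if any(residues <= set(group) for group in STRONG):
        return ":"  # all residues fall within one strongly similar group
    if any(residues <= set(group) for group in WEAK):
        return "."  # all residues fall within one weakly similar group
    return " "


def consensus_line(aligned):
    """Build the line of marks for a list of equal-length aligned sequences."""
    return "".join(column_symbol(column) for column in zip(*aligned))


toy_alignment = ["DLGA", "DIGV", "ELGV"]  # invented toy sequences
print(consensus_line(toy_alignment))  # ::*.
```

Note that the pair L/A gets no mark in this scheme, in line with the remark above that similarity fades as the aliphatic chain gets shorter.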

Looking at the alignment, decide which sequence is the least homologous. Note also the number of identities (*) and similarities (* plus : plus .).
Now, eliminate this black sheep from the lot and run ClustalW again.
Analyze the results as before. Perhaps you would like to do more pruning? But remember, you want to leave at least two sequences for alignment ;-).
In a satisfactory final alignment note: (1) the most similar region(s), (2) the least similar regions. What is your interpretation of these regions?
Identify the active site of the enzymes. What can you say about its conservation?
Now select only two sequences for comparison, for instance RSV/HIV or HIV/HTLV.
Run a pairwise sequence alignment, which is conceptually somewhat different from the multiple sequence alignment above.
In the pairwise alignment output, count the number of identities nI (*) and the number of similarities nS (* plus : plus .). If you express these numbers as fractions relative to the number of residues in the shorter sequence (n1), you will have the percentage indicators of identity (100*nI/n1 %) and similarity (100*nS/n1 %).
In your particular example, do you think the compared sequences are very similar, not particularly similar, not similar at all? Explain!
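The counting of identities and similarities described above can be done by hand, or with a short Python sketch like the one below. The function name is made up for this illustration, and the line of marks is an invented toy example, not the result of a real alignment.

```python
def identity_and_similarity(marks, len_a, len_b):
    """Return (% identity, % similarity) relative to the shorter sequence."""
    n_identical = marks.count("*")                                # nI
    n_similar = n_identical + marks.count(":") + marks.count(".")  # nS
    n_short = min(len_a, len_b)                                   # n1
    return 100.0 * n_identical / n_short, 100.0 * n_similar / n_short


# Toy marks line: 3 identities, 1 ':' and 1 '.'; sequences of 5 and 6 residues.
pct_identity, pct_similarity = identity_and_similarity("*:. **", 5, 6)
print(pct_identity, pct_similarity)  # 60.0 100.0
```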

Summary

In this exercise you have learned:

About retroviruses, such as HIV-1, HIV-2, SIV, M-PMV, RSV (ASV), FIV, EIAV, HTLV

About aspartic proteases, and about retroviral proteases in particular

About global bioinformatics-related databases and servers, such as ExPASy, UniProt, Swiss-Prot, TrEMBL, PDB, Entrez, PubMed

About some important bioinformatics-related institutions, such as NIH, NCBI, SIB, EBI, EMBL, RCSB

About some programs for protein sequence analysis, such as Blast, Fasta, and ClustalW

Can you spell out and explain the abbreviations in the above list?

Homework

From the website of FEBS Journal, download the educational article Protein crystallography for non-crystallographers, or how to get the best (but not more) from published macromolecular structures (vol. 275, 1-21 (2008)) (also available here) and read it in preparation for the next Practical classes.

Fun

If you have finished your assignments quickly, take a few moments and test your knowledge of molecular biology by taking Swiss-Quiz from the ExPASy website. Alternatively, or in addition, you may try my PX (Protein Crystallography) Quiz. Unfortunately, it is a mixture of English and Polish questions.

Last modification: December 11, 2010 (MJ)