tools for protein analysis
The purpose of this exercise is to:
- Give you an idea of the existing global databases storing information about protein sequences
- Give you some preliminary experience with the use of computer tools for protein sequence analysis
- Familiarize you with the basic biochemical facts about the HIV retrovirus
- Analyze the sequence of the protease from HIV-1 and other retroviruses
- Prepare you for the next Practicals, which will concentrate on structural aspects of proteins
This exercise (like all our other computer-based practicals) is quite complicated in places and requires good organization of your work. You will need to work in several computer windows simultaneously, so you should organize your screen properly. You will need the following:
- A browser window with the text of the exercise (this window)
- A browser window where you will be able to access recommended reading (an introduction to HIV protease, or recommended papers)
- A browser window for access to the URL locations recommended in the exercise (for convenience, it may sometimes be necessary to open more than one such window)
- A notepad for jotting down notes to be used later in the exercise (e.g. pasting amino acid sequences for later use)
- A text editor window, where you will be writing your report from this exercise
As you go through this and all other exercises, you will be asked to complete tasks and assignments. Your answers should be written in a Report Document, which should be completed and presented on request. In addition to your answers to the assignments and problems, you are welcome to include in your report any other observations and results that you think are interesting, as well as comments and remarks regarding both positive and negative aspects of the exercise. The questions that you are explicitly asked to answer are shown in bold. If there are terms that you do not understand, it would be a good idea to compile your own glossary at the end of the Report Document, with the help of the tutor, handbooks, or Internet resources such as Wikipedia. If you include any graphics in your Report Document, please use moderate or low resolution to limit the size of the final document to 1 MB.
The exercise is prepared in such a way that this write-up guides you through it step by step. Should you have any difficulties or problems, a tutor will be present in the computer lab, ready to help you. You may also want to use the Consultation Hours (Monday, 11:00-12:00, Department of Crystallography) to discuss your problems.
The website of the Center for Biocrystallographic Research (CBB) in
Poznan (www.man.poznan.pl/CBB) is a convenient launchpad for a number
of biocrystallography-related databases and servers. Familiarize
yourself with the CBB. When was the Center created? What are the research
projects pursued in the Center? In your opinion, which one is the
most interesting? Why?
The global bioinformatics servers and databases used in this exercise
You will need to access via the Internet the following global bioinformatics resources:
- The Entrez server at NCBI (National Center for Biotechnology Information) of the NIH (National Institutes of Health) in the USA. It allows you to launch searches within a number of databases. One of them is PubMed, a database of biomedical literature citations and abstracts. You will be able to search for articles of interest using, for example, the author [au] and date (year) of publication [dp] keywords
- The ExPASy (Expert Protein Analysis System) server of the Swiss Institute of Bioinformatics (SIB). From ExPASy you will be able to search the UniProt Knowledgebase, which contains the Swiss-Prot (curated) and TrEMBL (computer-annotated) databases of protein sequences developed by SIB and EBI (European Bioinformatics Institute) of the EMBL (European Molecular Biology Laboratory)
- The RCSB Protein Data Bank (PDB), containing experimentally determined atomic coordinates of all publicly accessible macromolecular structures
- A server for translation of amino acid code from 3- to 1-letter (3-to-1)
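The [au] and [dp] keywords can also be combined into a single PubMed query. The sketch below only builds the query text and a search URL; it does not contact any server. The use of the NCBI E-utilities "esearch" address is an assumption about how such a search could be automated (the exercise itself uses the web form), and the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Address of the NCBI E-utilities "esearch" service. Using it here is an
# assumption for illustration; the exercise uses the Entrez web page.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_term(author, year):
    """Combine an author [au] and a publication date [dp] into one query."""
    return f"{author}[au] AND {year}[dp]"


def pubmed_search_url(author, year, max_hits=20):
    """Return a URL asking PubMed for articles by `author` published in `year`."""
    params = {
        "db": "pubmed",
        "term": pubmed_term(author, year),
        "retmax": max_hits,  # upper limit on the number of returned records
    }
    return ESEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(pubmed_term("Wlodawer A", 1989))
    print(pubmed_search_url("Wlodawer A", 1989))
```

The same query text (author[au] AND year[dp]) can be typed directly into the PubMed search box.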
In the next part of the exercise you will be working with the amino acid sequence of a protein from the HIV retrovirus. This protein is an enzyme called protease. Proteases are hydrolytic enzymes that digest (cleave) other proteins. The cleavage takes place at a peptide bond, which is why proteases belong to a larger family of enzymes called peptidases. Proteases that cleave peptide bonds far from the termini of the substrate are called endopeptidases. If they remove one amino acid from the end, they are exopeptidases. Exopeptidases are divided into aminopeptidases and carboxypeptidases. The HIV protease is an endopeptidase. From the point of view of the catalytic mechanism, proteases are divided into aspartic proteases (the active site contains two aspartic acid residues), also known as aspartyl or acid proteases because they usually work at very low pH (for instance, pepsin works best at about pH 2 in the hydrochloric acid of your stomach); serine proteases (the active-site nucleophile is a serine residue); cysteine proteases (the active-site nucleophile is a cysteine residue); and metalloproteases (they have a metal-cation-assisted mechanism). The HIV protease is an unusual aspartic protease. First, learn more about retroviruses and their proteases by reading the short article HIV Protease - function and structure. Please answer the questions posed in this article.
The exercise step-by-step
Step 1: Reference entry from PDB
In this exercise you will be analyzing the sequences (primary structure) of retroviral proteases. However, since retroviruses encode their proteins in large genes, each coding for many proteins, it is not straightforward to find the isolated amino acid sequences corresponding to retroviral proteases in protein (or DNA) sequence databases. To circumvent this difficulty, we will first consult the PDB, looking for entries reporting the structure of retroviral proteases. In those studies, the sequences of the investigated proteins have usually been precisely tailored to correspond strictly to the mature protease. Thus, although the use of full structural information is not strictly the purpose of this exercise, we will start by searching the PDB for the first determined structure of a retroviral protease.
- Using the Advanced Search of the PDB website, search for the first structure of the protease from Rous sarcoma virus (RSV). You may use Author=Wlodawer and Deposit Date in 1989.
- You may Sort the Results by Release Date
- Note the PDB accession code of the structure of interest
- Click on the "document" icon to get the original file with the deposit. At the end, it contains a large block of data with the atomic coordinates corresponding to this structure
- However, you are interested in the records marked SEQRES (sequence)
- Copy the amino acid sequence using the right mouse button; please note that the sequence may be repeated in the SEQRES records several times (attributed to different chains identified as A, B, etc.) if there are several protein chains in the crystallographic Asymmetric Unit (ASU), for instance because the protein forms oligomers (has quaternary structure)
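The SEQRES step above can also be sketched in a few lines of Python. This is a minimal illustration, assuming the standard fixed-column PDB layout in which the chain identifier sits in column 12 and the residue names start in column 20. The function names and the toy records are hypothetical and do not come from any real PDB entry.

```python
def read_seqres(pdb_text):
    """Collect the 3-letter residue names of each chain from SEQRES records."""
    chains = {}
    for line in pdb_text.splitlines():
        # Skip everything except complete SEQRES lines (ATOM, HEADER, ...).
        if not line.startswith("SEQRES") or len(line) < 20:
            continue
        chain_id = line[11]          # column 12: chain identifier
        residues = line[19:].split() # columns 20 onward: residue names
        chains.setdefault(chain_id, []).extend(residues)
    return chains


def unique_sequences(chains):
    """Drop chains that merely repeat a sequence already seen (e.g. A and B)."""
    seen = []
    for residues in chains.values():
        if residues not in seen:
            seen.append(residues)
    return seen


# Toy records in PDB layout, invented for illustration only.
TOY_RECORDS = "\n".join([
    "HEADER    TOY EXAMPLE, NOT A REAL ENTRY",
    "SEQRES   1 A    5  LEU ALA MET THR GLY",
    "SEQRES   1 B    5  LEU ALA MET THR GLY",
    "ATOM      1  N   LEU A   1       0.000   0.000   0.000",
])

if __name__ == "__main__":
    chains = read_seqres(TOY_RECORDS)
    print(sorted(chains))            # chain identifiers found
    print(unique_sequences(chains))  # identical chains are reported once
```

If two chains in the ASU carry the same sequence, as in a homodimer, only one copy needs to be kept for the later database searches.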
Step 2: Sequence editing
- In a new browser window open the 3-to-1 amino acid code translator. Paste your 3-letter coded amino acid sequence in the program's window. You may also want to paste it in your notepad.
- Edit the pasted sequence to remove any unwanted text, e.g. the "SEQRES" labels. Only the sequence and blank spaces should remain.
- Click the Three...->One... button.
- Copy the 1-letter sequence (do NOT copy any asterisks, etc.)
- This will be your reference sequence in case of any doubts in the subsequent search of the UniProt database
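For comparison, the 3-to-1 translation can be sketched in Python. The table below covers only the 20 standard amino acids. Treating any other three-letter token as "X" and silently skipping labels, numbers and chain identifiers are assumptions of this sketch, not the behaviour of the web translator.

```python
# Standard 3-letter to 1-letter amino acid codes.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def three_to_one(text):
    """Translate whitespace-separated 3-letter codes into a 1-letter string."""
    letters = []
    for token in text.upper().split():
        if token in THREE_TO_ONE:
            letters.append(THREE_TO_ONE[token])
        elif len(token) == 3 and token.isalpha():
            # Unknown residue name (assumption: mark it with X).
            letters.append("X")
        # Anything else ("SEQRES", line numbers, chain IDs) is ignored.
    return "".join(letters)


if __name__ == "__main__":
    # A toy line, not taken from a real entry.
    print(three_to_one("SEQRES   1 A    5  LEU ALA MET THR GLY"))
```

Because labels and numbers are skipped, whole SEQRES lines can be passed in without manual editing, but it is still worth checking the result against the length given in the record.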
Step 3: Sequence database text search
- Go to ExPASy.
- In the Swiss-Prot/TrEMBL search window type Rous sarcoma virus
- You can narrow down the hits by specifying the virus as the organism and protease as the protein name, but this may not be necessary
- Click the Accession number of the first entry that appears to contain the protease sequence
- Read the contents of this entry carefully; if you find a subentry for a protease Chain in the Sequence annotation section, that's great! You've found a retroviral protease amino acid sequence in sequence databases that hold information about many millions of genes and many billions of amino acid residues
- The identified protease sequence should be highlighted. If the search has been really successful, this sequence should be about 100 residues long. Remember this criterion: if at any point you find a "retroviral protease" with a significantly different number of residues, something is amiss!
- Copy to your notepad the 1-letter code from the Sequence window, together with the important ">xxx..." line, which is a simple annotation. This layout is the so-called Fasta format. You can put any other or additional comment on the ">xxx..." line, for instance "RSV PR".
- In the Hits pull-down menu, ask for 500 Hits.
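The Fasta layout and the "about 100 residues" criterion from the steps above can be checked with a short script. This is only a sketch: the function names are hypothetical, and the tolerance of 30 residues around the expected length is an arbitrary assumption that you may want to adjust.

```python
def to_fasta(header, sequence, width=60):
    """Format one record: a '>' annotation line, then the wrapped sequence."""
    lines = [">" + header]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines)


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from Fasta-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.replace(" ", "").upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def plausible_protease_length(sequence, expected=100, tolerance=30):
    """True if the length is near 100 residues (tolerance is an assumption)."""
    return abs(len(sequence) - expected) <= tolerance


if __name__ == "__main__":
    toy = to_fasta("toy PR (made-up sequence)", "ACDEFGHIKL" * 10)
    for name, seq in parse_fasta(toy):
        print(name, len(seq), plausible_protease_length(seq))
```

A sequence that fails the length check is not necessarily wrong, but it deserves a second look before it goes into the alignment.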
Step 4: Sequence database Blast search
Using your sequence as the query, you will now search all the available genetic and protein sequence information, looking for similar sequences. This is a formidable task because the global sequence databases store many billions of residues of sequence data. Yet this search can be performed very efficiently using the program Blast.
- Make sure there are 500 Hits selected.
- Press Blast
- You can Customize display asking to display 100 hits on one page.
- You now face the difficult (tedious) task of selecting retroviral protease sequences from among the hits. Many entries will repeat essentially the same sequence (for instance different HIV-1 strains isolated from different AIDS patients). You want to select only one HIV-1 hit, but one that includes the complete protease sequence. Likewise with other retroviruses. The annotated sequences (gold star) are to be preferred.
- As a suggestion, look for Avian leukosis virus (ALV), Avian myeloblastosis associated virus (MAV), Lymphoproliferative disease virus (LDV), Simian retrovirus (SRV), Mason-Pfizer monkey virus (M-PMV), Feline immunodeficiency virus (FIV), Mouse mammary tumor virus (MMTV), Simian immunodeficiency virus (SIV), HIV-1, Human T-cell leukemia virus (HTLV-1)
- Most of the hits correspond to (slightly) different genetic sequences of the HIV-1 virus. This gives you an idea of what a tremendous amount of sequence data is being fed to the database on an almost hourly basis!
- Some of our hits look strange (too long!), for instance from SRV or M-PMV. We will see what's going on later.
- In all cases that look suspicious (e.g. SIV, FIV) try to verify the protease sequence through reference to PDB (see Step 1).
- Unfortunately, among the 500 Blast hits there will be important sequences still missing. As a suggestion, you should run UniProt text searches as in Step 3 for Equine infectious anemia virus protease (EIAV PR).
Step 5: Sequence alignment using ClustalW
In your notepad, you now have many sequences in Fasta format, each consisting of a ">xxx..." annotation line followed by the 1-letter sequence.
- You will now align those sequences using the program ClustalW available from the ExPASy site.
- Select the Align tab.
- Copy the sequences to the UniProt window and press Align
Step 6: Analysis of results
In the output, the sequences will be aligned according to the conservation of their corresponding residues, with the most similar sequences at the top and the least similar at the bottom. The degree of similarity is reflected by the darkness of the background: dark gray indicates identical residues; light gray indicates residues that are similar, or identical in only some of the sequences; white indicates a lack of any similarity. While it is easy to detect identical residues, the assignment of "similarity" is more problematic. Evidently, aspartic acid (D) and glutamic acid (E) are similar, and so are aspartic acid (D) and asparagine (N). Residues with aliphatic side chains are also similar, for instance leucine (L), isoleucine (I), valine (V), and alanine (A), but of course the degree of similarity decreases from L to A as the chain gets shorter and shorter. Analogously, residues with aromatic side chains are certainly similar. There is now a lot of knowledge and experience in this field, usually expressed in the form of a matrix in which the pairwise similarities are described numerically. In addition, below the aligned sequences, you will find characters like * (identity) and : or . (similarity) highlighting the degree of similarity of all the compared sequences at each position. If there is no mark under a given column, no reliable similarity could be detected at this position; even though the residues are typed in one column in such a place, this may be simply a consequence of a better agreement down- and up-stream in the sequence. Dashes within a sequence indicate a gap. Sequence alignment programs are not very happy about gaps and usually assign a huge penalty for such a break. This is because gaps must be connected with insertions or deletions of long chunks of sequence, which are much less likely than a simple point mutation changing the identity of a given residue. (Sometimes such point mutations in the genomic sequence do not even show up in the protein sequence because of the degeneracy of the genetic code.) You can easily explore the Amino acid properties in the ClustalW results.
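To make the meaning of the *, : and . marks more concrete, here is a minimal sketch of how such a line can be derived from aligned sequences. The residue groups are the "strong" and "weak" conservation groups as commonly quoted in the Clustal documentation; treat them as an assumption and check them against the version of the program you are using. Real programs derive these groups from a substitution matrix, which this sketch does not do.

```python
# Conservation groups as commonly listed for Clustal (assumption: verify
# against the documentation of the program version you use).
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK",
                 "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column):
    """Return '*', ':', '.' or ' ' for one column of an alignment."""
    residues = set(column.upper())
    if "-" in residues:
        return " "   # a gap in the column: no mark
    if len(residues) == 1:
        return "*"   # every sequence has the same residue
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"   # all residues fall within one strong group
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."   # all residues fall within one weak group
    return " "


def consensus_line(aligned_sequences):
    """Build the mark line for sequences of equal (aligned) length."""
    return "".join(column_symbol("".join(col)) for col in zip(*aligned_sequences))


if __name__ == "__main__":
    # Two made-up aligned fragments; '-' marks a gap.
    print(repr(consensus_line(["DEAL-", "DDVAW"])))
```

With these groups, D/E and D/N are marked as strongly similar, A/V only as weakly similar, and L/A receive no mark, in line with the remark that similarity among aliphatic residues decreases as the side chain gets shorter.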
- Looking at the alignment, decide which sequence is the least similar to the others. Note also the number of identities (*) and similarities (* plus : plus .).
- Now, eliminate this black sheep from the lot and run ClustalW again.
- Analyze the results as before. Perhaps you would like to do more pruning? But remember, you want to leave at least two sequences for alignment ;-).
- In a satisfactory final alignment note: (1) the most similar region(s), (2) the least similar regions. What is your interpretation of these regions?
- Identify the active site of the enzymes. What can you say about its conservation?
- Now select only two sequences for comparison, for instance RSV/HIV or HIV/HTLV.
- Run a pairwise sequence alignment, which is conceptually somewhat different from the multiple sequence alignment above.
- In the pairwise alignment output, count the number of identities nI (*) and the number of similarities nS (* plus : plus .). If you express these numbers as fractions relative to the number of residues in the shorter sequence (n1), you will have the percentage indicators of identity (100*nI/n1 %) and similarity (100*nS/n1 %).
- In your particular example, do you think the compared sequences are very similar, not particularly similar, not similar at all? Explain!
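The identity and similarity percentages defined above (100*nI/n1 and 100*nS/n1) are easy to compute once you have the line of *, : and . marks. The sketch below uses two short made-up sequences and a hand-written mark line, so the numbers illustrate only the arithmetic; the function names are hypothetical.

```python
def ungapped_length(aligned_sequence):
    """Number of residues in an aligned sequence, ignoring gap dashes."""
    return len(aligned_sequence.replace("-", ""))


def identity_and_similarity(marks, n1):
    """Return (percent identity, percent similarity) relative to n1 residues.

    marks: the line of '*', ':' and '.' characters below the alignment
    n1:    number of residues in the shorter of the two sequences
    """
    n_identical = marks.count("*")                                   # nI
    n_similar = n_identical + marks.count(":") + marks.count(".")    # nS
    return 100 * n_identical / n1, 100 * n_similar / n1


if __name__ == "__main__":
    # Made-up aligned pair and its mark line (not real protease sequences).
    seq_a = "ACDEF-GHIK"
    seq_b = "ACDQFWGHLK"
    marks = "***:* **:*"
    n1 = min(ungapped_length(seq_a), ungapped_length(seq_b))
    identity, similarity = identity_and_similarity(marks, n1)
    print(f"n1 = {n1}, identity = {identity:.1f} %, similarity = {similarity:.1f} %")
```

Here nI = 7, nS = 9 and n1 = 9, so the identity is about 77.8 % and the similarity 100 %. Note that dividing by the shorter sequence, as the exercise prescribes, gives higher percentages than dividing by the full alignment length would.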
In this exercise you have learned:
- About retroviruses, such as HIV-1, HIV-2, SIV, M-PMV, RSV (ASV), FIV, EIAV, HTLV
- About aspartic proteases, and about retroviral proteases in particular
- About global bioinformatics-related databases and servers, such as ExPASy, UniProt, Swiss-Prot, TrEMBL, PDB, Entrez, PubMed
- About some important bioinformatics-related institutions, such as NIH, NCBI, SIB, EBI, EMBL, RCSB
- About some programs for protein sequence analysis, such as Blast, Fasta, and ClustalW
Can you spell out and explain the abbreviations in the above list?
- From the website of FEBS Journal, download the educational article Protein crystallography for non-crystallographers, or how to get the best (but not more) from published macromolecular structures
(vol. 275, 1-21 (2008)) (also available here) and read it in preparation for the next Practical classes.
If you have finished your assignments quickly, take a few moments and test your knowledge of molecular biology by taking Swiss-Quiz from the ExPASy website. Alternatively, or in
addition, you may try my PX (Protein Crystallography) Quiz. Unfortunately, it is a mixture of English and Polish questions.
Last modification: December 11, 2010 (MJ)