
Implications of Primary Structure

Primary Structure Determines All Higher Levels of Structure

The linear polypeptide chain folds into a particular arrangement, giving a three-dimensional structure. Proteins unfolded in vitro fold back to their original ("native") state when solution conditions are returned to those in which the folded protein exists. All the information for the native fold therefore appears to be contained within the primary structure: proteins are self-folding (although in vivo, polypeptide folding is often assisted by additional molecules known as molecular chaperones).


There is a practically infinite number of different possible primary structures; this is the basis for the great diversity of three-dimensional structures, and functions, of proteins. Consider how many different primary structures are possible for a polypeptide of 200 residues (which is in fact a relatively small polypeptide in terms of those found in nature; polypeptides of over 1000 residues exist).
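As a rough check on the scale involved, the count can be worked out directly. The sketch below assumes that each position can hold any of the 20 standard amino acids; the figure of about 10^80 particles is a commonly quoted order-of-magnitude estimate, included here only as an assumption for comparison.

```python
# Number of distinct primary structures for a chain of a given length,
# assuming any of the 20 standard amino acids can occupy each position.
AMINO_ACIDS = 20
CHAIN_LENGTH = 200

n_sequences = AMINO_ACIDS ** CHAIN_LENGTH  # exact Python integer
n_digits = len(str(n_sequences))

# Assumption: ~10**80 particles, a commonly quoted rough estimate.
PARTICLES_ESTIMATE = 10 ** 80

print(f"20^{CHAIN_LENGTH} has {n_digits} digits")
print("exceeds the particle estimate:", n_sequences > PARTICLES_ESTIMATE)
```

The result is a 261-digit number, roughly 1.6 x 10^260.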

From the size of the number above (how does it compare with the number of particles in the known universe?), it is apparent that only a vanishingly small fraction of the possible primary structures exist (or have ever existed).

As a corollary, it is very unlikely that two proteins with similar amino acid sequences have evolved independently. Such similarities therefore indicate that the two proteins must be related and share a common ancestor. Related proteins are termed homologous.

Over evolutionary time-spans, proteins mutate: i.e. their primary structure becomes altered, generally by one amino acid at a time (although more drastic single modifications can also occur). Such alterations are caused by mutations in the genes (linear sequences of nucleotides) which encode them. The storage of genetic information, and how it is translated into protein primary structure, is covered in a later section of this course.

Point mutations (the substitution of one amino acid for another) are not the only changes that occur; a protein sequence may also lose some of its amino acids (deletion mutation) or have amino acids inserted (insertion mutation).

If two primary sequences are more than approximately 20% identical (making reasonable allowance for insertions and deletions) then they are assumed to be homologous.
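A minimal sketch of how percentage identity might be computed for two sequences that have already been aligned is given below. The choice to skip gapped columns when counting is an assumption (other conventions for the denominator exist), and the two fragments are hypothetical, not real proteins.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over the columns without a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # columns with an insertion/deletion are skipped
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Hypothetical aligned fragments; '-' marks a gap.
seq_a = "MKTAYIAK-QR"
seq_b = "MKSAYLAKEQR"
print(percent_identity(seq_a, seq_b))
```

Here 8 of the 10 compared columns are identical, which is well above the approximate 20% threshold mentioned above.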

The fact that two sequences which are to be compared may be of different lengths, and the need to allow for deletions and insertions, make the optimal alignment of the sequences (that alignment which gives the closest match, i.e. the smallest number of differences) a difficult task.
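One standard way of finding an optimal alignment is dynamic programming. The sketch below uses the Needleman-Wunsch global alignment scheme, which is a technique chosen here for illustration and not one named in the course text; the match, mismatch and gap scores are arbitrary assumptions.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning residue a[i-1] with residue b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best_score, top, bottom = global_align("ACGT", "AGT")
print(best_score)
print(top)
print(bottom)
```

The table has (n+1) x (m+1) cells, so the work grows with the product of the two sequence lengths. When several alignments share the best score, this sketch returns only one of them.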

Generally, a particular type of protein has the same, or a very similar, sequence within one species of organism. However, there are cases of polymorphism, where several different functional sequences exist for a given type of protein within the population.

The Three-dimensional Structure of Proteins is More Highly Conserved than the Primary Structure

The 3D structure is stabilized by a multitude of specific interactions between the various chemical groups present.

If the differences between two homologous proteins are examined, a general tendency is observed for chemically similar amino acid residues to be found at the same position. The substitution, for example, of one acidic residue for another (e.g. Glu for Asp) is likely to be of less consequence to the interactions with nearby residues than would the substitution of Glu for Val, a hydrophobic residue.

This tendency is summarized in the Dayhoff matrix.

Mutations to dissimilar residues are more likely to lead to the 3D conformation being less stable, or even to the inability of the mutant polypeptide chain to ever fold. In such cases, the function of the protein is therefore impaired or disabled, which is likely to disadvantage the organism to some degree or other. The result is that such mutations tend to be lost from the population; they are selected against, while the 'neutral', or even advantageous, mutations persist (are 'fixed'). Of course, some mutations would be expected to be favourable, by altering the 3D structure such that it functions more efficiently.

Consequently, homologous proteins have similar 3D structures: the differences in primary structure do not result in a drastic rearrangement of the folded conformation. If they did, they would in most cases disappear from the population.

In the same way that different primary structures give rise to similar three-dimensional folds, different gene sequences can result in the same primary structures, as will become clear later. Thus, a relative scale of conservation of structure is as follows:

genes < protein primary structure < three-dimensional protein structure
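The first step of this scale, that different gene sequences can encode the same primary structure, can be illustrated with a small sketch. The codon assignments shown are from the standard genetic code, but the table is deliberately partial (only the codons needed here), and the two gene fragments are hypothetical.

```python
# Partial codon table (standard genetic code); only the codons used below.
CODON_TABLE = {
    "ATG": "Met",
    "GAA": "Glu", "GAG": "Glu",
    "GAT": "Asp", "GAC": "Asp",
    "GTT": "Val", "GTC": "Val", "GTA": "Val", "GTG": "Val",
}


def translate(dna):
    """Translate a DNA string codon by codon, ignoring any trailing bases."""
    usable = len(dna) - len(dna) % 3
    return [CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3)]


# Hypothetical gene fragments that differ at three nucleotide positions.
gene_1 = "ATGGAAGATGTT"
gene_2 = "ATGGAGGACGTG"

n_differences = sum(1 for x, y in zip(gene_1, gene_2) if x != y)
print(translate(gene_1))
print(translate(gene_2))
print("nucleotide differences:", n_differences)
```

Both fragments translate to Met-Glu-Asp-Val even though a quarter of their nucleotides differ.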

The semiempirical Dayhoff matrix of similarity indices can be used in the alignment of two sequences, in order to detect homology. In such an alignment, the substitution of an amino acid by a 'dissimilar' residue incurs a larger penalty than the alignment of two 'similar' residues.
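The idea of penalising dissimilar substitutions more heavily can be sketched with a toy scoring scheme. The grouping of residues and the scores below are crude illustrative assumptions and are not the values of the Dayhoff matrix, which assigns a separate index to every pair of amino acids.

```python
# Hypothetical chemical groupings (one-letter codes); a crude stand-in
# for a real similarity matrix.
GROUPS = {
    "acidic": "DE",
    "basic": "KRH",
    "hydrophobic": "AVLIMFW",
    "polar": "STNQYC",
    "special": "GP",
}
GROUP_OF = {res: name for name, members in GROUPS.items() for res in members}


def similarity(a, b):
    """Toy similarity index: identical > same chemical group > dissimilar."""
    if a == b:
        return 2
    if GROUP_OF[a] == GROUP_OF[b]:
        return 1
    return -1


def score_alignment(aligned_a, aligned_b, gap_penalty=-2):
    """Sum the similarity indices over the columns of an alignment."""
    total = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            total += gap_penalty
        else:
            total += similarity(a, b)
    return total


print(similarity("D", "E"))  # Asp vs Glu: both acidic
print(similarity("E", "V"))  # Glu vs Val: acidic vs hydrophobic
print(score_alignment("MDK", "MEK"), score_alignment("MDK", "MVK"))
```

With these assumed scores, the conservative Asp/Glu substitution leaves the alignment score higher than the Asp/Val substitution does, which mirrors the penalty described above.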

A note on nomenclature

During evolution, genes (the nucleotide sequences that code for protein sequences) become cut, rearranged, joined together and duplicated. As a result, some gene products (i.e. proteins) contain sections with homology to several other proteins. Such a protein may be described as "partially homologous" to a number of related structures. For example, the low-density lipoprotein (LDL) receptor is "partially homologous" both to C9 (a component of the complement cascade of the immune system) and to epidermal growth factor (EGF).

In the above case, a sub-sequence of the LDL-receptor is homologous to a sub-sequence of C9, and a different sub-sequence of the LDL-receptor is homologous to a sub-sequence of EGF. Strictly, two sequences, or two subsequences, may be either homologous, or they may be non-homologous; there are degrees of similarity between sequences, but no degrees of homology.

Convergent Evolution

Similar structural/functional designs appear to have evolved independently in some instances. For example, the active sites of two families of functionally similar enzymes contain a specific arrangement of Ser, Asp and His residues. These families are the trypsin family (mammalian serine proteases) and the subtilisins (bacterial serine proteases). Even though both utilise a similar mechanism dependent on this arrangement, the two families are unrelated, i.e. their primary structures exhibit no evidence of homology.

The bacterial and mammalian lines diverged from a common ancestor very early in evolution, and this is seen as evidence for convergent evolution. On the other hand, for some proteins (for example, the hormone relaxin) the similarities between mammalian sequences have been found to be less than those with the corresponding proteins from a different class. This is seen as evidence for the swapping of functional "microdomains" very early on in evolution, and challenges the current paradigm (Schwabe, C., 1986, Trends in Biochemical Sciences 11, 280-283).

The Importance of Establishing Protein Sequence Homologies

One reason that establishing homology is so important is that it is a major step towards determining the likely three-dimensional structure, and function, of an amino acid sequence.

In practice, primary structure is in fact more easily determined by interpreting a gene sequence of nucleotides (with reference to the genetic code), if it is known, than directly from a purified protein itself. The genetic code, and its translation, will be examined in a later section of the course.

The recent advances in recombinant DNA technology have led to an explosion in the number of gene sequences known from many organisms. Analysing these sequences to determine whether any are homologous to sequences of known structure allows possible structures and functions to be predicted.

Further Reading

The VSNS-Biocomputing Division course on sequence analysis offers an in-depth treatment of the computational approaches to this field. Although these techniques are beyond the scope of the PPS course, we hope that those of you who are particularly interested in the subject will feel encouraged to take part in future classes of the BCD course. We are also fortunate in having several of the organizers and authors of the BCD course as PPS consultants.

VSNS-BCD Homepage and Hypertext Coursebook.


The principles of sequence alignment apply not only to protein sequences but also to nucleotide sequences of DNA and RNA. You should be aware of the aims of pairwise alignment, and of multiple alignment (optimal alignment of more than 2 sequences); in addition you should understand the principle of 'evolutionary distances' between homologous proteins, which can be calculated from differences in their sequence. We shall be returning to the subject of sequence analysis, and the databases involved, later on in the course.
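As a minimal sketch of how an evolutionary distance might be calculated from sequence differences, the example below computes the proportion of differing sites (often called the p-distance) and applies the Poisson correction, d = -ln(1 - p), which allows for repeated substitutions at the same site. This is one common simple model, chosen here as an assumption and not necessarily the one used later in the course; the fragments are hypothetical.

```python
import math


def p_distance(aligned_a, aligned_b, gap="-"):
    """Proportion of differing residues over the columns without a gap."""
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        raise ValueError("no comparable positions")
    differences = sum(1 for a, b in pairs if a != b)
    return differences / len(pairs)


def poisson_distance(p):
    """Poisson-corrected distance; only defined for p < 1."""
    return -math.log(1.0 - p)


# Hypothetical aligned fragments; '-' marks a gap.
p = p_distance("MKTAYIAK-QR", "MKSAYLAKEQR")
d = poisson_distance(p)
print(p, round(d, 3))
```

The corrected distance is always at least as large as the observed proportion of differences, and the gap between the two widens as the sequences diverge.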

Last updated 9th Oct'96