PRIMARY STRUCTURE

Last updated 23rd April 1995

PRIMARY STRUCTURE OF PROTEINS

Nomenclature
Databases
The Amino Acids
and Their Properties
We hope to include here all the interesting facts that the groups have assembled about their amino acid.
So far we have Ala(A), Glu(E), Gly(G), Ile(I), Lys(K), Gln(Q), Arg(R), Met(M), Ser(S), Thr(T), Tyr(Y), Phe(F) and Val(V).
These mini-resources will be under development for the duration of the course, so keep coming back from time to time to learn more. May we encourage those groups who haven't yet managed to mount their page to do so. Please mail pps@www.cryst.bbk.ac.uk if you need help.

Nomenclature
(Genetic & Single-Letter Codes) Garry Myers has offered help here: bases, reading frames, genetic code, 20 AAs, 3- and 1-letter codes, self-assessment form to test rote learning. Note sulphur in M, C.
Phil Bourne at Columbia has elected to do this sub-section
See also the comprehensive resource put together by Ethan Benatan and Cornelius Krasel for Assignment 2: Advances in sequencing, HUGO, automatic recognition and translation of coding ORFs, protein and DNA databases, NBRF-PIR, EMBL, GENBANK, SwissProt, Japan DB, GDB, non-redundant

The Amino Acids

This is a nice resource at the Free University of Berlin

A Tutorial on the Amino Acids, by Sami Raza

Cooperman's version

A set from Alan Ward, Dept.Microbiology, University of Newcastle upon Tyne, UK

& Their Properties

Re-emphasise single letter code (more sensitive alignment)

0.Chemical Structure

Diagrams, note hydrogens, acid + base, RASMOL samples, peculiarity of Glycine and Proline, rote learning

1.Size

smallest to largest, relationship to acceptance of mutations, influence on flexibility for Glycine

2.Charge

D, E negative; H, K, R positive. Effect of pH, i.e. pKa (esp. H). H-bonding & solvent. Salt-bridges.

3.Hydrophobic

A, V, L, I, M definitely. Simple meaning of term. Also F, P, W, C, Y, T, G, and R to some extent; each needs discussing.

4.Aromatic

F, Y, W - chemical stability of benzene ring, electron delocalisation, ring currents(?), interactions to pi clouds.

5.Polar

N, Q, S, T, also Y, W - hydrogen bonding donor/acceptor interactions to solvent, and to m/c & s/c.

6.Conformationally unusual

G - flexible due to lack of steric hindrance
P - inflexible because the s/c bonds back to the m/c nitrogen, holding phi near -60 degrees
C - often involved in disulphide bridges

8.Chirality

Point up difference between L- and D- amino acids and also that several carbon atoms within side chains are chiral. Tetrahedral geometry

9.The Disulphide Bond

Explain that nearby cysteines can be oxidised to form disulphides, which may improve stability. Bridges are common in extracellular proteins (and ER lumen) but rare in cytosolic proteins. Converse is true for isolated thiols. Summarise geometry. Refer to thiol-redox mechanisms & proteins. ?canonical disulphide in Igs. Cross-linking.

Prediction from Sequence

Considerable effort has been expended in trying to predict secondary and even tertiary structure given just the primary sequence of a protein. Some of this work has achieved moderate success. One of the earliest, and most important, attempts at secondary structure prediction was that of Chou and Fasman.

NOMENCLATURE

Bases, Nucleosides, and Nucleotides

The four bases of DNA (Adenine, Guanine, Thymine, and Cytosine) or of RNA (Adenine, Guanine, Cytosine, and Uracil) are referred to as Nucleosides when they are combined with their corresponding sugar (ribose or deoxy-ribose) or as Nucleotides when combined with both sugar and phosphate. Thus, in the cytosol we find nucleoside mono-, di-, and tri-phosphates, and these are the building blocks of the polymeric nucleic acids, as well as being involved in many aspects of biochemical metabolic processes.

Adenosine tri-phosphate (ATP) is particularly important, being the main biochemical energy-storage compound and entering into many enzymic reactions to provide energy (which usually comes from the cleavage of a phosphate moiety to produce ADP).

A, G, C, T (or U in RNA) are used as single letter abbreviations for these bases, especially in the sequences of DNA (and RNA).

Single and 3-Letter Codes for Amino Acids

All proteins are polymers of the 20 naturally occurring amino acids. They are listed here along with their abbreviations :-

Alanine         Ala     A
Cysteine        Cys     C
Aspartic AciD   Asp     D
Glutamic Acid   Glu     E
Phenylalanine   Phe     F
Glycine         Gly     G
Histidine       His     H
Isoleucine      Ile     I
Lysine          Lys     K
Leucine         Leu     L
Methionine      Met     M
AsparagiNe      Asn     N
Proline         Pro     P
Glutamine       Gln     Q
ARginine        Arg     R
Serine          Ser     S
Threonine       Thr     T
Valine          Val     V
Tryptophan      Trp     W
TYrosine        Tyr     Y

It is important to take the time to commit the single letter code to memory, as it is invariably used when comparing and aligning sequences of proteins.

Most are easily remembered by their initial letters. Try to invent your own mnemonics for the others. Note that Cysteine and Methionine are the only two sulphur-containing AAs.
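
For those who would rather let the computer do the rote work, the table above can be written as a small lookup. This is only a sketch in Python and not part of the original course material; the names THREE_TO_ONE, ONE_TO_THREE, to_one_letter and to_three_letter are invented here for illustration.

```python
# Three-letter to one-letter codes, copied from the table above.
THREE_TO_ONE = {
    "Ala": "A", "Cys": "C", "Asp": "D", "Glu": "E", "Phe": "F",
    "Gly": "G", "His": "H", "Ile": "I", "Lys": "K", "Leu": "L",
    "Met": "M", "Asn": "N", "Pro": "P", "Gln": "Q", "Arg": "R",
    "Ser": "S", "Thr": "T", "Val": "V", "Trp": "W", "Tyr": "Y",
}
# The reverse lookup, built from the same data so the two cannot disagree.
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}


def to_one_letter(three_letter_sequence):
    """Convert a run-together three-letter sequence, e.g. 'MetLysVal', to 'MKV'."""
    if len(three_letter_sequence) % 3 != 0:
        raise ValueError("length must be a multiple of three")
    chunks = [three_letter_sequence[i:i + 3]
              for i in range(0, len(three_letter_sequence), 3)]
    # capitalize() lets 'MET' or 'met' match the 'Met' key.
    return "".join(THREE_TO_ONE[chunk.capitalize()] for chunk in chunks)


def to_three_letter(one_letter_sequence):
    """Convert a one-letter sequence, e.g. 'MKV', to 'MetLysVal'."""
    return "".join(ONE_TO_THREE[aa] for aa in one_letter_sequence.upper())
```

For example, to_one_letter("MetLysVal") gives "MKV", and to_three_letter("MKV") gives "MetLysVal" back again.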

The Genetic Code

The relationship between the codons of nucleic acids and the amino acids for which they code is embodied in the Genetic Code (which is NOT universal, since slight variations on it are found in mitochondria and chloroplasts). The 64 possible triplets of bases in a codon, and the amino acids they code for, are shown in this table :-

First               Second Position             Third
Position   ------------------------------------ Position
  |            U(T)    C       A       G         |
  
  U(T)        Phe     Ser     Tyr     Cys        U(T)
              Phe     Ser     Tyr     Cys        C
              Leu     Ser     STOP    STOP       A
              Leu     Ser     STOP    Trp        G

  C           Leu     Pro     His     Arg        U(T)
              Leu     Pro     His     Arg        C
              Leu     Pro     Gln     Arg        A
              Leu     Pro     Gln     Arg        G

  A           Ile     Thr     Asn     Ser        U(T)
              Ile     Thr     Asn     Ser        C
              Ile     Thr     Lys     Arg        A
              Met     Thr     Lys     Arg        G

  G           Val     Ala     Asp     Gly        U(T)
              Val     Ala     Asp     Gly        C
              Val     Ala     Glu     Gly        A
              Val     Ala     Glu     Gly        G

Note that in most cases the first two bases are enough to specify the amino acid, the third (or wobble) base playing a minor role.

Note also the STOP codons, which cause termination of translation by the ribosome.
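
The table can also be held as a lookup from codon to amino acid. The sketch below (Python, with names invented here for illustration) builds it from a 64-letter string laid out in the same U, C, A, G order as the rows and columns above, using single-letter codes and '*' for STOP. It assumes the standard code shown in the table, not the variants mentioned earlier.

```python
from itertools import product

BASES = "UCAG"  # same order as the rows and columns of the table
# One letter per codon, first base varying slowest, '*' marking STOP.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
GENETIC_CODE = {"".join(codon): aa
                for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(sequence):
    """Translate RNA (or DNA) codon by codon, halting at the first STOP."""
    rna = sequence.upper().replace("T", "U")
    protein = []
    for i in range(0, len(rna) - 2, 3):
        aa = GENETIC_CODE[rna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def codons_for(aa):
    """All codons that code for the given one-letter amino acid ('*' for STOP)."""
    return sorted(codon for codon, a in GENETIC_CODE.items() if a == aa)
```

For example, translate("AUGGCUUAA") gives "MA". codons_for("G") returns four codons that all begin with GG, which is the minor role of the third base in action, while codons_for("W") returns only UGG.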

Different organisms exhibit different statistical preferences of triplet codon usage, as well as using the amino acids in widely varying proportions. See 'Of URFs and ORFs' by Russell Doolittle, University Science Books (1986), ISBN 0-935702-54-7.
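
Codon usage is simple to tabulate once a sequence is known to be in frame. A minimal sketch in Python follows; the function name and the short test sequence are made up for illustration, and real usage tables are compiled from many genes rather than one short stretch.

```python
from collections import Counter


def codon_usage(coding_sequence):
    """Count each triplet, reading in frame from the first base.

    Any one or two bases left over at the end are ignored.
    """
    seq = coding_sequence.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    return Counter(codons)


# A made-up stretch in which Leu is coded three times by CTG and once by TTA.
usage = codon_usage("ctgctgttactg")
```

Here usage["CTG"] is 3 and usage["TTA"] is 1, although both codons code for Leu.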

Reading Frames, URFs, and ORFs

A piece of DNA sequence may or may not code for a piece of a protein, depending on whether it's part of a gene. If we obtain a stretch of sequence experimentally from genomic DNA, then we can try to guess what it might possibly code for by using the Genetic Code to convert from bases to AAs.

However, you should appreciate that there are three possible reading frames which may be used, each one base out of step with the others, each of which may give a believable stretch of protein sequence, thus :-

          5'-acacggctgaccgatgctagaccccatagtcgcgctatatgctcgaacttgttaa-3'
may code for 
          5'-acacggctgaccgatgctagaccccatagtcgcgctatatgctcgaacttgttaa-3'
             ThrArgLeuThrAspAlaArgProHisSerArgAlaIleCysSerAsnLeuLeu

or
          5'-acacggctgaccgatgctagaccccatagtcgcgctatatgctcgaacttgttaa-3'
              HisGlySTPProMetLeuAspProIleValAlaLeuTyrAlaArgThrCysSTP

or
          5'-acacggctgaccgatgctagaccccatagtcgcgctatatgctcgaacttgttaa-3'
               ThrAlaAspArgCysSTPThrProSTPSerArgTyrMetLeuGluLeuVal

Indeed, if this just happens to be the complementary strand, rather than the coding strand, then there are another three reading frames, making six in all.

Notice that only one of the sequences shown has NO STOP CODONS - this MAY indicate it's a coding sequence. It's called an Open Reading Frame (ORF).
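
The three translations can be reproduced mechanically. The sketch below (Python; all names are invented here for illustration, and the standard genetic code is assumed) translates each forward frame, forms the complementary strand for the other three frames, and reports which frames contain no STOP.

```python
from itertools import product

BASES = "TCAG"
# One letter per codon, first base varying slowest, '*' marking STOP.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

ONE_TO_THREE = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu",
    "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg",
    "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp", "Y": "Tyr",
    "*": "STP",  # STOP, written as in the example above
}


def translate_frame(dna, offset):
    """One-letter translation of dna read from position offset (0, 1 or 2)."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]]
                   for i in range(offset, len(dna) - 2, 3))


def as_three_letter(protein):
    """Render a one-letter translation in the three-letter style used above."""
    return "".join(ONE_TO_THREE[aa] for aa in protein)


def reverse_complement(dna):
    """The complementary strand, written 5' to 3'."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna):
    """Three forward frames followed by three frames of the complementary strand."""
    forward = [translate_frame(dna, k) for k in range(3)]
    backward = [translate_frame(reverse_complement(dna), k) for k in range(3)]
    return forward + backward


def open_frames(frames):
    """Indices of the translations that contain no STOP codon."""
    return [i for i, protein in enumerate(frames) if "*" not in protein]


SEQ = "acacggctgaccgatgctagaccccatagtcgcgctatatgctcgaacttgttaa"
for number, protein in enumerate(six_frames(SEQ)[:3], start=1):
    print(number, as_three_letter(protein))
```

Run on the example sequence, open_frames(six_frames(SEQ)[:3]) returns [0]: of the three forward frames, only the first is free of STOP codons. A frame without a STOP in so short a stretch is weak evidence on its own, which is why the MAY above is in capitals.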

There are programs in the Staden package and elsewhere that can use clues like this, and other more sophisticated statistical measures, to find coding stretches in DNA sequences. When such stretches are first found there's usually considerable doubt about which gene, if any, they belong to.

They are referred to as Unidentified Reading Frames (URFs).

am 6th.Feb'95
