Re-emphasise single letter code (more sensitive alignment)
- 0.Chemical Structure
- Diagrams, note hydrogens, acid + base, RASMOL samples,
peculiarity of Glycine and Proline, rote learning
- 1.Size
- smallest to largest, relationship to acceptance of
mutations, influence on flexibility for Glycine
- 2.Charge
- D, E negative; H, K, R positive. Effect of pH,
i.e. pKa (esp. H). H-bonding & solvent. Salt-bridges.
- 3.Hydrophobic
- A, V, L, I, M definitely. Simple meaning of term.
Also F, P, W, C, Y, T, G, and R to some extent; each
needs discussing.
- 4.Aromatic
- F, Y, W - chemical stability of benzene ring, electron
delocalisation, ring currents(?), interactions to pi
clouds.
- 5.Polar
- N, Q, S, T, also Y, W - hydrogen bonding donor/acceptor
interactions to solvent, and to m/c & s/c.
- 6.Conformationally unusual
- G - flexible due to lack of steric hindrance
- P - inflexible due to s/c bonding back phi~-60 degrees
- C - often involved in disulphide bridges
- 8.Chirality
- Point up difference between L- and D- amino acids
and also that several carbon atoms within side chains are
chiral. Tetrahedral geometry
- 9.The Disulphide Bond
- Explain that nearby cysteines can be oxidised to form
disulphides, which may improve stability. Bridges are
common in extracellular proteins (and ER lumen) but rare in
cytosolic proteins. Converse is true for isolated thiols.
Summarise geometry. Refer to thiol-redox mechanisms &
proteins. ?canonical disulphide in Igs. Cross-linking.
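The groupings in the outline above can be held as sets of single letter codes, which makes overlaps (for example Y being both aromatic and polar) easy to see. This is only a sketch: the names GROUPS and groups_of are illustrative, and only the residues the outline lists unambiguously are included (the "to some extent" hydrophobic residues are left out because each needs discussing).

```python
# Residue groupings taken from the outline above (single letter codes).
# Only the "definite" hydrophobic residues are listed; borderline ones are omitted.
GROUPS = {
    "negative": set("DE"),
    "positive": set("HKR"),
    "hydrophobic": set("AVLIM"),
    "aromatic": set("FYW"),
    "polar": set("NQST") | set("YW"),
    "conformationally unusual": set("GPC"),
}


def groups_of(residue):
    """Return the sorted names of every group that contains this residue."""
    code = residue.upper()
    return sorted(name for name, members in GROUPS.items() if code in members)


if __name__ == "__main__":
    for code in "DYGA":
        print(code, groups_of(code))
```

A residue that falls in none of these sets simply returns an empty list, which is a reminder that the classification is a teaching aid rather than a complete scheme.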
Prediction from Sequence
Considerable effort has been expended in trying to predict
secondary and even tertiary structure given just the primary sequence
of a protein. Some of this work has achieved moderate success. One of
the earliest, and most important, attempts at secondary structure
prediction was that of Chou and Fasman.
NOMENCLATURE
Bases, Nucleosides, and Nucleotides
The four bases of DNA (Adenine, Guanine, Thymine, and Cytosine) or of
RNA (Adenine, Guanine, Cytosine, and Uracil) are referred to as Nucleosides when
they are combined with their corresponding sugar (ribose or deoxy-ribose) or as Nucleotides
when combined with both sugar and phosphate. Thus, in the cytosol we find nucleoside
mono-, di-, and tri-phosphates, and these are the building blocks of the polymeric
nucleic acids, as well as being involved in many aspects of metabolism.
Adenosine tri-phosphate (ATP) is particularly important, being the main biochemical energy storage compound,
entering into many enzymic reactions to provide energy (which usually comes from the cleavage of
a phosphate moiety to produce ADP).
A, G, C, T (or U in RNA) are used as single letter abbreviations for these bases, especially in the sequences of
DNA (and RNA).
Single and 3-Letter Codes for Amino Acids
All proteins are polymers of the 20 naturally occurring amino acids. They are listed here along with their
abbreviations :-
Alanine Ala A
Cysteine Cys C
Aspartic AciD Asp D
Glutamic Acid Glu E
Phenylalanine Phe F
Glycine Gly G
Histidine His H
Isoleucine Ile I
Lysine Lys K
Leucine Leu L
Methionine Met M
AsparagiNe Asn N
Proline Pro P
Glutamine Gln Q
ARginine Arg R
Serine Ser S
Threonine Thr T
Valine Val V
Tryptophan Trp W
TYrosine Tyr Y
It is important to take the time to commit the single letter code to memory, as it is
invariably used when comparing and aligning sequences of proteins.
Most are easily remembered by their initial letters. Try to invent your own mnemonics to remember the others.
Note that Cysteine and Methionine are the only two sulphur-containing AAs.
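While learning the code it can help to convert between the two notations mechanically. The following is a minimal sketch whose mapping is copied from the list above; the names THREE_TO_ONE, to_one_letter and to_three_letter are illustrative only.

```python
# Three-letter to single letter codes, copied from the list above.
THREE_TO_ONE = {
    "Ala": "A", "Cys": "C", "Asp": "D", "Glu": "E", "Phe": "F",
    "Gly": "G", "His": "H", "Ile": "I", "Lys": "K", "Leu": "L",
    "Met": "M", "Asn": "N", "Pro": "P", "Gln": "Q", "Arg": "R",
    "Ser": "S", "Thr": "T", "Val": "V", "Trp": "W", "Tyr": "Y",
}
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}


def to_one_letter(three_letter_seq):
    """Convert a run of three-letter codes, e.g. 'HisGlyPro', to 'HGP'."""
    if len(three_letter_seq) % 3 != 0:
        raise ValueError("length must be a multiple of three")
    codes = [three_letter_seq[i:i + 3] for i in range(0, len(three_letter_seq), 3)]
    return "".join(THREE_TO_ONE[code] for code in codes)


def to_three_letter(one_letter_seq):
    """Convert single letter codes, e.g. 'MK', to 'MetLys'."""
    return "".join(ONE_TO_THREE[code] for code in one_letter_seq.upper())


if __name__ == "__main__":
    print(to_one_letter("HisGlyPro"))
    print(to_three_letter("MK"))
```

An unknown code raises a KeyError, which seems preferable here to silently guessing a residue.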
The Genetic Code
The relationship between the codons of nucleic acids, and the amino acids for which they code, is
embodied in the Genetic Code (which is NOT universal, since slight variations on it are found
in mitochondria and chloroplasts). The 64 possible triplets of bases in a codon, and the amino acid coded for,
are shown in this table :-
First                 Second Position               Third
Position    ------------------------------------    Position
            U(T)     C        A        G

U(T)        Phe      Ser      Tyr      Cys          U(T)
            Phe      Ser      Tyr      Cys          C
            Leu      Ser      STOP     STOP         A
            Leu      Ser      STOP     Trp          G

C           Leu      Pro      His      Arg          U(T)
            Leu      Pro      His      Arg          C
            Leu      Pro      Gln      Arg          A
            Leu      Pro      Gln      Arg          G

A           Ile      Thr      Asn      Ser          U(T)
            Ile      Thr      Asn      Ser          C
            Ile      Thr      Lys      Arg          A
            Met      Thr      Lys      Arg          G

G           Val      Ala      Asp      Gly          U(T)
            Val      Ala      Asp      Gly          C
            Val      Ala      Glu      Gly          A
            Val      Ala      Glu      Gly          G
Note that in most cases the amino acid is largely determined by the first
two bases, the third (or wobble) base playing a minor role.
Note also the STOP codons, which cause termination of translation by the ribosome.
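The table can be written compactly as a 64-letter string of single letter codes, with the bases taken in the same order as the rows and columns above (T, C, A, G) and '*' standing for STOP. The sketch below builds a lookup from that string and also counts the first-two-base pairs for which the third base makes no difference at all. The names CODON_TABLE, translate and fourfold_families are illustrative, and the code assumes the standard code shown in the table, not the mitochondrial or other variants.

```python
from itertools import product

BASES = "TCAG"  # same order as the rows and columns of the table above
# One letter per codon, third base varying fastest; '*' marks a STOP codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq, frame=0):
    """Translate DNA (or RNA) starting at offset `frame`; a trailing partial codon is ignored."""
    dna = seq.upper().replace("U", "T")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(frame, len(dna) - 2, 3))


def fourfold_families():
    """First-two-base pairs where all four third bases give the same amino acid."""
    pairs = ("".join(p) for p in product(BASES, repeat=2))
    return [p for p in pairs if len({CODON_TABLE[p + b] for b in BASES}) == 1]


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))
    print(fourfold_families())
```

Counting the fully degenerate pairs is one simple way to put a number on the remark about the wobble base; the remaining pairs are those where the third base does change the meaning.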
Different organisms exhibit different statistical preferences of triplet codon usage, as well
as using the amino acids in widely varying proportions. See 'Of URFs and ORFs'
by Russell Doolittle, University Science Books (1986) ISBN 0-935702-54-7.
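Codon preferences of this kind are usually estimated by simply tallying triplets in sequences already known to be coding. A minimal sketch follows, assuming the input is in frame from its first base; the function names codon_usage and codon_fractions are illustrative.

```python
from collections import Counter


def codon_usage(coding_dna):
    """Count each triplet of an in-frame coding sequence (trailing partial codon ignored)."""
    seq = coding_dna.upper().replace("U", "T")
    return Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))


def codon_fractions(coding_dna):
    """Express the counts as fractions of all codons seen."""
    counts = codon_usage(coding_dna)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


if __name__ == "__main__":
    # Three Leu codons: CTG used twice, TTA once.
    print(codon_usage("ctgctgtta"))
    print(codon_fractions("ctgctgtta"))
```

Meaningful preferences only emerge from many genes of one organism; a single short sequence like the one in this usage example says very little.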
Reading Frames, URFs, and ORFs
A piece of DNA sequence may or may not code for part of a protein, depending on whether
it is part of a gene. If we obtain a stretch of sequence experimentally from genomic
DNA, then we can try to guess what it might possibly code for by using the Genetic Code to
convert from bases to AAs. However, you should appreciate that there are three possible
reading frames which may be used, each one base out of step with the others, each of which
may give a believable stretch of protein sequence, thus :-
5'-acacggctgaccgatgctagaccccatagtcgcgctatatgctcgaacttgttaa-3'
may code for
5'-acacggctgaccgatgctagaccccatagtcgcgctatatgctcgaacttgttaa-3'
   ThrArgLeuThrAspAlaArgProHisSerArgAlaIleCysSerAsnLeuLeu
or
5'-acacggctgaccgatgctagaccccatagtcgcgctatatgctcgaacttgttaa-3'
    HisGlySTPProMetLeuAspProIleValAlaLeuTyrAlaArgThrCysSTP
or
5'-acacggctgaccgatgctagaccccatagtcgcgctatatgctcgaacttgttaa-3'
     ThrAlaAspArgCysSTPThrProSTPSerArgTyrMetLeuGluLeuVal
Indeed, if this just happens to be the complementary strand, rather than the coding strand, then
there are another three reading frames, making six in all.
Notice that only one of the sequences shown has NO STOP CODONS - this MAY indicate it's a
coding sequence. It's called an Open Reading Frame (ORF).
There are programs in the Staden
package and elsewhere that can use clues like this, and other more sophisticated
statistical measures, to find coding stretches in DNA sequences. When such stretches
are first found there's usually considerable doubt about which gene, if any, they belong to.
They are referred to as Unidentified Reading Frames (URFs).
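The idea of scanning all six frames for stretches free of STOP codons can be sketched in a few lines. This is not how the Staden programs work internally, merely an illustration of the simplest clue mentioned above; it assumes the standard genetic code, and the names six_frames and open_frames are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard code, one letter per codon in T, C, A, G order; '*' marks STOP.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the complementary strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna, frame):
    """Translate upper-case DNA from offset `frame`, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(frame, len(dna) - 2, 3))


def six_frames(dna):
    """Translate frames +1..+3 of the given strand and -1..-3 of its complement."""
    forward = dna.upper()
    reverse = reverse_complement(forward)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward, offset)
        frames[f"-{offset + 1}"] = translate(reverse, offset)
    return frames


def open_frames(dna):
    """Names of the frames whose translation contains no STOP codon."""
    return [name for name, protein in six_frames(dna).items() if "*" not in protein]


if __name__ == "__main__":
    seq = "acacggctgaccgatgctagaccccatagtcgcgctatatgctcgaacttgttaa"
    for name, protein in six_frames(seq).items():
        print(name, protein)
    print("open:", open_frames(seq))
```

On a stretch as short as the example the absence of a STOP is weak evidence, which is why real programs combine it with statistical measures such as codon usage.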
am 6th.Feb'95