The Genetic Code

The relationship between the codons of nucleic acids and the amino acids for which they code is embodied in the Genetic Code (which is NOT universal, since slight variations on it are found in mitochondria and chloroplasts). The 64 possible triplets of bases in a codon, and the amino acid each one codes for, are shown in this table :-

First               Second Position             Third
Position   ------------------------------------ Position
  |            U(T)    C       A       G         |
  
  U(T)        Phe     Ser     Tyr     Cys        U(T)
              Phe     Ser     Tyr     Cys        C
              Leu     Ser     STOP    STOP       A
              Leu     Ser     STOP    Trp        G

  C           Leu     Pro     His     Arg        U(T)
              Leu     Pro     His     Arg        C
              Leu     Pro     Gln     Arg        A
              Leu     Pro     Gln     Arg        G

  A           Ile     Thr     Asn     Ser        U(T)
              Ile     Thr     Asn     Ser        C
              Ile     Thr     Lys     Arg        A
              Met     Thr     Lys     Arg        G

  G           Val     Ala     Asp     Gly        U(T)
              Val     Ala     Asp     Gly        C
              Val     Ala     Glu     Gly        A
              Val     Ala     Glu     Gly        G

Note that in many cases the amino acid is fully specified by the first two bases, the third (or wobble) base playing a minor role.

Note also the STOP codons, which cause termination of translation by the ribosome.
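
As an illustration, the table can be written down as a simple lookup from codon to amino acid. The Python sketch below is not taken from any particular package: the names BASES, TABLE_ROWS and GENETIC_CODE are made up for this example, it uses the DNA alphabet (T in place of U), and it writes STOP as STP. It then lists the STOP codons and the pairs of leading bases for which the third base makes no difference at all.

```python
# A sketch of the table above as a Python dictionary (DNA alphabet, T for U).
# All names are illustrative; "STP" stands for a STOP codon.
BASES = "TCAG"

# One string per table row: columns are the second position (T, C, A, G);
# within each block of four rows the third position runs T, C, A, G.
TABLE_ROWS = [
    "Phe Ser Tyr Cys", "Phe Ser Tyr Cys", "Leu Ser STP STP", "Leu Ser STP Trp",
    "Leu Pro His Arg", "Leu Pro His Arg", "Leu Pro Gln Arg", "Leu Pro Gln Arg",
    "Ile Thr Asn Ser", "Ile Thr Asn Ser", "Ile Thr Lys Arg", "Met Thr Lys Arg",
    "Val Ala Asp Gly", "Val Ala Asp Gly", "Val Ala Glu Gly", "Val Ala Glu Gly",
]

GENETIC_CODE = {}
for i, row in enumerate(TABLE_ROWS):
    first, third = BASES[i // 4], BASES[i % 4]
    for second, amino_acid in zip(BASES, row.split()):
        GENETIC_CODE[first + second + third] = amino_acid

stop_codons = sorted(c for c, aa in GENETIC_CODE.items() if aa == "STP")

# First-two-base prefixes for which the third base makes no difference.
third_base_irrelevant = [
    a + b for a in BASES for b in BASES
    if len({GENETIC_CODE[a + b + c] for c in BASES}) == 1
]

print(stop_codons)            # ['TAA', 'TAG', 'TGA']
print(third_base_irrelevant)  # ['TC', 'CT', 'CC', 'CG', 'AC', 'GT', 'GC', 'GG']
```

For the remaining prefixes the third base still matters, although usually only as far as distinguishing U/C from A/G.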

Different organisms exhibit different statistical preferences in triplet codon usage, and also use the amino acids in widely varying proportions. See 'Of URFs and ORFs' by Russell Doolittle, University Science Books (1986), ISBN 0-935702-54-7.
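
Codon usage of this kind can be tallied by counting the triplets of a sequence in a known reading frame. The sketch below is only an illustration: the function name codon_usage and the toy sequence are invented here, and it assumes you already know which frame is the coding one.

```python
from collections import Counter

def codon_usage(sequence, frame=0):
    """Count each complete codon in one reading frame (frame = 0, 1 or 2)."""
    seq = sequence.upper().replace("U", "T")
    # Stop at len(seq) - 2 so that a trailing partial codon is ignored.
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return Counter(codons)

usage = codon_usage("atgctgctgttataa")  # made-up toy sequence
for codon, count in usage.most_common():
    print(codon, count)
```

Comparing such counts between genes from different organisms is one simple way of seeing the preferences mentioned above.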

Reading Frames, URFs, and ORFs

A piece of DNA sequence may or may not code for a piece of a protein, depending on whether it's part of a gene. If we obtain a stretch of sequence experimentally from genomic DNA, then we can try to guess what it might possibly code for by using the Genetic Code to convert from bases to amino acids.

However, you should appreciate that there are three possible reading frames which may be used, each one base out of step with the others, each of which may give a believable stretch of protein sequence, thus :-

          5'-acacggctgaccgatgctagaccccatagtcgcgctatatgctcgaacttgttaa-3'
may code for 
          5'-acacggctgaccgatgctagaccccatagtcgcgctatatgctcgaacttgttaa-3'
             ThrArgLeuThrAspAlaArgProHisSerArgAlaIleCysSerAsnLeuLeu

or
          5'-acacggctgaccgatgctagaccccatagtcgcgctatatgctcgaacttgttaa-3'
              HisGlySTPProMetLeuAspProIleValAlaLeuTyrAlaArgThrCysSTP

or
          5'-acacggctgaccgatgctagaccccatagtcgcgctatatgctcgaacttgttaa-3'
               ThrAlaAspArgCysSTPThrProSTPSerArgTyrMetLeuGluLeuVal

Indeed, if the sequence we have just happens to be the complementary strand, rather than the coding strand, then there are another three reading frames to consider, making six in all.

Notice that only one of the three translations shown (the first) has NO STOP CODONS - this MAY indicate that it's a coding sequence. A stretch like this, uninterrupted by STOP codons, is called an Open Reading Frame (ORF).
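
The six-frame translation and the search for frames free of STOP codons can be sketched in a few lines of Python. This is a minimal illustration and not how the Staden package or any other program actually does it: all the function and variable names are invented for this example, the 64-letter string is the standard code written with the bases in the order T, C, A, G, STOP is written as STP, and the sequence is assumed to contain only a, c, g and t (or u).

```python
from itertools import product

# Standard genetic code, one letter per codon, bases ordered T, C, A, G
# ("*" marks a STOP codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_LETTER = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# One-letter to three-letter names, with STOP written as STP.
THREE_LETTER = dict(zip(
    "ACDEFGHIKLMNPQRSTVWY*",
    "Ala Cys Asp Glu Phe Gly His Ile Lys Leu Met Asn Pro Gln Arg Ser Thr "
    "Val Trp Tyr STP".split(),
))

def translate(sequence, frame=0):
    """Translate one reading frame (0, 1 or 2) into three-letter codes."""
    seq = sequence.upper().replace("U", "T")
    return "".join(
        THREE_LETTER[CODON_TO_LETTER[seq[i:i + 3]]]
        for i in range(frame, len(seq) - 2, 3)
    )

def reverse_complement(sequence):
    """Return the complementary strand, read 5' to 3'."""
    seq = sequence.upper().replace("U", "T")
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def six_frames(sequence):
    """Map a label such as '+1' or '-3' to the translation of that frame."""
    frames = {}
    for sign, strand in (("+", sequence), ("-", reverse_complement(sequence))):
        for frame in range(3):
            frames[f"{sign}{frame + 1}"] = translate(strand, frame)
    return frames

dna = "acacggctgaccgatgctagaccccatagtcgcgctatatgctcgaacttgttaa"
frames = six_frames(dna)
for label, protein in frames.items():
    print(label, protein)

# Frames with no STOP codon anywhere in this stretch.
open_frames = [label for label, protein in frames.items() if "STP" not in protein]
print(open_frames)  # ['+1', '-3']
```

Here "+1" to "+3" are the three frames of the sequence as given and "-1" to "-3" are those of the complementary strand. On a stretch as short as this one, a frame on the complementary strand also happens to contain no STOP codon, which is a reminder of why an open frame only MAY indicate a coding sequence.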

There are programs in the Staden package and elsewhere that can use clues like this, and other more sophisticated statistical measures, to find coding stretches in DNA sequences. When such stretches are first found there's usually considerable doubt about which gene, if any, they belong to, so they are referred to as Unidentified Reading Frames (URFs).

Last updated 11th Nov '96