  •  We have intercepted a message ...
    The completion of the sequencing of the human genome is often hailed as a great "medical" achievement. While it certainly represents an enormous triumph of mechanical effort, all we now have is a list of 3 billion letters - no drugs, no cures, no therapies! Merely having the sequence tells us nothing about the meaning of the genetic code, and neither medical nor biological science alone can supply that meaning. Understanding the genetic code is the province of information theory, and it is in making sense of the code that the real achievements and benefits for mankind will lie.

     The GeneSys Project
    The GeneSys project is a long-term research programme that brings together the disciplines of mathematics, medicine and computing. In particular, it applies powerful insights from mathematical and engineering fields such as information theory, digital signal processing, error control coding and cryptography to ab initio phenomenological discovery in the genome. The techniques under study include:

    • Linear and Non-Linear Spectral Analysis for the discovery of:
      • Structural, functional and regulatory regions
      • Reading frame alignment
      • Protein coding regions
      • Promoter sites
    • 1/f Noise Colour Analysis for the discovery of:
      • Long range structure and correlation
      • Higher level grammars
    • Coding Theory for the discovery of:
      • Statistical information measures such as rate, redundancy, etc.
      • Possible error correction mechanisms
      • Frameshift mutations (deletion)
    • Cryptography for the discovery of:
      • Underlying genetic "language"
      • Relationships to the proteome
    • Computational Linguistics for the discovery of:
      • High level structure
      • Grammar and syntax
    • Machine Learning Algorithms for the discovery of:
      • Long distance relationships
      • Predictability of sequences
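
    As a minimal sketch of the first item in the list above, the block below computes a plain discrete Fourier transform of the four base-indicator sequences and compares the power at frequency 1/3 with the average power elsewhere. Protein coding regions are widely reported to show a peak at that frequency. This is not the GeneScan algorithm; the function names, the synthetic test sequences and the 70% codon-position bias are all assumptions chosen for illustration.

```python
import cmath
import random

BASES = "acgt"

def indicator_power(seq, k):
    """Power at DFT index k, summed over the four 0/1 base-indicator sequences."""
    n = len(seq)
    total = 0.0
    for base in BASES:
        # DFT of the indicator sequence for this base, evaluated at index k only
        coeff = sum(cmath.exp(-2j * cmath.pi * k * i / n)
                    for i, s in enumerate(seq) if s == base)
        total += abs(coeff) ** 2
    return total

def period3_ratio(seq):
    """Power at frequency 1/3 divided by the mean power at the other frequencies."""
    n = len(seq) - len(seq) % 3          # trim so that n/3 is an exact DFT index
    seq = seq[:n].lower()
    peak = indicator_power(seq, n // 3)
    others = [indicator_power(seq, k) for k in range(1, n // 2) if k != n // 3]
    return peak / (sum(others) / len(others))

def synthetic(n_codons, biased, rng):
    """Hypothetical test data: 'biased' favours g in the first codon position."""
    out = []
    for _ in range(n_codons):
        if biased:
            first = rng.choices(BASES, weights=[1, 1, 7, 1])[0]
        else:
            first = rng.choice(BASES)
        out.append(first + rng.choice(BASES) + rng.choice(BASES))
    return "".join(out)

rng = random.Random(0)                   # fixed seed so the run is repeatable
coding_like = synthetic(200, True, rng)
background = synthetic(200, False, rng)
coding_ratio = period3_ratio(coding_like)
background_ratio = period3_ratio(background)
print("coding-like ratio:", round(coding_ratio, 2))
print("background ratio: ", round(background_ratio, 2))
```

    The sequence with a codon-position bias should give a ratio well above 1, while the unbiased sequence should stay close to 1. A real analysis would slide a window along the genome and plot the ratio per window.
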
     Key Results
    We have already derived statistical results for the rate and redundancy of the base nucleotides and the codon triplets. We were the first to show the key results that the genetic code is both instantaneous and optimal and that it exactly meets the Kraft-McMillan bound for such codes. We have also shown by an information-theoretic argument why nature chooses to use the 64 possible codons to code for precisely 20 amino acids. This is a highly significant result that says a lot about the optimisation of the genetic code by evolution. We have also originated the notion of an error-correction mechanism in the DNA replication process, an argument that strongly supports one of the central theses of Richard Dawkins' work on evolution.
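
    The quantities named above can be illustrated with a short sketch. It is one possible reading of the claims, not the project's own derivation: it checks that the 64 three-letter codons over a 4-letter alphabet form a prefix-free (instantaneous) set whose Kraft sum is exactly 1, and it computes entropy and redundancy using the standard Shannon definitions, which are assumed here because the page does not give its own. The toy sequence is made up.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import log2

ALPHABET = "acgt"

def kraft_sum(codewords, alphabet_size):
    """Sum of D**(-length) over all codewords; at most 1 for any uniquely decodable code."""
    return sum(Fraction(1, alphabet_size ** len(w)) for w in codewords)

def is_prefix_free(codewords):
    """True if no codeword is a prefix of a different codeword (instantaneous code)."""
    return not any(a != b and b.startswith(a) for a in codewords for b in codewords)

def entropy_bits(seq):
    """Empirical zero-order Shannon entropy in bits per symbol."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def redundancy(seq, alphabet_size=4):
    """1 - H / H_max, where H_max = log2(alphabet_size)."""
    return 1 - entropy_bits(seq) / log2(alphabet_size)

codons = ["".join(p) for p in product(ALPHABET, repeat=3)]
codon_kraft = kraft_sum(codons, len(ALPHABET))   # 64 codewords, each contributing 4**-3
codon_prefix_free = is_prefix_free(codons)

toy = "aaaaccgt"                                  # made-up sequence: p = 1/2, 1/4, 1/8, 1/8
toy_entropy = entropy_bits(toy)
toy_redundancy = redundancy(toy)

print("codons:", len(codons), "Kraft sum:", codon_kraft, "prefix-free:", codon_prefix_free)
print("toy entropy (bits/base):", toy_entropy, "redundancy:", toy_redundancy)
```

    A Kraft sum of exactly 1 means that no further codeword of any length could be added without breaking the prefix-free property, which is the sense in which a code "meets the bound".
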

    While these results are fundamental and vital, they are only the beginning.

     The Software
    We intend to implement the results of our research in powerful software tools for use by the wider genome research community. To date, we have completed the implementation of a very flexible and fast spectral analyser called GeneScan, capable of handling sequences up to 4 Gbases in length (longer than the human genome!) with a throughput of 3 Kbases per second in reading frame discovery mode and 10 Kbases per second in synchronised mode! Just look at the difference in these two spectrograms between the inconclusive results with ordinary software and the super-resolution of GeneScan!

    We also have a software tool called GeneParser which uses Formal Language Theory to parse a genome into "words". A genome is a factorizable language by definition, and GeneParser has had some extraordinary successes in isolating biologically relevant sequences such as poly_A tails and transcription start sequences (tss) as "words" in the formal genomic language. Here is an excerpt from its output when analysing the Ebola virus genome - the poly_A's and tss's are obvious:

    {tgatgaagattaa} 13 4388 3030 1358
    {gatgaagattaag} 13 5899 4389 1510
    {tgaagattaagaaaaa} 16 8289 4391 3898
    {actaatgatgaagattaa} 18 9878 3025 6853
    {attaagaaaaaa} 12 11485 5882 5603
    18960 symbols
    2979 words
    18 symbols in longest word
    6.364552 symbols in average word 
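
    The excerpt above can be read back into records with a few lines of code. The page does not document the columns, so the field names below are guesses: the first number matches the word length, and in every line shown the last number equals the second minus the third, which suggests two positions and the distance between them. The sketch only parses and cross-checks the excerpt; it does not reproduce GeneParser itself.

```python
import re

# Excerpt copied from the output shown above
SAMPLE = """\
{tgatgaagattaa} 13 4388 3030 1358
{gatgaagattaag} 13 5899 4389 1510
{tgaagattaagaaaaa} 16 8289 4391 3898
{actaatgatgaagattaa} 18 9878 3025 6853
{attaagaaaaaa} 12 11485 5882 5603
"""

LINE = re.compile(r"\{([acgt]+)\}\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")

def parse_words(text):
    """Return one dict per word line; the field names are assumed, not documented."""
    records = []
    for match in LINE.finditer(text):
        word = match.group(1)
        length, position, earlier, distance = (int(g) for g in match.groups()[1:])
        records.append({
            "word": word,
            "length": length,        # matches len(word) in the excerpt
            "position": position,    # assumed: where this occurrence was found
            "earlier": earlier,      # assumed: position of an earlier occurrence
            "distance": distance,    # equals position - earlier in the excerpt
        })
    return records

records = parse_words(SAMPLE)
lengths_ok = all(r["length"] == len(r["word"]) for r in records)
distances_ok = all(r["distance"] == r["position"] - r["earlier"] for r in records)
mean_word = 18960 / 2979             # symbols divided by words, from the summary lines

print(len(records), "records; lengths consistent:", lengths_ok,
      "; distances consistent:", distances_ok)
print("mean word length:", round(mean_word, 6))
```

    The summary lines are consistent with each other as well: 18960 symbols divided by 2979 words gives the reported average word length.
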
    
    GeneParser is still in its infancy but we believe that it may prove to be one of the more exciting tools in the bioinformatics world!


    The GeneSys project would like to thank J Craig Venter and Celera Genomics for so generously making their genome database freely available for this research. The Celera genome is by far the most comprehensive and accurate to date.


    Updated: 8 March 2003