EST/CDS Guided Identification of Genes in Human Genomic DNA
GENIO/scan flow chart


by Niels Mache

October 28, 1998

The human genome is estimated to contain approximately 60,000 protein-coding genes, and only about 2-3% of its 3 billion nucleotide pairs are protein coding. So far, only a few thousand genes have been identified. Finding genes, especially split genes, in genomic DNA is a strenuous task, even when similar gene sequences, proteins or EST/CDS sequences are known. Coding regions (exons) of genes in genomic DNA can be localized by alignment with collections of mRNA, amino acid and CDS sequences, provided the investigated sequence is significantly similar to known sequences. However, genomic sequences with only weak sequence similarity cannot be characterized by sequence alignment alone. For the analysis of DNA sequences without homologies, gene prediction programs (GRAIL, GENSCAN, GENIE, etc.) are valuable tools.
In our approach, we combine gene prediction and sequence alignment. In the prediction stage we detect DNA binding sites and signals that are specific to the eukaryotic (human) gene structure. Known sites for DNA/protein and RNA/snRNP (small nuclear ribonucleoprotein particle) interactions are the regulatory region, the transcription and translation initiation sites, the donor, acceptor and branch point sites, the polyA site and the translation termination site. We detect these binding sites by their positional information content (entropy). The coding potential of protein-coding regions, i.e. exons, is estimated by a G+C-dependent interleaved 6-tuple word entropy. In gene prediction programs, binding site and exon scoring is usually followed by an optimization step: similar to the Viterbi algorithm, dynamic programming combines binding site scores and coding potential into maximum-scoring paths. These paths correspond to the model's most likely gene structure.
In GENIO/scan the query sequence is aligned (with gaps) against EST databases and a special CDS database using the BLAST 2.0 program (current EST length distributions are shown here: EST length distribution, human EST length distribution). The resulting database hits are sorted by position, type of EST hit and e-value fitness. The selected ESTs are then fetched from the databases, and overlapping ESTs are assembled into contigs. This optional assembly step merges consistent ESTs and rejects inconsistent ones. In a second BLAST run the unmasked sequence is aligned with the contig database. In the following step a rule-based inference engine generates gene structures that are consistent with both prediction and alignment. A final Viterbi optimization detects the maximum-scoring exon paths.
The database alignment improves the accuracy of GENIO/scan gene prediction if one or more predicted exons are hit by an EST, especially if an EST aligns with multiple exons. Multi-exon EST hits (i.e. hits spanning intervening sequences) determine the complete gene structure in many cases. The resulting gene structures can be identified more clearly than with conventional search or prediction methods alone.
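
The optimization step described above can be illustrated with a small dynamic programming sketch. It is not the GENIO/scan implementation: it uses a simplified stand-in (weighted interval scheduling over candidate exons) instead of a full Viterbi pass, and it ignores reading frame, strand and splice site compatibility. The names `Exon`, `best_exon_path` and `est_bonus` are hypothetical, and the flat bonus for EST-supported exons is an assumption made only for illustration.

```python
from bisect import bisect_right
from typing import NamedTuple


class Exon(NamedTuple):
    start: int             # first base of the candidate exon (inclusive)
    end: int               # one past the last base (exclusive)
    score: float           # combined site score and coding potential
    est_hit: bool = False  # True if an EST alignment supports this exon


def best_exon_path(exons, est_bonus=1.0):
    """Return (total_score, path) for the best chain of non-overlapping exons."""
    ordered = sorted(exons, key=lambda e: e.end)
    ends = [e.end for e in ordered]
    n = len(ordered)
    best = [0.0] * (n + 1)    # best[k]: best score using only the first k exons
    taken = [False] * (n + 1)
    prev = [0] * (n + 1)

    for k, exon in enumerate(ordered, start=1):
        # j = number of earlier exons that end at or before this exon's start
        j = bisect_right(ends, exon.start, 0, k - 1)
        gain = exon.score + (est_bonus if exon.est_hit else 0.0)
        if best[j] + gain > best[k - 1]:
            best[k], taken[k], prev[k] = best[j] + gain, True, j
        else:
            best[k] = best[k - 1]

    # Trace back from the last table entry to recover the chosen exons.
    path, k = [], n
    while k > 0:
        if taken[k]:
            path.append(ordered[k - 1])
            k = prev[k]
        else:
            k -= 1
    path.reverse()
    return best[n], path


candidates = [
    Exon(0, 100, 2.0),
    Exon(50, 150, 2.5),
    Exon(120, 180, 0.4),
    Exon(200, 300, 1.0, est_hit=True),
]
total, path = best_exon_path(candidates)
```

In this toy input the EST-supported exon at 200-300 is chained with the higher-scoring of the two overlapping upstream candidates. A realistic model would replace the flat bonus with scores derived from the alignments themselves.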

This work is part of the joint research project "Computer Aided Automatic Sequence Analysis of the Human Genome", sponsored by the Federal Ministry of Education, Science, Research and Technology (BMBF), BMBF Förderkennzeichen FZK01KW9631/6.
Take a look at GENIO/seq, the nonredundant eukaryotic gene sequence database, and GENIO/logo, a logo representation of positional information content. The WWW page of GENIO/splice (splice site and exon prediction) is here. The splice site prediction is based on positional information content measurements of DNA sequences from the GENIO/seq database.
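
As a rough illustration of positional information content, the sketch below computes the per-position information (in bits) of a set of aligned sites relative to a background base distribution. It is a minimal example under stated assumptions, not the GENIO/splice code: the function name `positional_information`, the uniform default background and the toy donor-like sequences are all hypothetical, and no small-sample correction is applied.

```python
import math
from collections import Counter


def positional_information(sites, background=None):
    """Per-position information content (bits) of equal-length aligned sites.

    Each position is scored by the relative entropy between the observed
    base frequencies and the background distribution. With a uniform
    background this equals 2 - H(position) for DNA.
    """
    if not sites:
        raise ValueError("need at least one aligned site")
    if background is None:
        background = {base: 0.25 for base in "ACGT"}  # assumed uniform
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")

    info = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        total = sum(counts.values())
        bits = 0.0
        for base, n in counts.items():
            p = n / total
            bits += p * math.log2(p / background[base])
        info.append(bits)
    return info


# Toy aligned sites, invented for illustration only.
toy_sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTCAGT"]
info = positional_information(toy_sites)
```

Fully conserved positions (here the leading G and T) reach the maximum of 2 bits, while variable positions score lower. With real data the background would typically reflect the G+C content of the sequence under study.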

Intelligent Systems for Molecular Biology, Montreal (ISMB-98)

  • ISMB-98 Poster (small GIF-File), (large GIF-File), (Postscript, zipped)
  • The GENIO suite: Patent and Licensing Agency (PLA), round table 4

    GENIO suite:

    GENIO/seq  - A Non-Redundant Eukaryotic Gene Database of Annotated Sites and Sequences
    GENIO/lookup - masked (eukaryotic) EST search with BLAST
    GENIO/cover  - GENIO/scan EST Coverage Analysis of Currently Known Eukaryotic Coding Sequences
    GENIO/logo  - Nucleic and Amino Acid Sequence Logos
    GENIO/splice - Splice Site and Exon Prediction in Human Genomic DNA
    GENIO/frame - Frame Shift Analysis and Sequencing Error Detection
    Important information: Due to security and privacy requirements, the data you send as well as the data generated during your GENIO request(s) will be erased automatically 15 minutes after your request. Please download your data files within 15 minutes.

    Problems, suggestions, remarks?
    Feel free to send an email to Niels Mache (mache at  (my home page is here)