Bioinformatic Approaches for Gene Finding

Year: 2012, Volume: 1, Issue: 1
First page: 65. Last page: 68
Print ISSN: 2319-118X. Online ISSN: 2319-1198. Published online: 2012 April 1.

Banerjee Sanjukta¹,*, Akuli RamKrishna²

¹Research Scholar, Department of Biotechnology, Bidhan Chandra Krishi Vishwavidhyalaya, West Bengal, India

²Project Leader, Tata Consultancy Services, Kolkata, India

*Email id: sanju.banerjee03@gmail.com

Introduction

The 21st century has seen the announcement of the draft version of the human genome sequence. Model organisms have been sequenced in both the plant and animal kingdoms and, currently, more than 60 eukaryotic genome sequencing projects are underway. However, biological interpretation, that is, annotation, is not keeping pace with this avalanche of raw sequence data. There is still a real need for accurate and fast tools to analyse these sequences and, especially, to find genes and determine their functions. Unfortunately, finding genes in a genomic sequence is far from a trivial problem. The widely used and recognised approach to genome annotation consists of employing, first, homology methods, also called 'extrinsic methods' and, second, gene prediction methods or 'intrinsic methods'. It seems that only approximately half of the genes can be found by homology to other known genes or proteins (although this percentage is of course increasing as more genomes are sequenced). To find the remaining 50% of genes, the only solution is to turn to predictive methods and to develop fast, accurate and reliable gene finders.

With the development of genome sequencing for many organisms, more and more raw sequences need to be annotated. Gene prediction by computational methods, that is, finding the location of protein-coding regions, is one of the essential problems in bioinformatics. Two classes of methods are generally adopted: similarity-based searches and ab initio prediction. Gene discovery in prokaryotic genomes is less difficult, owing to the higher gene density typical of prokaryotes and the absence of introns in their protein-coding regions. In prokaryotes, DNA sequences that encode proteins are transcribed into mRNA, and the mRNA is usually translated into protein without significant modification. In eukaryotic organisms, the problem is quite different from that encountered in prokaryotes.
Transcription of protein-coding regions, initiated at specific promoter sequences, is followed by removal of non-coding sequences (introns) from the pre-mRNA by a splicing mechanism, leaving the protein-encoding exons. Once the introns have been removed and certain other modifications to the RNA have been made, the resulting mature mRNA can be translated in the 5′ to 3′ direction, usually from the first start codon to the first stop codon. Because of the intron sequences in the genomic DNA of eukaryotes, the ORF corresponding to an encoded gene is interrupted by introns, which usually introduce stop codons.
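The effect of introns on the reading frame can be illustrated with a minimal Python sketch. The toy sequence, the exon coordinates and the helper names `splice` and `first_stop` are assumptions made for illustration only; real splicing depends on splice-site signals that are not modelled here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def splice(genomic, exons):
    """Join exon segments given as (start, end) pairs, end exclusive."""
    return "".join(genomic[start:end] for start, end in exons)

def first_stop(seq):
    """Index of the first in-frame stop codon reading from position 0, or None."""
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None

# Toy gene (hypothetical): exon 1, an intron starting with GT and ending with AG, exon 2.
exon1, intron, exon2 = "ATGGCC", "GTAAGTTTTTAG", "AAATAA"
genomic = exon1 + intron + exon2
exons = [(0, 6), (18, 24)]

mrna = splice(genomic, exons)
print(first_stop(genomic))  # premature stop codon inside the intron
print(first_stop(mrna))     # genuine stop codon at the end of the spliced message
```

Read directly from the genomic sequence, the frame terminates inside the intron; only after the intron is removed does the ORF run from the start codon to the true stop codon.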

There are two important aspects to any programme for gene identification:

One is the type of information used by the programme, and
the other is the algorithm that is employed to combine that information into a coherent prediction.

Three types of information are used in predicting gene structures: signals in the sequence, such as splice sites; content statistics, such as codon bias; and similarity to known genes. The first two types have been used since the early days of gene prediction, whereas similarity information has been used routinely only in recent years. One of the reasons that the accuracy of gene-prediction programmes has improved in the last few years is the enormous increase in the number of examples of known coding sequences. This much larger sample size allows more reliable statistical measures to be developed, and makes it much more likely that a newly encountered gene is related to one that has been identified previously.

Approaches for Gene Finding

Computational methodology for finding genes in a genome has evolved significantly over the last 20 years. Many approaches have been proposed to find genes in both prokaryotes and eukaryotes. These approaches mainly fall into three categories: homology-based approaches, Ab Initio approaches and comparative genomics approaches.

Homology-based Approaches

These approaches are based on sequence similarity. Given a library of sequences from other organisms, we search the library with the target sequence and identify library sequences (known genes) that resemble it. In addition, we can compare the target sequence with Expressed Sequence Tags (ESTs) of the same organism to identify regions corresponding to processed mRNA. If the matching library sequences are genes, the target sequence is probably (putatively) a gene. These approaches are able to find biologically relevant genes. However, they cannot identify genes that code for proteins not already in the library. BLAST (Basic Local Alignment Search Tool) is a well-known search tool in this category.
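The library search described above can be sketched in Python with a plain Smith-Waterman local alignment score. This is not how BLAST works internally (BLAST uses a faster seed-and-extend heuristic); the scoring parameters, the function names and the toy library are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def best_library_hit(target, library):
    """Return (name, score) of the library sequence most similar to target."""
    scores = {name: smith_waterman_score(target, seq) for name, seq in library.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]

# Hypothetical library of known genes.
library = {"geneA": "TTTACGTACGTTT", "geneB": "GGGGGGGG"}
print(best_library_hit("ACGTACG", library))
```

In practice a score threshold or a statistical significance measure would be needed before calling the target a putative gene; the sketch only ranks the candidates.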

Ab-Initio Approaches

Ab initio gene finding searches a sequence for signals characteristic of protein-coding genes, without relying on similarity to known genes. The difficulty of the task depends on the type of organism: prokaryote or eukaryote.

Prokaryotes have small genomes (roughly 0.5 to 10 million bp) and high coding density (>90%). There are no introns in prokaryotes. Gene finding for prokaryotes is relatively easy, since prokaryotic genes have specific signals, such as transcription factor binding sites and the Pribnow box, that are easy to identify. Moreover, the protein-coding sequence is a contiguous Open Reading Frame (ORF), starting with a start codon (ATG) and ending with a stop codon (TAG/TGA/TAA).
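The definition of a contiguous ORF given above translates directly into a simple scan. The following Python sketch is a simplification: it scans only the three forward-strand frames, accepts only ATG as a start codon (as in the text, although some prokaryotic genes use alternative start codons such as GTG or TTG), and the `min_codons` threshold is an arbitrary assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for forward-strand ORFs.

    `end` is exclusive and includes the stop codon; the length threshold
    counts the stop codon as well.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # open an ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                    # close the ORF at the stop codon
    return orfs

print(find_orfs("CCATGAAATTTTAGCC"))
```

A complete scan would repeat the procedure on the reverse complement to cover all six reading frames.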

However, for eukaryotic genes, ab initio gene finding is more difficult for the following reasons. First, genes are separated by large intergenic regions. Second, a gene is not contiguous: it is divided into exons and introns, and the introns are removed by the splicing mechanisms of eukaryotic cells. These split genes make it difficult to define ORFs. Third, the signals (e.g., promoters) are more difficult to identify than those in prokaryotes, since they are more complex and less well defined. Two such signals are CpG islands and polyadenylation signals (the sites that direct addition of the poly-A tail).
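As an illustration of one such signal, a window of sequence can be screened for CpG-island character using its GC content and the ratio of observed to expected CpG dinucleotides. The thresholds below (GC content above 0.5, observed/expected ratio above 0.6) are commonly quoted criteria rather than values taken from this article, and the function names are assumptions; island definitions usually also require a minimum length of a few hundred bp, which is not enforced here.

```python
def cpg_stats(window):
    """Return (GC fraction, observed/expected CpG ratio) for a sequence window."""
    window = window.upper()
    n = len(window)
    c = window.count("C")
    g = window.count("G")
    cpg = window.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (c * g) / n.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp

def looks_like_cpg_island(window, min_gc=0.5, min_obs_exp=0.6):
    """Apply the two composition thresholds to a single window."""
    gc_fraction, obs_exp = cpg_stats(window)
    return gc_fraction > min_gc and obs_exp > min_obs_exp

print(cpg_stats("CGCGCGCG"))
print(looks_like_cpg_island("CCCCGGGG"))
```

The second example is GC-rich but depleted in the CpG dinucleotide itself, which is why both measures are needed.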

Comparative Gene Finding

Comparative genomics is the analysis and comparison of genomes from different species. The purpose is to gain a better understanding of how species have evolved and to determine the function of genes and non-coding regions of the genome. Researchers have learned a great deal about the function of human genes by examining their counterparts in simpler model organisms such as the mouse. Genome researchers look at many different features when comparing genomes: sequence similarity, gene location, the length and number of coding regions (called exons) within genes, the amount of non-coding DNA in each genome, and highly conserved regions maintained in organisms as simple as bacteria and as complex as humans.

GENSCAN's model makes use of statistical features of the genome under consideration, obtained from an annotated training set. More recently, a number of methods have been suggested that attempt to also make use of comparative data. They are based on the observation that the level of sequence conservation between two species depends on the function of the DNA; for example, the coding sequence is more conserved than the intergenic sequence. One such programme is Rosetta, which first computes a global alignment of two homologous sequences and then attempts to predict genes in both sequences simultaneously. A second is the conserved exon method that uses local conservation. The TWINSCAN programme is an extension of GENSCAN that additionally models a conserved sequence.
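The observation that coding sequence is more conserved than intergenic sequence can be made concrete by measuring identity in windows along a pairwise alignment. The sketch below assumes the two sequences have already been aligned (gaps written as "-"); the window and step sizes and the function name are illustrative assumptions, and none of the programmes named above is claimed to work this way.

```python
def window_identity(aln_a, aln_b, window=6, step=3):
    """Fraction of identical, non-gap columns in each window of an alignment."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(0, len(aln_a) - window + 1, step):
        pairs = zip(aln_a[start:start + window], aln_b[start:start + window])
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        results.append((start, matches / window))
    return results

# Toy alignment: a conserved block followed by a diverged block.
print(window_identity("ATGGCCAAATTT", "ATGGCCTTTAAA", window=6, step=6))
```

Windows with high identity are candidates for functional (for example, coding) sequence; a comparative gene finder combines such evidence with signal and content models rather than using it alone.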

Markov Model-Based Algorithms: Several highly accurate prokaryotic gene-finding methods are based on Markov model algorithms. The GeneMark family includes two major programmes, called GeneMark and GeneMark.hmm. Analysis of DNA from any prokaryotic species without a pre-computed species-specific statistical model is enabled by a self-training programme, GeneMarkS. GeneMark uses a Bayesian formalism to assess the a posteriori probability that a given short fragment is part of a coding or non-coding region. These calculations are performed using Markov chain models.

The idea behind this is that there are specific correlations between adjacent nucleotides in chromosomal DNA sequences. Markov chains have been shown to be appropriate for inferring a statistical description of gene structure.
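The Bayesian use of Markov chains can be sketched as follows. This is a deliberately simplified model, not GeneMark's actual one: it uses homogeneous first-order chains, a uniform distribution for the first nucleotide, add-one pseudocounts and toy training sequences, all of which are assumptions made for illustration. Sequences are expected to contain only upper-case A, C, G and T.

```python
import math

ALPHABET = "ACGT"

def train_markov(seqs, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next | previous)."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for s in seqs:
        for x, y in zip(s, s[1:]):
            counts[x][y] += 1
    return {a: {b: counts[a][b] / sum(counts[a].values()) for b in ALPHABET}
            for a in ALPHABET}

def log_likelihood(seq, model):
    """Log-probability of seq under the chain (first base assumed uniform)."""
    ll = math.log(0.25)
    for x, y in zip(seq, seq[1:]):
        ll += math.log(model[x][y])
    return ll

def posterior_coding(seq, coding, noncoding, prior_coding=0.5):
    """A posteriori probability that the fragment is coding (Bayes' rule)."""
    lc = log_likelihood(seq, coding) + math.log(prior_coding)
    ln = log_likelihood(seq, noncoding) + math.log(1.0 - prior_coding)
    m = max(lc, ln)                      # subtract the maximum for numerical stability
    pc, pn = math.exp(lc - m), math.exp(ln - m)
    return pc / (pc + pn)

# Toy training sets standing in for annotated coding and non-coding sequence.
coding = train_markov(["GCGCGCGCGCGC"])
noncoding = train_markov(["ATATATATATAT"])
print(posterior_coding("GCGCGC", coding, noncoding))
print(posterior_coding("ATATAT", coding, noncoding))
```

A fragment is assigned to the class with the higher posterior; sliding the fragment along the genome gives a coding-potential profile.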

GeneMark is the oldest method based on Markov models, and it was the first to introduce a Markov chain model of DNA sequences. Its accuracy is limited because it lacks precision in determining the translation initiation codon. The initial success of GeneMark paved the way for further research in this direction. GeneMark.hmm was designed to improve on GeneMark in finding exact gene starts, so its properties are complementary to those of GeneMark. GeneMark.hmm takes the GeneMark models of coding and non-coding regions and incorporates them into a hidden Markov model framework.

Glimmer 3.0: The core of Glimmer is an Interpolated Markov Model (IMM), which can be described as a generalised Markov chain with variable order. After GeneMark introduced fixed-order Markov chains, Glimmer attempted to find a better approach for modelling genome content. The motivation is that the higher the order of the Markov chain, the more non-randomness can be described. However, as we move to higher-order models, the number of probabilities that must be estimated from the data increases exponentially.
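The trade-off can be sketched in Python. An order-k chain over four nucleotides has 4^(k+1) conditional probabilities, so long contexts are often too rare in the training data to estimate reliably; an interpolated model therefore blends the estimate for a long context with that for shorter ones. The weighting rule used below (context count divided by count plus a damping constant) is a simplified assumption, not the weighting scheme actually used by Glimmer, and all names are hypothetical.

```python
from collections import Counter

def n_parameters(order):
    """Number of conditional probabilities in a fixed-order chain over ACGT."""
    return 4 ** (order + 1)

def count_contexts(seqs, max_order):
    """Count (context, next_base) pairs for every context length 0..max_order."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s)):
            for k in range(max_order + 1):
                if i - k >= 0:
                    counts[(s[i - k:i], s[i])] += 1
    return counts

def imm_prob(context, base, counts, damping=5.0):
    """Interpolated P(base | context), backing off to shorter contexts."""
    if context == "":
        total = sum(counts[("", b)] for b in "ACGT")
        return (counts[("", base)] + 1) / (total + 4)     # add-one smoothing
    total = sum(counts[(context, b)] for b in "ACGT")
    shorter = imm_prob(context[1:], base, counts, damping)
    if total == 0:
        return shorter                                    # context never observed
    weight = total / (total + damping)                    # trust frequent contexts more
    return weight * counts[(context, base)] / total + (1 - weight) * shorter

counts = count_contexts(["ACGTACGTACGT"], max_order=2)
print([n_parameters(k) for k in (0, 2, 5, 8)])
print(imm_prob("AC", "G", counts), imm_prob("AC", "T", counts))
```

Frequently observed contexts contribute mostly their own estimate, while rare or unseen contexts fall back on lower-order statistics.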

AMIGene: AMIGene is included in the list of gene finders reviewed in this paper because it can be very helpful in some cases. The interesting thing about AMIGene is that it serves as a substitute for manual curation, because it searches for the most likely CoDing Sequences (CDSs) in the output of a GeneMark-like programme. AMIGene predicts the genome structure in the same way as GeneMark. In addition, AMIGene investigates codon usage patterns and relative synonymous codon usage in the predicted CDSs, using the multivariate statistical technique of factorial correspondence analysis (FCA) and k-means clustering.
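Relative synonymous codon usage (RSCU) can be computed as the observed count of a codon divided by the mean count of all codons encoding the same amino acid. The Python sketch below assumes the standard genetic code and covers only the RSCU step; the FCA and k-means stages are not reproduced, and the function name is an assumption.

```python
from collections import Counter
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def rscu(cds):
    """RSCU value for each codon of every amino acid present in the CDS."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)
    values = {}
    for aa, family in families.items():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue                      # amino acid absent from this CDS
        for c in family:
            # observed count divided by the mean count in the synonymous family
            values[c] = counts[c] * len(family) / total
    return values

values = rscu("GCTGCTGCTGCC")             # four alanine codons
print(values["GCT"], values["GCC"], values["GCA"])
```

An RSCU of 1 means a codon is used as often as expected under uniform synonymous usage; values above 1 indicate preferred codons. Vectors of such values per CDS are the kind of input on which a clustering step could operate.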

FGenesB: FGenesB is another Markov chain-based algorithm, claimed to be more accurate than GeneMarkS and Glimmer. Unlike them, it finds tRNA and rRNA genes in addition to coding sequences. FGenesB also annotates the genes, that is, it identifies their functions by homology with protein databases. As rRNA genes are highly conserved in evolution, FGenesB identifies them easily in the genome by comparing them against bacterial and archaeal rRNA databases using the Basic Local Alignment Search Tool (BLAST), which is described in the section on homology-based approaches.

Future Challenges for the Gene-Prediction Field

To create better algorithms for identifying general, as well as tissue- or developmental-specific classes of promoters.
To achieve a greater understanding of CpG islands and methylation patterns.
To have a better characterisation of the splicing enhancers and silencers that mediate alternative splicing, to allow models to predict alternative exons or aberrant splicing events.
To identify short exons, and to predict very long exons, more accurately.
To identify non-translated exons.
To predict polyadenylation sites and transcriptional termination sites.
To identify mRNA features that are related to mRNA editing, nonsense-mediated decay, stability and transport.
To predict genes that encode non-coding RNAs.

Conclusion

With many genome sequencing projects currently under way, and although there are still problems to be solved, the comparative genomics approach seems very promising, not only for gene prediction but also for the identification of regulatory sequences and the deciphering of so-called junk DNA. Since gene prediction leads to a structural annotation of the genome, which is then used for experimentation, it would be wise to weight the predictions by giving a confidence value to each predicted gene, from high for a gene whose full structure has been obtained unambiguously using cognate cDNA data to low for a gene whose prediction depends entirely on intrinsic approaches. Although in this study we have discussed only programmes that detect protein-coding genes, there is also an undetermined (but probably high) number of genes producing functional non-coding RNAs, which may be identified by genomic comparison. More interest is now devoted to such non-coding RNAs, and this probably stands among the main future directions in computational approaches for genome analysis.
