• Tidak ada hasil yang ditemukan

Directory UMM :Data Elmu:jurnal:P:PlantScience:Plant Science_BioMedNet:181-200:

N/A
N/A
Protected

Academic year: 2017

Membagikan "Directory UMM :Data Elmu:jurnal:P:PlantScience:Plant Science_BioMedNet:181-200:"

Copied!
6
0
0

Teks penuh

(1)

Genome data have to be converted into knowledge to be useful to biologists. Many valuable computational tools have already been developed to help annotation of plant genome sequences, and these may be improved further, for example by identification of more gene regulatory elements. The lack of a standard computer-assisted annotation platform for eukaryotic genomes remains a major bottle-neck.

Addresses

*†Laboratoire Associé de l’INRA (France), *†‡Department of Genetics,

Flanders Interuniversity Institute of Biotechnology, University of Ghent, KL Ledeganckstraat 35, B-9000 Gent, Belgium

*e-mail: Pierre.Rouze@gengenp.rug.ac.be

e-mail: Nathalie.Pavy@gengenp.rug.ac.bee-mail: Stephane.Rombauts@gengenp.rug.ac.be Current Opinion in Plant Biology1999, 2:90–95 http://biomednet.com/elecref/1369526600200090 © Elsevier Science Ltd ISSN 1369-5266

Abbreviations

AGI Arabidopsisgenome initiative

EST expressed sequence tag

Introduction

High throughput genome sequencing projects are produc-ing an enormous amount of raw sequence data. For plants, the complete sequence of the Arabidopsis thaliananuclear genome is expected for the year 2000; about one-third of the expected 120 Mb whole sequence is already available [1,2••] and sequencing the rice genome has begun (see

S Goff this issue pp 86–89). Unannotated sequences are not useful as such, except to find a given DNA sequence, because rich database queries cannot be done and the rel-evance of homology searches cannot be evaluated. What biologists are looking for is information derived from the genome sequence: markers, genes, mRNA or deduced protein sequence, not the sequence itself. To be meaning-ful to them, the genome sequences have to be converted into biologically significant knowledge; annotation is the first step toward knowledge acquisition. Contrary to a recent opinion [3], we believe that this step should better be done as soon as possible. Annotation is certainly a risky process and at present its result is far from optimal, even for well-documented bacterial genomes [4•]. But it is a

trade-off, with the benefit for the plant research communi-ty as the abilicommuni-ty to mine the sequence data more efficiently, and hence to plan classic and genome-wide experiments which would gain a wider biological knowledge more quickly, and would include a validation of the annotation itself. As pointed out by Meinke et al. [2••], a long list of

fundamental and technological research advances have already resulted from the Arabidopsis genome initiative (AGI); for a number of these advances the availability of annotated sequence played a important role.

For a genomic DNA sequence, annotation is the process by which an anonymous sequence is documented, by posi-tioning along the sequence the various sites and segments that are involved in the genome functionality. Many ele-ments of possible biological relevance that are not, or not yet, linked to a particular gene may well be annotated, such as matrix attachment regions (MARs) [5] and sequence repeats. Nevertheless, the basic objects in genome annota-tion are obviously the genes and the related elements that are involved in their structure, their expression, and their function or the function of their encoded products.

Here, we will particularly consider recent annotation tools that have been developed for plant nuclear genomes, and will also present tools and results obtained from other genomes, as much more information is avail-able from bacterial, yeast and mammalian (especially human) genome projects. Besides, plant nuclear genomes are not the only ones. The genomes from the many plant-interacting organisms for which genomics projects are ongoing [6], as well as organellar genomes, are useful sources of information for plant biologists.

Structural and functional annotation

Although gene annotation has been in existence as long as the sequence databases themselves, and is familiar to any biologists who has cloned and sequenced a particular gene, annotating genomic sequences is in fact a new kind of task. Many tools used for genomic sequences of individual genes are inadequate, firstly because of the size of the genome contig sequences. Sequence size in itself is often a problem for pre-genome software, together with the inability to deal with sequences containing several genes on both strands. Secondly, and more importantly, the sequence of a single gene often comes as a result of a care-fully designed experimental approach which supplies biological information, contrary to systematically obtained genome sequences [3]. Functional genomics analysis in plants [7•] should partially fill this gap by providing

genome-wide experimental data on function and expres-sion of genes, and will undoubtedly contribute to a much improved functional annotation in the near future.

Annotation is a two step process. The first step can be described as structural annotation and consists of finding the location of the biologically relevant sites and strings, ending up with a coherent model for the whole sequence where each object is properly defined and each object component has a unique location. As the major objects are genes, this step is often likened to gene finding, although it is a true annotation step. After structural annotation has been carried out, there is a generic description of the whole sequence, which can now be used in a more powerful way for database searches (e.g. for deduced protein sequences)

Genome annotation: which tools do we have for it?

(2)

and for experimental purposes (e.g. to design primers for exons). The second step is an information processing step, and is best described as functional annotation. It consists of attributing a range of specific biologically relevant infor-mation (e.g. species, source, gene, gene product or domain name and function) to the sequence as a whole, to each compound object and to each individual gene component. This information may reside inside the database proper, for example in specified fields and in feature tables as in the GenBank/EMBL/DDBJ public DNA databases or may be obtained through links to other databases [8,9•].

SWISS-PROT, which now offers links to 29 different data-bases [10•], is the best example of functional annotation

and this expert curated protein database represents a turn-ing point in interconnected databases.

Structural annotation: finding and locating the genes

Computational approaches to gene finding are a very active field in bioinformatics with significant advances in the past few years. Several reviews aimed at a variety of readerships have been written recently on this issue [11,12•,13,14,15••,16••] and a useful bibliography on

com-putational gene recognition is maintained on the Internet by Li [17]. Fundamentally, there are two ways to find a gene. The more obvious way is to search in sequence data-bases for a homologue. Besides giving the exon/intron structure, the homology approach also gives some indica-tion of gene funcindica-tion. This is done using the familiar BLAST or FASTA suite of programs; BLASTN searching in DNA databases is useful for the really close homologues which include expressed sequence tags (ESTs) whereas BLASTX searching in protein databases is much more sen-sitive. New flavours of BLAST have been introduced: allowing gapping (BLAST2) and higher seach sensitivity through an iterative process (PSI-BLAST) [18], taking into account pattern information [19], or providing interactive and specially tuned searching for large contig analysis (PowerBLAST) [20]. These approaches, nevertheless, depend on the existence of homologues in the database, and on their correct annotation [3]. In practice, for the Arabidopsisgenome, a close homologue can be found for about one third of the genes, and either distant or partial similarities for an additional third, at best. Modeling gene structure from homology alone is valid only for the first third. Specific software such as PROCRUSTES [21•,22],

EST_GENOME [23], sim4 [24•] and EbEST [25] has

been written to find structure genes through comparison with ESTs or distant cDNAs. AAT is an analysis and anno-tation tool with programs for comparing the query sequence with a protein database [26], and is the only one routinely used for Arabidopsis. Bork and Koonin [27] have given a perceptive analysis of the issue of predicted protein sequences and pointed to the potential problems and bot-tlenecks of this exercise these being the lack of an accepted gene prediction system integrating a robust and updated suite of sequence analysis methods first, and second, the poor or misleading information in databases which easily ends up in bad annotations which then propagate.

To search for genes with no homologue, and to confirm and extend the ones with poor homologues, programs have been developed to find genes only from the knowledge of the sequence. They rely on characteristic gene properties that leave clues in their sequence [11] which are mostly linked to their ability to be expressed, for example regu-larity and nucleotide bias in the coding sequence, motifs and composition bias for transcription and translation. Many algorithms have been written to find either gene ele-ments (e.g. start or splice sites), compound objects (e.g. exons) or to model genes as a whole. Obviously the task differs from organism to organism. For prokaryotes, where the coding information is dense and uninterrupted, it is easier than for higher eukaryotes; two gene prediction programs are the current standards: GeneMark [28] and Glimmer [29]. For eukaryotes, software development is largely driven towards the human genome. This software, as well as software devised for yeast or C. eleganscannot be used as such for the plant genomes. Even if the basic mol-ecular gene replication and expression mechanisms are conserved, there are significant taxa peculiarities (e.g. trans-splicing in worms) and the genome style varia-tion is a reality [30•]. Software have to be specifically

developed for each genome, therefore, or adapted, at best retrained, from one genome to the other. Software devel-oped for Arabidopsis, a dicot, is often not adapted to rice, a monocot. Likewise, the validation measures made for a given software in a given species, for example humans [31], do not necessarily apply in another species for which this software may have been adapted, such as Arabidopsis.

Several annotation programs have recently been developed or adapted for gene prediction in Arabidopsis, most of them used by the various AGI consortiums (see Tables 1 and 2) Splice Predictor [32], NetPlantGene [33] and NetGene2 [34•] for splice site prediction, NetStart [35] for translational

start, GeneMark [28], the ACeDB GeneFinder (P Green, personal communication), GRAIL [36], MZEF [37•],

Solovyev’s gene-finder suite (FGENEA, FEXA and ASPL) [38] for exon prediction, GENSCAN [39•] and

GeneMark.hmm [40] for gene modelling on both strands. No evaluation on the respective performance of this software is yet available, but we are in the process of assessing it. Suprisingly, GeneMark.hmm, a software not yet utilized for annotation, appears to be the most efficient gene predictor for Arabidopsis(N Pavy, S Rombauts, VV Ramana Daluvuri, C Mathe, P Dehais et al., unpublished data). GeneGenerator [41•] is an integrative gene prediction software developed for

(3)

the human [13•] andC. elegans [42] genomes. In order to

improve the prediction of gene borders and to gain further biologically relevant information, the prediction of 5′and 3′ untranslated regions [43] and especially of core promoters have been attempted [44•,45••] with limited success so far

[15••]. Computer prediction of tRNA genes is efficient [46].

Other biological objects or remarkable features are often reported, such as transposable elements and repeats in Arabidopsisfor which two useful databases have been devel-oped [47]. Software for MAR prediction has recently been developed, on the basis of statistics of motifs often found associated with MARs vertebrates [5]; its value for plants needs further checking.

Functional annotation — the link to biological knowledge

Functional annotation of gene database entries relies on the expertise of scientists. For genome annotation, especially for the first pass, which has to be performed shortly after the release of a contig sequence, the annotator does not have the time to act as an expert. As there is no a priori knowledge of the sequence, every attribute of information has to be assessed by analogy, using a variety of computer tools. Many aspects of this process are well analysed in a recent exhaus-tive review [48••]. One should notice first that, up to now, it

is essentially the molecular function of the gene products that are annotated this way, and only more rarely the biolog-ical function of the gene. The information on protein function can be derived from full length sequence compar-isons, or better through multiple alignments, pattern and domain searches and predictions of location and structure for which a large spectrum of computer tools — too numerous to be cited here — are now available [16••], many of them

through the Internet. Predicting the molecular function of

the putative proteins encoded by the genes has many docu-mented pitfalls [4•,27,48••], which include the risk of

propagating a wrong annotation [3]. The only way to prevent this would be for experts to check manually each entry [4•].

Clearly, the quality of the previous structural annotation step will affect the quality of functional annotation. This is espe-cially true for the many cases where the structural annotation comes from the interaction between intrinsic gene prediction and extrinsic database searches. This often ends up with cor-rect labels given to genes listed with the wrong structure, or to wrong labels being given to genes with the correct struc-ture, or both, as we observed ourselves for Arabidopsis.

As an alternative, or a help to the time-consuming manual human expertise, active research towards automatic or com-puter-assisted procedures to fetch knowledge in literature is ongoing [49–51], but is only practical at present if the lit-erature source is a normalised text such as database entries. For proper annotation, nomenclature is a major concern which has been underevaluated. Fortunately for plants, the International Society for Plant Molecular Biology has taken an early initiative to work towards a kingdom-wide gene nomenclature available in the MENDEL database [52•].

Generating genome-wide annotations

The conversion of raw genome data into knowledge, here specifically sequence annotation, involves a series of inter-dependent tasks. Manual annotation is no longer a valid solution and several systems have been designed to cope with this flow of incoming genome sequences. Automatic annotation such as that done by GeneQuiz [53] has proven to be error prone in prokaryotes [4•]. Genotator [54] and

GAIA [55] are interactive graphical environments, allowing Table 1

The ArabidopsisGenome Initiative consortia and their annotation tools.

(a)Web sites of the consortia involved in Arabidopsissequencing and annotation groups. Consortium

TIGR The Institute for Genomic Research http://www.tigr.org/tdb/at/atgenome.html

CSHL Cold Spring Harbor Laboratory http://nucleus.cshl.org/protarab/

KAOS Kazusa Institute http://zearth.kazusa.or.jp/arabi/

SPP Stanford http://sequencewww.stanford.edu/SPP.html

University of Pennsylvania Plant Genome Expression Center

ESSA Munich Information center for Protein Sequences (MIPS) http://www.mips.biochem.mpg.de/proj/thal/ Genoscope Centre National de séquençage (France), annotation: MIPS http://www.genoscope.cns.fr/

(b)Prediction programs used by the different annotation groups.

Annotators NetPlant MZEF Gene-Finder GeneFinder GENSCAN GRAIL GeneMark AAT TRNA

Gene SCAN-SE

TIGR x x x x x x

CHSL x x x x x x x

SPP x x x x x x

MIPS x x x x x x*

KAOS x x x x x†

(4)

human expertise, that have recently been designed to face the needs of eukaryote genome annotations. In a further category, Imagene [56•] which has recently been used for B. subtilisannotation, has the additional capacity to chain the various component tasks needed for annotation, in sce-narios piloted by their individual results. For Arabidopsisa simpler management tool called ANNOTATOR is avail-able [26] and is used by several AGI consortiums. As stated by Bork and Koonin [27], ‘there is’ obviously ‘a lack of a widely accepted, robust and continuously updated suite of sequence analysis methods integrated into a coherent and efficient prediction system’.

Proper annotation of genome data is crucial for genomics [48••]. The huge potential of genome-wide approaches

resides indeed in the capacity to integrate the various kinds of data into a higher level of biological knowledge for one species and to allow comparison between genomes. This implies that some work on semantics and ontology is necessary to allow a full interconnectivity between data-bases [9•]. Gelbart [57] gives a striking example of such a

need when debating the definition of a gene, which has a conceptual definition for geneticists but is ascribed to a sequence entity in databases. This debate has practical implications in the way the data should be represented in the different databases: insertion mutagenesis, expression, proteome and sequence databases that will all be compo-nents of an Arabidopsis knowledge base. Special care

should be paid to this issue in plants where relatively few cases of alternative expression of genes are experimentally documented compared to vertebrates, but where the real extent of their occurrence is unknown.

Open problems and future trends

Although many advanced annotation tools exist, compu-tational annotation of genome sequences is still in its infancy. Gene finding software must still be significantly improved, especially for plant genomes which are not the main spot for developments. As an example, exon predic-tion programs have been improved taking into account the clustering of Arabidopsis coding sequences into two classes [58•]. The existing tools are intended to find the

mainstream highly expressed protein-encoding genes and have neglected many other genes of biological signifi-cance, such as low expressed genes, RNA-encoding genes (besides tRNAs), rare introns [59,60•], or objects

with less available information, such as promoters, tran-scriptional and translational control regions or MARs. There is ample room for the improvement of gene mod-eling and gene function prediction using improved identification and classification of promoters and tran-scriptional control regions. Annotation will evolve with time to cope with and feed new experiments; it is of major importance to track the way annotation has devel-oped, and to clearly identify what has an experimental basis and what is derived information.

Table 2

Web sites for annotation programs.

Program URL

PROCRUSTES http://www-hto.usc.edu/software/procrustes/index.html

EbEST http://ares.ifrc.mcw.edu/EBEST/ebest.html

AAT (analysis and annotation tool) http://genome.cs.mtu.edu/aat.html ALIGN (pairwise sequence alignment) http://genome.cs.mtu.edu/align/align.html MAP (multiple sequence alignment) http://genome.cs.mtu.edu/map/map.html

GLIMMER http://www.cs.jhu.edu/labs/compbio/glimmer.html

GRAIL http://compbio.ornl.gov/Grail-1.3/intro.html

SPLICEPREDICTOR http://gremlin1.zoo1.iastate.edu/~volker/SplicePredictor.html

NETPLANTGENE http://www.cbs.dtu.dk/services/NetPGene/

NETGENE2 http://www.cbs.dtu.dk/services/NetGene2/

NETSTART http://www.cbs.dtu.dk/services/NetStart/

MZEF http://siclio.cshl.org/genefinder/

FGENEA, FEXA and ASPL http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html

GENSCAN http://genomic.stanford.edu/GENSCANW.html

GeneMark.hmm http://genemark.biology.gatech.edu/GeneMark.hmmchoice.html

GeneMark http://genemark.biology.gatech.edu/GeneMark/webgenemark.html

tRNAscan-SE http://www.genetics.wustl.edu/eddy/ tRNAscan-se/

(5)

The status of Arabidopsisgenome annotation will change dra-matically with the availability of expression and insertion data [7•], which will secure the structural and functional

annotation of a large fraction of the genes, and serve to improve the computational methods for the rest awaiting experimental support. Similarly, sequencing the rice genome will also change deeply our understanding and annotation of the genes which are common to both plants. The challenge is in our capacity to integrate these many data, as well as to use the dispersed expertise of the whole research community.

Note added in proof

The work referred to in the text as N Terryn et al. has now been accepted for publication [62].

Acknowledgements

We gratefully acknowledge Roderic Guigo, Juergen Kleffe, Klaus Mayer and Larry Parnell for communicating unpublished information, Patrice Déhais and Luc Van Wiemeersch for continuous expert support with hardware and software and Martine de Cock for editorial assistance. Pierre Rouzé is a research director and Natalie Pavy a contract scientist of the Institut National de Recherche Agronomique (France).

References and recommended reading

Papers of particular interest, published within the annual period of review, have been highlighted as:

• of special interest ••of outstanding interest

1. Bevan M, Bancroft I, Bent E, Love K, Goodman H, Dean C, Bergkamp R, Dirkse W, Van Staveren M, Stiekema W et al.: Analysis of 1.9 Mb

of contiguous sequence from chromosome 4 of Arabidopsis

thaliana.Nature1998, 391:485-488.

2. Meinke DZ, Cherry JM, Dean C, Rounsley SD, Koornneef M:Arabidopsis

•• thaliana: a model plant for genome analysis.Science1998, 282:662-682.

This is a very useful review and progress report on the present status of genomics research using Arabidopsis thalianaas a model plant.

3. Wheelan SJ, Boguski MS:Late-night thoughts on the sequence

annotation problem.Genome Res1998, 8:168-169.

4. Galperin MY, Koonin EV: Sources of systematic error in functional

• annotation of genomes: domain rearrangement, non-orthologous

gene displacement, and operon disruption.In Silico Biol1998,

1:0007.

By looking at bacterial genome annotations obtained through similarity search, the authors identify several causes of inadequate annotation which lead to ‘database explosion’: non-critical use of databases and functional inference, low complexity regions, ignoring multi-domain organisation (http://www.bioinfo.de/isb/1998/01/0007/). A useful paper, together with [27], for those who would like to avoid similar mistakes.

5. Singh GB, Kramer JA, Krawetz SA:Mathematical model to predict

regions of chromatin attachment to the nuclear matrix.Nucleic

Acids Res1997, 25:1419-1425.

6. Preston GM, Haubold B, Rainey PB: Bacterial genomics and adaptation to life on plants: implications for the evolution of

pathogenicity and symbiosis.Curr Opin Microbiol 1998,

1:589-597.

7. Bouchez D, Höfte H: Functional genomics in plants.Plant Physiol • 1998, 118:725-732.

The first comprehensive review giving a broad spectrum of the ‘toolbox for the global study of gene function in plants’ and critically presenting the dif-ferent methods presently available.

8. Baker PG, Brass A: Recent developments in biological sequence

databases.Curr Opin Biotech1998, 9:54-58

9. Frishman D, Heumann K, Lesk A, Mewes HW: Comprehensive,

• comprehensible, distributed and intelligent databases: current

status.Bioinformatics1998, 14:551-561.

A review focusing on the challenges, and the conceptual and technical aspects of biological databases.

10. Bairoch A, Apweiler R: The Swiss-Prot protein sequence data bank

• and its supplement TrEMBL in 1999.Nucleic Acids Res1999, 27:49-54.

A short paper presenting the reality of an expert-curated database and its computer-generated complement, as a relational node between various databases. The entire issue of the journal is worth consulting to discover the wide range of existing biological databases.

11. Fickett JW: The gene identification problem: an overview for

developers. Comput Chem1996, 20:103-118.

12. JM Claverie: Computational methods for the identification of genes in

• vertebrate genomic sequences.Human Mol Genet1997,6:1735-1744.

A critical and comparative review on the software available for gene predic-tion, with a historical perspective over the last 15 years and an analytical pre-sentation of the methods behind the software. The author concludes by pointing to conservatism as the common limitation of current methods.

13. Guigo R:Computational gene identification: an open problem.

• Comput Chem1997,21:215-222.

The author presents the current limitations of gene finding programs and their low efficiency for gene modeling.

14. Krogh A: Gene finding: putting the parts together.In Guide to Human Genome Computing, edn 2.Edited by Martin Bishop. Oxford: Academic Press; 1998:261-274.

15. Burge CB, Karlin S: Finding the genes in genomic DNA.Curr Opin •• Struct Biol1998, 8:346-354.

This review presents the rationale of gene finding, and is useful to understand the methodologies (especially GENSCAN) and analyse the current limitations.

16. Trends guide to Bioinformatics Trends Supplement

•• 1998.

This very useful supplement to the Trendsseries of journals contains a few articles that are either fundamental or pragmatic and easy to read for their target audience of biologists. They include features on the most important every-day issues in using bioinformatics and deal with computational issues arising from genome annotation.

17. Computational gene recognition on World Wide Web URL: http://linkage.rockefeller.edu/wli/gene/

18. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman

DJ: Gapped BLAST and PSI-BLAST: a new generation of protein

database search programs.Nucleic Acids Res1997, 25:3389-3402.

19. Zhang Z, Schäffer AA, Miller W, Madden TL, Lipman DJ, Koonin EV, Altschul SF:Protein sequence similarity searches using patterns

as seeds.Nucleic Acids Res1998, 26:3986-3990.

20. Zhang J, Madden TL: PowerBLAST: a new network BLAST application for interactive or automated sequence analysis and

annotation.Genome Res1997, 7:649-656.

21. Sze SH, Pevzner PA: Las Vegas algorithms for gene recognition:

• suboptimal and error-tolerant spliced alignment.J Comput Biol

1997, 4:297-309.

This is a paper presenting a more efficient algorithm for the spliced align-ment method used in PROCRUSTES [22], the principle of which is to search using the exon-intron boundaries that maximize the score of align-ment of a cDNA with the genomic sequence.

22. Mironov AA, Roytberg MA, Pevzner PA, Gelfand MS:Performance

guarantee gene predictions via spliced alignment.Genomics

1998, 51:332-339.

23. Mott R: EST_GENOME: a program to align spliced DNA sequences

to unspliced genomic DNA.Comput Appl Biosci 1997, 13:477-478.

24. Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W: A computer

• program for aligning a cDNA sequence with a genomic DNA

sequence.Genome Res1998, 8:967-974.

This paper presents a new program which can be used to find splice sites by homology searching, and compares its performance to previous similar tools.

25. Jiang J, Jacob HJ: EbEST: an automated tool using expressed

sequence tags to delineate gene structure.Genome Res1998,

8:268-275.

26. Huang X, Adams MD, Zhou H, Kerlavage AR: A tool for analyzing

and annotating genomic sequences. Genomics1997,46:37-45.

27. Bork P, Koonin EV: Predicting functions from protein sequences:

where are the bottlenecks?Nat Genet1998, 18:313-318.

28. Borodovsky M, McIninch J: GENMARK: parallel gene recognition

for both DNA strands. Comput Chem1993, 17:123-133.

29. Salzberg SL, Delcher AL, Kasif S, White O: Microbial gene

identification using interpolated Markov models.Nucleic Acids

(6)

30. Karlin S: Global dinucleotide signatures and analysis of genomic

• heterogeneity.Curr Opin Microbiol 1998,1:598-610

This paper reviews the concept of ‘genomic signature’ introduced by the author to refer to the remarkable stability of dinucleotide abundance in an organism and its conservation in closely related taxa, allowing discrimination between organisms and identification of horizontal transfer.

31. Burset M, Guigo R: Evaluation of gene structure prediction

programs. Genomics1996, 34:353-367.

32. Brendel V, Kleffe J: Prediction of locally optimal splice sites in plant

pre-mRNA with applications to gene identification in Arabidopsis

thalianagenomic DNA. Nucleic Acids Res 1998, 26:4748-4757. 33. Hebsgaard SM, Korning PG, Tolstup N, Engelbrecht J, Rouzé P,

Brunak S: Splice site prediction in Arabidopsis thaliana pre-mRNA

by combining local and global sequence information. Nucleic

Acids Res 1996, 24:3439-3452.

34. Tolstrup N, Rouzé P, Brunak S: A branch point consensus from

• Arabidopsis found by non-circular analysis allows for better

prediction of acceptor sites. Nucleic Acids Res1997, 25:3159-3163.

In this paper, the authors show first that it is possible to localize the branch-point in documented Arabidopsis intron sequences using a Hidden Markov Model of the U2 snRNA site, and then that predicting intron branch-points in contigs in addition to splice sites [33] increases the prediction performance for acceptor sites. NetGene2 uses this method and in addition predicts the position of the splice sites in the codon, which could be used later on in gene modeling.

35. Pedersen AG, Nielsen H: Neural network prediction of translation initiation sites in eukaryotes: perspectives for EST and genome

analysis. InProceedings of the Fifth International Conference on

Intelligent Systems for Molecular Biology.Menlo Park: AAAI Press; 1997:226-233.

36. Xu Y, Uberbacher EC: Automated gene identification in large-scale

genomic sequences. J Comput Biol1997, 4:325-338.

37. Zhang MQ: Identification of protein coding regions in Arabidopsis

• thalianagenome based on quadratic discriminant analysis.Plant Mol Biol1998, 37:803-806.

This paper is an application of a previous one dedicated to the human genome, describing MZEF for Arabidopsis.The method is using quadratic discriminant analysis, a generalisation of linear discriminant analysis, as used by Solovyev and Salamov [38].

38. Solovyev V, Salamov A: The gene-finder computer tools for analysis

of human and model organisms genome sequences.In Proceedings

of the Fifth International Conference on Intelligent Systems for Molecular Biology.Menlo Park: AAAI Press; 1997:294-302.

39. Burge C, Karlin S: Prediction of complete gene structures in

• human genomic DNA. J Mol Biol1997, 268:78-94.

This paper describes GENSCAN, the most efficient gene prediction soft-ware to date for vertebrates. GENSCAN makes use of many signals and compositional statistical properties, with a dependency rule for splice sites. It allows the search for multiple genes on both strands.

40. Lukashin AV, Borodovsky M: GeneMark.hmm: new solutions for

gene finding.Nucleic Acids Res 1998, 26:1107-1115.

41. Kleffe J, Hermann K, Vahrson W, Wittig B, Brendel V:

• GeneGenerator — a flexible algorithm for gene prediction and its

application to maize sequences.Bioinformatics1998, 14:232-243.

The first gene prediction tool for a monocot, using a low information con-ceptual approach.

42. The C. elegans Sequencing Consortium: Genome sequence of the

• nematode C. elegans: a platform for investigating biology.Science

1998, 282:2012-2018.

The first report on the full sequence of a higher eukaryote, with a genome size comparable to Arabidopsis. The accompanying papers from the same issue are worth reading too, as an illustration of what genomics is offering to biology now.

43. Milanesi L, Muselli M, Arrigo P: Hamming-clustering method for

signals prediction in 5¢and 3¢regions of eukaryotic genes.

Comput Appl Biosci 1996, 12:399-404.

44. Fickett JW, Hatzigeorgiou AG: Eukaryotic promoter recognition.

• Genome Res1997, 7:861-878.

A documented review on computational promoter recognition, with an excel-lent first part on biology and a second on a fair comparative analysis of the available tools. The conclusion shows that there is much to be done yet.

45. Duret L, Bucher P: Searching for regulatory elements in human

•• noncoding sequences.Curr Opin Struct Biol1997, 7:399-406.

Another, more conceptual, view on the same topic which presents the power of a comparative phylogenetic approach to identify new unknown regulatory elements.

46. Lowe TM, Eddy SR: tRNAscan-SE: a program for improved

detection of transfer RNA genes in genomic sequence. Nucleic

Acids Res1997, 25:955-964.

47. Annotation of transposons and genomic repeats on World Wide Web URL: http://nucleus.cshl.org/protarab/TnAnnotation.htm

48. Bork P, Dandekar T, Diaz-Lazcoz Y, Eisenhaber F, Huynen M, Yuan Y:

•• Predicting function: from genes to genomes and back.J Mol Biol

1998, 283:707-725.

This paper raises the issue of genome-wide annotation, setting to bridge the gap between phenotype and genotype as a goal. An interesting point made in this review is on using the added value provided by completely sequenced genomes for gene function prediction.

49. Ohta Y, Yamamoto Y, Okazaki T, Uchiyama I, Takagi T: Automatic

constructing of knowledge base from biological papers.In

Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology. Menlo Park: AAAI Press; 1997:218-225.

50. Andrade M, Valencia A: Automatic extraction of keywords from scientific text: application to the knowledge domain of protein

families.Bioinformatics1998, 14:600-607.

51. Proux D, Rechenmann F, Julliard L, Pillet V, Jacq B: Detecting gene symbols and names in biological texts: a first step toward

pertinent information extraction.InProceedings from the Ninth

Workshop on Genome Informatics. Edited by Miyano S, Takagi T. Tokyo: Universal Academy Press Inc.; 1998:72-80.

52. Lonsdale D: Nomenclature regulation.Nature1998,

• 391:118.

The International Society for Plant Molecular Biology initiative to develop a kingdom-wide nomenclature can be found on the World Wide Web URL: MENDEL: http://www.mendel.ac.uk/ or http://genome-www.stanford.edu/ Mendel/ see also ATDB: http://genome-www.stanford.edu/Arabidopsis/ nomencl.html for additional information on nomenclature.

53. Sharf M, Schneider R, Casar G, Bork P, Valencia A, Ouzounis C, Sander C: GeneQuiz: a workbench for sequence analysis.In

Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology. Menlo Park: AAAI Press; 1994:348-353.

54. Harris NL: Genotator: a workbench for sequence annotation. Genome Res1997, 7:754-762.

55. Bailey LC, Fisher S, Schug J, Crabtree J, Gibson M, Overton GC:

GAIA: framework annotation of genomic sequence.Genome Res

1998, 8:234-250.

56. Médigue C, Rechenmann F, Danchin A, Viari A: Imagene: an

• integrated computer environment for sequence annotation and

analysis.Bioinformatics1999, 15:in press.

A new kind of integrated annotation environment is presented which is a task management platform incorporating a data management system and a graphical tool. A prototype of Imagene was used for B. subtilisgenome sequence annotation.

57. Gelbart W: Databases in genomic research.Science1998, 282:659-661.

58. Mathé C, Peresetsky A, Déhais P, Van Montagu M, Rouzé P:

• Classification of Arabidopsis thalianagene sequences: clustering

of coding sequences into two groups according to codon usage

improves gene prediction.J Mol Biol1999, 285:1977-1991.

This paper reports the clustering of Arabidopsiscoding sequences into two classes according to their codon usage, the main difference lying in the use of T and C at the third codon position. The classes correlate with gene expression. Using GeneMark, it was shown that using two gene models instead of one improves gene prediction.

59. Wu HJ, Gaubier-Comella P, Delseny M, Grellet F, Van Montagu M, Rouzé P: AU-AA splicing in Arabidopsis: non-canonical introns are

at least 109years old.Nat Genet1996, 14:383-384.

60. Sharpe PA, Burge CB: Classification of introns: U2-type or U12

• type.Cell1997, 91:875-879.

A synthetic presentation of the recent finding of a second class of spliceo-somal introns found in animal and plants [59]. Their distinctive splicing and signature are described together with the donor and the branch sites.

61. Fichant GA, Burks C: Identifying potential tRNA genes in genomic

DNA sequences.J Mol Biol1991, 220:659-671.

62. Terryn N, Heijnen L, De Keyser A, Van Asseldonck M, DeClercq R:

Evidence for an ancient chromosomal duplication in Arabidopsis

thalianaby sequencing and analysing a 400 kb contig at the

Gambar

Table 1
Table 2

Referensi

Dokumen terkait

Oleh karena itu, dibuatlah suatu aplikasi sistem antrian berbasis Android yang dapat memberikan informasi perkiraan waktu tunggu bagi Customer yang sedang berada

[r]

Jika sudah, siswa baru bisa mengisi angket untuk periode itu dengan memilih jadwal pelajaran yang sedang ada saat ini untuk menghasilkan output berupa laporan

[r]

Pada Proses pemesanan barang, data barang yang dipesan dan biaya kirim akan dicatat dan sistem akan memberikan konfirmasi pencatatan pesanan yang dilakukan oleh customer

[r]

Setelah semua juri telah selesai memberikan penilaiannya terhadap semua peserta pada suatu babak, Administrator akan dapat melihat nilai-nilai tersebut dalam suatu

5 Set Kota Bekasi 500,000,000 APBD 479,032,000 Lelang Sederhana TW II 45 Hari 5 Pengadaan Alat Peraga IPS (Globe Interaktif & Peta Interaktif) Pengadaan Alat Peraga IPS