www.unistra.fr



THEORETICAL BIOINFORMATICS

Head: Prof. Christian MICHEL

michel@dpt-info.u-strasbg.fr

Bioinformatique Théorique

FDBT, LSIIT (UMR CNRS-ULP 7005)

Université de Strasbourg

Pôle API, Boulevard Sébastien Brant

BP 10413

67412 Illkirch, France

Site: http://dpt-info.u-strasbg.fr/~michel/

NEWS 2009


MEMBERS

BIOINFORMATICS RESEARCH FIELDS

THE GENETIC CODE

An attractive problem solved with 2 codes?

GENE EVOLUTION

Linear, nonlinear or pseudochaotic?

RESEARCH SOFTWARE

PUBLICATIONS

LECTURES

PERSONAL DATA

MEMBERS

Permanent members

Christian MICHEL, Professor, michel@dpt-info.u-strasbg.fr

Gabriel FREY, Assistant professor (Maître de Conférences), g.frey@dpt-info.u-strasbg.fr

Sophie LEBRE, Assistant professor (Maître de Conférences), lebre@dpt-info.u-strasbg.fr

Non-permanent members

Ahmed AHMED, PhD student since 1 September 2005, ahmed@dpt-info.u-strasbg.fr

Emmanuel BENARD, PhD student since 1 September 2007, benard@dpt-info.u-strasbg.fr

Collaborations

GDR "Bioinformatique Moléculaire"

GDR "Informatique Mathématique" in the research group "Combinatoire des mots, algorithmique du texte et du génome"

Prof. Jacques Bahi, Université de Franche-Comté

Dr Alexis Criscuolo, Institut Pasteur

Dr Giuseppe Pirillo, Consiglio Nazionale delle Ricerche, Dipartimento di Matematica “U.Dini”, Florence



BIOINFORMATICS RESEARCH FIELDS

The Theoretical Bioinformatics group pursues fundamental and theoretical research aimed at identifying rules and properties in genes.

Review: Article [A38]

Circular codes (Identification in genes, genomes, microRNAs; Absence in frameshift genes; Genetic codes; Biological properties; Mathematical properties; Code theory): Articles [A41] [A40] [A39] [A36] [A33] [A30] [A28] [A27] [A22] [A21] [A19]

Stochastic models of gene evolution (Linear; Nonlinear; Pseudochaotic; Applications and evaluations with circular codes): Articles [A43] [A42] [A37] [A35] [A34] [A32] [A31] [A24] [A23] [A17] [A15] [A13]

Phylogeny (Distances): Articles [A44] [A37] [A35]

Computer simulation of gene evolution (Languages; Stochastic automata; Applications with genes, introns, 5' and 3' regions, etc.): Articles [A20] [A18] [A16] [A14] [A12] [A11] [A10] [A8]

Regulatory network theory: Articles [A26] [A25]

Computational methods (Statistical methods; Signal processing; Periodicity identification; Motif identification; Genome annotation; Entropy; Alignment; Research software): Articles [A33] [A28] [A27] [A18] [A16] [A14] [A11] [A9] [A8] [A7] [A6] [A5] [A4] [A3] [A2] [A1]

(An article covering two research fields may be listed twice.)



THE GENETIC CODE

An attractive problem solved with 2 codes?

A genetic code for coding the amino acids and a circular code to retrieve the reading frames of genes.

A summary [PDF]

The genetic code associates trinucleotides (three-letter words) over the 4-letter alphabet {A,C,G,T} with amino acids (letters) over a 20-letter alphabet. Of the 4³=64 trinucleotides, 61 code the 20 amino acids; the three stop trinucleotides {TAA,TAG,TGA} do not code. There are three start trinucleotides {ATG,GTG,TTG}, of which ATG is the standard one and codes the amino acid methionine. A start and a stop trinucleotide delimit a series of nucleotides (letters) in a genome, which the genetic code translates three nucleotides at a time. This series of trinucleotides in a reading frame (also called codons) defines a gene, which codes the series of amino acids constituting a protein.
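The counts above, and the reading of a gene three nucleotides at a time, can be illustrated with a minimal Python sketch. The function name `read_gene` and the example sequence are hypothetical and chosen only for illustration; the start and stop sets are the ones given in the paragraph above.

```python
from itertools import product

ALPHABET = "ACGT"
STOP = {"TAA", "TAG", "TGA"}    # stop trinucleotides (do not code)
START = {"ATG", "GTG", "TTG"}   # start trinucleotides (ATG is the standard one)

# All 4**3 = 64 trinucleotides over {A, C, G, T}.
TRINUCLEOTIDES = ["".join(t) for t in product(ALPHABET, repeat=3)]
# The 61 trinucleotides that code amino acids.
CODING = [t for t in TRINUCLEOTIDES if t not in STOP]


def read_gene(sequence, start):
    """Read codons from a start trinucleotide at index `start`
    up to (and excluding) the first stop trinucleotide in the same frame.
    Returns None when no in-frame stop is found."""
    if sequence[start:start + 3] not in START:
        raise ValueError("no start trinucleotide at this position")
    codons = []
    for i in range(start, len(sequence) - 2, 3):
        codon = sequence[i:i + 3]
        if codon in STOP:
            return codons
        codons.append(codon)
    return None


print(len(TRINUCLEOTIDES), len(CODING))
print(read_gene("CCATGGAAGTCTAAGG", 2))
```

Reading the same sequence from a different position gives a different series of trinucleotides, which is exactly the reading frame problem discussed in the rest of this section.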

In 1957, before the discovery of the genetic code, a class of trinucleotide codes called comma-free codes (or codes without commas) was proposed by Crick et al. [Crick, Griffith, Orgel. Proc. Natl. Acad. Sci. 43, 416-421, 1957] to explain how the reading of a series of trinucleotides could code amino acids. The two questions of interest were: why are there more trinucleotides than amino acids, and how does one choose the reading frame?

Crick et al. (1957) proposed that only 20 trinucleotides among 64 code the 20 amino acids. Such a bijective code implies that the coding trinucleotides are found only in one frame. The determination of a set of 20 trinucleotides forming a comma-free code has several constraints:

(i) A trinucleotide with identical nucleotides must be excluded from such a code. Indeed, the concatenation of AAA with itself (for instance) does not allow the (original) reading frame to be retrieved as there are three possible decompositions: ...AAA,AAA,AAA,..., ...A,AAA,AAA,AA... and ...AA,AAA,AAA,A..., the commas showing the adopted decomposition.

(ii) Two trinucleotides related by circular permutation, for example AAC and ACA, cannot both belong to such a code. Indeed, the concatenation of AAC with itself (for instance) does not allow the reading frame to be retrieved as there are two possible decompositions: …AAC,AAC,AAC,… and …A,ACA,ACA,AC….

Therefore, by excluding the four trinucleotides with identical nucleotides AAA, CCC, GGG and TTT and by gathering the 60 remaining trinucleotides into 20 classes of three trinucleotides such that, in each class, the three trinucleotides are deduced from each other by circular permutation, e.g., AAC, ACA and CAA, we see that a comma-free code has at most one trinucleotide per class and therefore contains at most 20 trinucleotides. This maximal number of trinucleotides is identical to the number of amino acids, thus leading to a code assigning one trinucleotide per amino acid without ambiguity.
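The counting argument can be reproduced with a short Python sketch. The helper names (`rotations`, `permutation_classes`, `respects_constraints`) are hypothetical; the last function only tests the two necessary conditions (i) and (ii), which is not by itself a full comma-free test.

```python
from itertools import product


def rotations(word):
    """All circular permutations of a word, as a set."""
    return {word[i:] + word[:i] for i in range(len(word))}


def permutation_classes(alphabet="ACGT", length=3):
    """Group all words of the given length into classes of circular permutations."""
    classes = set()
    for letters in product(alphabet, repeat=length):
        classes.add(frozenset(rotations("".join(letters))))
    return classes


CLASSES = permutation_classes()
# Classes of size 1: the trinucleotides with identical nucleotides.
SINGLETONS = sorted(min(c) for c in CLASSES if len(c) == 1)
# Classes of size 3: the 60 remaining trinucleotides.
FULL = [c for c in CLASSES if len(c) == 3]


def respects_constraints(code):
    """Necessary conditions (i) and (ii) only: no word with identical
    nucleotides and at most one word per class of circular permutations."""
    code = set(code)
    if any(len(rotations(w)) == 1 for w in code):
        return False
    return all(len(code & c) <= 1 for c in CLASSES)


print(SINGLETONS, len(FULL))
```

With these definitions, `respects_constraints({"AAC", "ACA"})` is false because both words lie in the same class, while `respects_constraints({"AAC", "GTT"})` is true.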

Some basic results on trinucleotide comma-free codes were obtained by Golomb et al. [Golomb, Gordon, Welch. Canad. J. Math. 10, 202-209, 1958; Golomb, Welch, Delbrück. Biol. Medd. Dan. Vid. Selsk. 23, 1958]. However, no trinucleotide comma-free code has been identified statistically in genes. Furthermore, the discovery in 1961 that the trinucleotide TTT, an excluded trinucleotide in a comma-free code, codes phenylalanine [Nirenberg, Matthaei. Proc. Natl. Acad. Sci. 47, 1588-1602, 1961] led to the abandonment of the concept of a comma-free code over the alphabet {A,C,G,T}. For several biological reasons, in particular the interaction between mRNA and tRNA, this concept was later taken up again over the purine/pyrimidine alphabet {R,Y} (R={A,G}, Y={C,T}) with two trinucleotide comma-free codes for primitive genes: RRY [Crick, Brenner, Klug, Pieczenik. Origins of Life 7, 389-397, 1976] and RNY (N={R,Y}) [Eigen, Schuster. Naturwissenschaften 65, 341-369, 1978].
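The comma-free property can be checked mechanically. The sketch below assumes the usual definition for trinucleotide codes (for any two words u and v of the code, neither of the two shifted trinucleotides read inside uv belongs to the code) and applies it to RRY and RNY written over {R,Y}. The name `is_comma_free` is a hypothetical helper, not part of any software listed on this page.

```python
from itertools import product


def is_comma_free(code):
    """True if no trinucleotide of the code can be read in a shifted frame
    inside the concatenation of any two words of the code."""
    code = set(code)
    for u, v in product(code, repeat=2):
        pair = u + v
        # Trinucleotides read with a shift of 1 and of 2 letters.
        if pair[1:4] in code or pair[2:5] in code:
            return False
    return True


RRY = {"RRY"}
RNY = {"RRY", "RYY"}  # N stands for R or Y

print(is_comma_free(RRY), is_comma_free(RNY))
print(is_comma_free({"AAA"}), is_comma_free({"AAC", "ACA"}))
```

The last two calls illustrate constraints (i) and (ii): a trinucleotide with identical nucleotides, or two trinucleotides related by circular permutation, make the test fail.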

In 1996, a statistical study of trinucleotide occurrences per frame identified a set X(EUK_PRO) of 20 trinucleotides in the gene populations of both eukaryotes EUK and prokaryotes PRO [Arquès, Michel. J. Theor. Biol. 182, 45-58, 1996]. This set is a trinucleotide circular code with several strong biomathematical properties. A circular code, which satisfies weaker conditions than a comma-free code, is a set of words over an alphabet such that any word written on a circle has at most one decomposition into words of the circular code [Lassez. Int. J. Comput. Syst. Sciences 5, 201-208, 1976]. The construction frame of a word generated by any concatenation of words of a circular code can be retrieved after reading, anywhere in the generated word, a certain number of nucleotides that depends on the code. This series of nucleotides is called the window of the circular code. The minimal window length is the length of the longest ambiguous word, i.e. a word that can be read in at least two frames, plus one letter.
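Frame retrieval can be illustrated with a small Python sketch. The 20 trinucleotides of X below are not listed on this page; they are given as this set is commonly reported in the literature and should be checked against [A19] before any serious use. The function `compatible_frames` is a hypothetical helper that simply returns the frames in which every complete trinucleotide of a sequence belongs to the code.

```python
# Assumption: the set X as commonly reported in the literature (verify against [A19]).
X = frozenset(
    "AAC AAT ACC ATC ATT CAG CTC CTG GAA GAC "
    "GAG GAT GCC GGC GGT GTA GTC GTT TAC TTC".split()
)


def compatible_frames(sequence, code=X):
    """Return the shifts (0, 1 or 2) for which all complete trinucleotides
    of the sequence, read from that shift, belong to the code."""
    frames = []
    for shift in range(3):
        trinucleotides = [
            sequence[i:i + 3] for i in range(shift, len(sequence) - 2, 3)
        ]
        if trinucleotides and all(t in code for t in trinucleotides):
            frames.append(shift)
    return frames


word = "GAA" + "GTC" + "ATT" + "CTG" + "TAC"   # concatenation of words of X
print(compatible_frames(word))       # frame of construction
print(compatible_frames(word[1:]))   # same word read from its second letter
print(compatible_frames("AACC"))     # too short: still ambiguous
```

In this example the first call returns only frame 0 and the second only frame 2, so the frame of construction is recovered in both cases. The four-letter read AACC is compatible with two frames (AAC and ACC are both in X), which shows why a window of sufficient length is needed.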

Similar to the existence of variant genetic codes (compared to the universal one), several trinucleotide circular codes have been found in genes: one code X(MIT) in mitochondria [Arquès, Michel. J. Theor. Biol. 189, 273-290, 1997], 15 codes X(GArchaea) in archaeal genomes [Frey, Michel. J. Theor. Biol. 223, 413-431, 2003] and 72 codes X(GBacteria) in 175 complete bacterial genomes (with several bacterial genomes having the same codes) [Frey, Michel. J. Comput. Biol. Chem. 30, 87-101, 2006].

Recently, we showed that the circular code may also have biological functions in frameshift genes [A36] and microRNAs [A41].



GENE EVOLUTION

A summary [PDF]

(under construction)



RESEARCH SOFTWARE

Server partially available.

SEGMweb (Stochastic Evolution of Genetic Motifs) (Benard E., Michel C.J., 2009) to determine evolutionary analytical solutions of nucleotides, dinucleotides and trinucleotides

http://lsiit-bioinfo.u-strasbg.fr:8080/webMathematica/SEGM/SEGM.html

DNAdistree (Criscuolo A., Michel C.J., 2009) to infer phylogenetic trees according to distance methods based on weighted phylogenetic distances

http://lsiit-bioinfo.u-strasbg.fr:8080/DNADISTREE/index.html

FPTFweb (Frame Permutated Trinucleotide Frequency) (Frey G., Jung M., Michel C.J., 2008) to identify preferential trinucleotide sets in the 3 frames of genes and to determine if they are circular codes or not

http://lsiit-bioinfo.u-strasbg.fr:8080/FPTFweb/index.jsp

iGOTdatabase (integrated Gene Ontology and Taxonomy) (Ahmed A., Frey G., Michel C.J., 2009)

http://lsiit-bioinfo.u-strasbg.fr/iGOT/



ARTICLES IN INTERNATIONAL JOURNALS

2009

[A45] Benard E., Michel C.J. (2009). Computation of direct and inverse mutations with the SEGM web server (Stochastic Evolution of Genetic Motifs): an application to splice sites of human genome introns. Journal of Computational Biology and Chemistry 33, 245-252. [PDF]

[A44] Criscuolo A., Michel C.J. (2009). Phylogenetic inference with weighted codon evolutionary distances. Journal of Molecular Evolution 68, 377-392. [PDF]

[A43] Bahi J.M., Michel C.J. (2009). A stochastic model of gene evolution with time dependent pseudochaotic mutations. Bulletin of Mathematical Biology 71, 681-700. [PDF]

2008

[A42] Bahi J.M., Michel C.J. (2008). A stochastic model of gene evolution with chaotic mutations. Journal of Theoretical Biology 255, 53-63. [PDF]

[A41] Ahmed A., Michel C.J. (2008). Plant microRNA detection using the circular code information. Journal of Computational Biology and Chemistry 32, 400-405. [PDF]

[A40] Michel C.J., Pirillo G., Pirillo M.A. (2008). A relation between trinucleotide comma-free codes and trinucleotide circular codes. Theoretical Computer Science 401, 17-25. [PDF]

[A39] Michel C.J., Pirillo G., Pirillo M.A. (2008). Varieties of comma free codes. Computers and Mathematics with Applications 55, 989-996. [PDF]

[A38] Michel C.J. (2008). A 2006 review of circular codes in genes. Computers and Mathematics with Applications 55, 984-988. [PDF]

2007

[A37] Michel C.J. (2007). Evolution probabilities and phylogenetic distance of dinucleotides. Journal of Theoretical Biology 249, 271-277. [PDF]

[A36] Ahmed A., Frey G., Michel C.J. (2007). Frameshift signals in genes associated with the circular code. In Silico Biology 7, 155-168. [http://www.bioinfo.de/isb/2007/07/0016/]

[A35] Michel C.J. (2007). Codon phylogenetic distance. Journal of Computational Biology and Chemistry 31, 36-43. [PDF]

[A34] Michel C.J. (2007). An analytical model of gene evolution with 9 mutation parameters: an application to the amino acids coded by the common circular code. Bulletin of Mathematical Biology 69, 677-698. [PDF]

2006

[A33] Frey G., Michel C.J. (2006). Identification of circular codes in bacterial genomes and their use in a factorization method for retrieving the reading frames of genes. Journal of Computational Biology and Chemistry 30, 87-101. [PDF]

[A32] Frey G., Michel C.J. (2006). An analytical model of gene evolution with 6 mutation parameters: an application to archaeal circular codes. Journal of Computational Biology and Chemistry 30, 1-11. [PDF]

2004

[A31] Bahi J.M., Michel C.J. (2004). A stochastic gene evolution model with time dependent mutations. Bulletin of Mathematical Biology 66, 763-778. [PDF]

2003

[A30] Frey G., Michel C.J. (2003). Circular codes in archaeal genomes. Journal of Theoretical Biology 223, 413-431. [PDF]

[A29] Michel C.J. (2003). A computer method for identifying patterns in the electroencephalogram signals. Journal of Medical Engineering and Technology 27, 267-275. [PDF]

2002

[A28] Arquès D.G., Lacan J., Michel C.J. (2002). Identification of protein coding genes in genomes with statistical functions based on the circular code. Biosystems 66, 73-92. [PDF]

2001

[A27] Lacan J., Michel C.J. (2001). Analysis of a circular code model. Journal of Theoretical Biology 213, 159-170. [PDF]

2000

[A26] Bahi J.M., Michel C.J. (2000). Convergence of discrete asynchronous iterations. International Journal of Computer Mathematics 74, 113-125. [PDF]

1999

[A25] Bahi J.M., Michel C.J. (1999). Simulations of asynchronous evolution of discrete systems. Simulation Practice and Theory 7, 309-324. [PDF]

[A24] Arquès D.G., Fallot J.-P., Marsan L., Michel C.J. (1999). An evolutionary analytical model of a complementary circular code. Biosystems 49, 83-103. [PDF]

1998

[A23] Arquès D.G., Fallot J.-P., Michel C.J. (1998). An evolutionary analytical model of a complementary circular code simulating the protein coding genes, the 5' and 3' regions. Bulletin of Mathematical Biology 60, 163-194. [PDF]

1997

[A22] Arquès D.G., Michel C.J. (1997). A circular code in the protein coding genes of mitochondria. Journal of Theoretical Biology 189, 273-290. [PDF]

[A21] Arquès D.G., Michel C.J. (1997). A code in the protein coding genes. Biosystems 44, 107-134. [PDF]

[A20] Arquès D.G., Fallot J.-P., Michel C.J. (1997). An evolutionary model of a complementary circular code. Journal of Theoretical Biology 185, 241-253. [PDF]

1996

[A19] Arquès D.G., Michel C.J. (1996). A complementary circular code in the protein coding genes. Journal of Theoretical Biology 182, 45-58. [PDF]

[A18] Arquès D.G., Fallot J.-P., Michel C.J. (1996). Identification of several types of periodicities in the collagens and their simulation. International Journal of Biological Macromolecules 19, 131-138. [PDF]

1995

[A17] Arquès D.G., Michel C.J. (1995). Analytical solutions of the dinucleotide probability after and before random mutations. Journal of Theoretical Biology 175, 533-544. [PDF]

[A16] Arquès D.G., Lapayre J.-C., Michel C.J. (1995). Identification and simulation of shifted periodicities common to protein coding genes of eukaryotes, prokaryotes and viruses. Journal of Theoretical Biology 172, 279-291. [PDF]

1994

[A15] Arquès D.G., Michel C.J. (1994). Analytical expression of the purine/pyrimidine autocorrelation function after and before random mutations. Mathematical Biosciences 123, 103-125. [PDF]

1993

[A14] Arquès D.G., Michel C.J. (1993). Identification and simulation of new non-random statistical properties common to different eukaryotic gene subpopulations. Biochimie 75, 399-407. [PDF]

[A13] Arquès D.G., Michel C.J. (1993). Analytical expression of the purine/pyrimidine codon probability after and before random mutations. Bulletin of Mathematical Biology 55, 1025-1038. [PDF]

[A12] Arquès D.G., Michel C.J. (1993). A model of gene evolution based on recognizable languages and on insertion and deletion operations. Modelling and Simulation 13, 110-113. [PDF]

[A11] Arquès D.G., Michel C.J., Orieux K. (1993). Identification and simulation of new non-random statistical properties common to different populations of eukaryotic non-coding genes. Journal of Theoretical Biology 161, 329-342. [PDF]

1992

[A10] Arquès D.G., Michel C.J. (1992). A simulation of the genetic periodicities modulo 2 and 3 with processes of nucleotide insertions and deletions. Journal of Theoretical Biology 156, 113-127. [PDF]

[A9] Arquès D.G., Michel C.J., Orieux K. (1992). Analysis of Gene Evolution: the software AGE. Computer Applications in the Biosciences (now called Bioinformatics) 8, 5-14. [PDF]

1990

[A8] Arquès D.G., Michel C.J. (1990). A model of DNA sequence evolution. Part 1: Statistical features and classification of gene populations, 743-753. Part 2: Simulation model, 753-766. Part 3: Return of the model to the reality, 766-770. Bulletin of Mathematical Biology 52, 741-772. [PDF]

[A7] Arquès D.G., Michel C.J. (1990). Periodicities in coding and noncoding regions of the genes. Journal of Theoretical Biology 143, 307-318. [PDF]

1989

[A6] Michel C.J. (1989). A study of the purine/pyrimidine codon occurrence with a reduced centered variable and an evaluation compared to the frequency statistic. Mathematical Biosciences 97, 161-177. [PDF]

1987

[A5] Arquès D.G., Michel C.J. (1987). Periodicities in introns. Nucleic Acids Research 15, 7581-7592. [PDF]

[A4] Arquès D.G., Michel C.J. (1987). A purine-pyrimidine motif verifying an identical presence in almost all gene taxonomic groups. Journal of Theoretical Biology 128, 457-461. [PDF]

[A3] Arquès D.G., Michel C.J. (1987). Study of a perturbation in the coding periodicity. Mathematical Biosciences 86, 1-14. [PDF]

1986

[A2] Michel C.J., Jacq B., Arquès D.G., Bickle T.A. (1986). A remarkable amino acid sequence homology between a phage T4 tail fibre protein and ORF314 of phage lambda located in the tail operon. Gene 44, 147-150. [PDF]

[A1] Michel C.J. (1986). New statistical approach to discriminate between protein coding and non-coding regions in DNA sequences and its evaluation. Journal of Theoretical Biology 120, 223-236. [PDF]
