Change-Point Detection in Binary Markov DNA Sequences by Cross-Entropy Method
A deoxyribonucleic acid (DNA) sequence can be represented as a sequence with 4 characters. If a particular property of the DNA is studied, for example, GC content, then it is possible to consider a binary sequence. In many cases, if the probabilistic properties of a segment differ from the neighbouring ones, this means that the segment can play a structural role. Therefore, DNA segmentation is given a special attention, and it is one of the most significant applications of change-point detection. Problems of this type also arise in a wide variety of areas, for example, seismology, industry (e.g, fault detection), biomedical signal processing, financial mathematics, speech and image processing. In this study, we have developed a Cross-Entropy algorithm for identifying change-points in binary sequences with first-order Markov dependence. We propose a statistical model for this problem and show effectiveness of our algorithm for synthetic and real datasets.(original abstract)
- Avery P. and Henderson D., "Detecting a changed segment in DNA sequences," Appl. Statist., vol. 48, no. 4, pp. 489-503, 1999, http://dx.doi.org/10.1111/1467-9876.00167
- Avery P. and Henderson D., "Fitting Markov chain models to discrete state series such as DNA sequences," Appl. Statist., vol. 48, no. 1, pp. 53-61, 1999, http://dx.doi.org/10.1111/1467-9876.00139
- Bernardi G., Structural and evolutionary genomics. Natural Selection in Genome Evolution. Amsterdam: Elsevier, 2004.
- Botev Z. I., Kroese D., and Taimre T., "Generalized crossentropy methods with applications to rare-event simulation and optimization," Simulation, vol. 83, no. 11, pp. 785-806, 2007, http://dx.doi.org/10.1177/0037549707087067
- Boys R. and Henderson D., "A Bayesian approach to DNA sequence segmentation," Biometrics, vol. 60, pp. 573-588, 2004, http://dx.doi.org/10.1111/j.0006-341X.2005.040701_1.x
- Braun J., Braun R., and Müller H., "Multiple changepoint fitting via quasilikelihood, with application to DNA sequence segmentation," Biometrika, vol. 87, no. 2, pp. 301-314, 2000, http://dx.doi.org/10.1093/biomet/87.2.301
- Costa A., Jones O., and Kroese D., "Convergence properties of the cross- entropy method for discrete optimization," Operations Research Letters, vol. 35, no. 5, pp. 573-580, 2007, http://dx.doi.org/10.1016/j.orl.2006.11.005
- Costantini M., Clay O., Auletta F., and Bernardi G., "An isochore map of human chromosomes," Genome Res., vol. 16, pp. 536-541, 2006, http://dx.doi.org/10.1101/gr.4910606
- Evans G. E., Sofronov G. Y., Keith J. M., and Kroese D. P., "Estimating change-points in biological sequences via the cross-entropy method," Ann. Oper. Res., vol. 189, no. 1, pp. 155-165, 2011, http://dx.doi.org/10.1007/s10479-010-0687-0
- Hurst L., Brunton C., and Smith N., "Small introns tend to occur in GC-rich regions in some but not all vertebrates," Trends Genet, vol. 15, pp. 437-439, 1999, http://dx.doi.org/10.1016/S0168-9525(99)01832-6
- Ikemura T. and Wada K., "Evident diversity of codon usage patterns of human genes with respect to chromosome banding patterns and chromosome numbers; relation between nucleotide sequence data and cytogenetic data," Nucleic Acids Res., vol. 19, pp. 4333-4339, 1991.
- Keith J. M., "Segmenting eukaryotic genomes with the generalized Gibbs sampler," J. Comp. Biol., vol. 13, no. 7, pp. 1369-1383, 2006, http://dx.doi.org/10.1089/cmb.2006.13.1369
- Keith J. M., Adams P., Stephen S., and Mattick J. S., "Delineating slowly and rapidly evolving fractions of the drosophila genome," J. Comp. Biol., vol. 15, no. 4, pp. 407-430, 2008, http://dx.doi.org/10.1089/cmb.2007.0173
- Keith J. M., Kroese D. P., and Bryant D., "A generalized Markov sampler," Methodology and Computing in Applied Probability, vol. 6, no. 1, pp. 29-53, 2004, http://dx.doi.org/10.1023/B:MCAP.0000012414.14405.15
- Keith J., Kroese D., and Sofronov G., "Adaptive independence samplers," Statistics and Computing, vol. 18, no. 4, pp. 409-420, 2008, http://dx.doi.org/10.1007/s11222-008-9070-2
- Krauth J., "Multiple change points and alternating segments in binary trials with dependence," in Innovations in Classification, Data Science, and Information Systems, D. Baier and K. Wernecke, Eds. Springer, Berlin, 2004, pp. 154-164, http://dx.doi.org/10.1007/3-540-26981-9_19
- Krauth J., "Tests for multiple change points in binary Markov sequences," in From Data and Information Analysis to Knowledge Engineering. Springer, Berlin, 2006, pp. 670-677, http://dx.doi.org/10.1007/3-540-31314-1_82
- López I., Gámez M., Garay J., Standovár T., and Varga Z., "Application of change-point problem to the detection of plant patches," Acta Bio-theoretica, vol. 58, pp. 51-63, 2010, http://dx.doi.org/10.1007/s10441-009-9093-x
- Oliver J., Bernaola-Galvan P., Carpena P., and Roman-Roldan R., "Isochore chromosome maps of eukaryotic genomes," Gene, vol. 276, pp. 47-56, 2001, http://dx.doi.org/10.1016/S0378-1119(01)00641-2
- Oliver J., Carpena P., Roman-Roldan R., Mata-Balaguer T., Mejias-Romero A., Hackenberg M., and Bernaola-Galvan P., "Isochore chromosome maps of the human genome," Gene, vol. 300, pp. 117-127, 2002, http://dx.doi.org/10.1016/S0378-1119(02)01034-X
- Oliver J., Roman-Roldan R., Perez J., and Bernaola-Galvan P., "Segment: identifying compositional domains in DNA sequences," Bioinformatics, vol. 15, pp. 974-979, 1999, http://dx.doi.org/10.1093/bioinformatics/15.12.974
- Pettitt A., "A non-parametric approach to the changepoint problem," Appl. Statist., vol. 28, pp. 126-135, 1979, http://dx.doi.org/10.2307/2346729
- Polansky A., "Detecting change-points in Markov chains," Computational Statistics and Data Analysis, vol. 51, pp. 6013-6026, 2007, http://dx.doi.org/10.1016/j.csda.2006.11.040
- Polushina T. and Sofronov G., "Change-point detection in biological sequences via genetic algorithm," in Proceedings of the IEEE Congress on Evolutionary Computation (CEC'2011), 2011, pp. 1966-1971, http://dx.doi.org/10.1109/CEC.2011.5949856
- Polushina T. V. and Sofronov G. Y., "A hybrid genetic algorithm for change-point detection in binary biomolecular sequences," in Proceedings of the IASTED International Conference on Artificial Intelligence and Applications (AIA 2013), 2013, pp. 1-8, http://dx.doi.org/10.2316/P.2013.793-026
- Priyadarshana M. and Sofronov G., "GAMLSS and Extended Cross-Entropy Method to Detect Multiple Change-Points in DNA Read Count Data," in Proceedings of the 28th International Workshop on Statistical Modelling, Muggeo VMR, Capursi V, Boscaino G, Lovison G (Eds.), 2013, vol.1, pp. 453-457.
- Priyadarshana M. and Sofronov G., "The Cross-Entropy method and multiple change-points detection in zero-inflated DNA read count data," in The 4th International Conference on Computational Methods (ICCM2012), Y. T. Gu, S. C. Saha (Eds.), 2012 , pp. 1-8.
- Priyadarshana M., and Sofronov G., "Breakpoint (R-package);" software available at http://cran.r-project.org/web/packages/breakpoint.
- Priyadarshana M., Polushina T., and Sofronov G., "A hybrid algorithm for multiple change-point detection in continuous measurements," in International Symposium on Computational Models for Life Sciences, AIP Conference Proceedings, vol. 1559, 2013, pp. 108-117, http://dx.doi.org/10.1063/1.4825002
- Priyadarshana M., Polushina T., and Sofronov G., "Hybrid algorithms for multiple change-point detection in biological sequences," in Signal and Image Analysis for Biomedical and Life Sciences (Advances in Experimental Medicine and Biology), Sun, C., Bednarz, T., Pham, T. D., Vallotton, P., Wang, D. (Eds.), Springer, 2014, in press.
- Priyadarshana M., Sofronov G., "A modified cross-entropy method for detecting change-points in the Sri-Lankan stock market," In: The IASTED International Conference on Engineering and Applied Science (EAS2012), Chen, B. M.; Khan, M. T. and Tan, K-K. (Eds.), 2012, pp. 321-326, http://dx.doi.org/10.2316/P.2012.785-041
- Ren L., Gao G., Zhao D., Ding M., Luo J., and Deng H., "Developmental stage related patterns of codon usage and genomic GC content: searching for evolutionary fingerprint by models of stem cell differentiation," Genome Biol., vol. 8, p. R35, 2007, http://dx.doi.org/10.1186/gb-2007-8-3-r35
- Rubinstein R. and Kroese D., Simulation and the Monte Carlo Method. John Wiley & Sons, 2007.
- Rubinstein R. and Kroese D., The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation and Machine Learning. New York: Springer-Verlag, 2004.
- Schwarz G., "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461-464, 1978, http://dx.doi.org/10.1214/aos/1176344136
- Semon M., Mouchiroud D., and Duret L., "Relationship between gene expression and GC-content in mammals: statistical significance and biological relevance," Hum. Mol. Genet., vol. 14, pp. 421-427, 2005, http://dx.doi.org/10.1093/hmg/ddi038
- Sofronov G. Y., Evans G. E., Keith J. M., and Kroese D. P., "Identifying change-points in biological sequences via sequential importance sampling," Environmental Modeling and Assessment, vol. 14, no. 5, pp. 577-584, 2009, http://dx.doi.org/10.1007/s10666-008-9160-8
- Sofronov G., "Change-point modelling in biological sequences via the bayesian adaptive independent sampler," International Proceedings of Computer Science and Information Technology, vol. 5, pp. 122-126, 2011.
- Sofronov G., Polushina T., and Priyadarshana M., "Sequential changepoint detection via the cross-entropy method," in The 11th Symposium on Neural Network Applications in Electrical Engineering (NEUREL2012), B. Reljin and S. Stankovic, (Eds.), 2012, pp. 185-188, http://dx.doi.org/10.1109/NEUREL.2012.6420004
- Sonesson C. and Bock D., "A review and discussion of prospective statistical surveillance in public health," Journal of the Royal Statistical Society. Series A (Statistics in Society), vol. 166, no. 1, pp. 5-21, 2003, http://dx.doi.org/10.1111/1467-985X.00256
- Spencer C., Deloukas P., Hunt S., Mullikin J., Myers S., Silverman B., Donnelly P., Bentley D., and McVean G., "The influence of recombination on human genetic diversity," PLoS Genet, vol. 2, p. e148, 2006, http://dx.doi.org/10.1371/journal.pgen.0020148
- Takano-Shimizu T., "Local changes in GC/AT substitution biases and in crossover frequencies on drosophila chromosomes," Mol. Biol. Evol., vol. 18, pp. 606-619, 2001, http://dx.doi.org/10.1093/oxfordjournals.molbev.a003841
- The MHC Sequencing Consortium, "Complete sequence and gene map of a human major histocompatibility complex," Nature, vol. 401, no. 6756, pp. 921-923, 1999, http://dx.doi.org/10.1038/44853
- Thomson J. R., Kimmerer W. J., Brown L. R., Newman K. B., Mac Nally R., Bennett W. A., Feyrer F., and Fleishman E., "Bayesian change point analysis of abundance trends for pelagic fishes in the upper San Francisco estuary," Ecological Applications, vol. 20, no. 5, pp. 1431-1448, 2010, http://dx.doi.org/10.1890/09-0998.1
- Thurman R., Day N., Noble W., and Stamatoyannopoulos J., "Identification of higher-order functional domains in the human ENCODE regions," Genome Res., vol. 17, pp. 917-927, 2007, http://dx.doi.org/10.1101/gr.6081407
- Vinogradov A., "Dualism of gene GC content and CpG pattern in regard to expression in the human genome: magnitude versus breadth," Trends Genet, vol. 21, pp. 639-643, 2005, http://dx.doi.org/10.1016/j.tig.2005.09.002
- Vinogradov A., "Isochores and tissue-specificity," Nucleic Acids Res., vol. 31, pp. 5212-5220, 2003, http://dx.doi.org/10.1093/nar/gkg699
- Whittaker J. and Frühwirth-Schnatter S., "A dynamic changepoint model for detecting the onset of growth in bacteriological infections," Journal of the Royal Statistical Society. Series C (Applied Statistics), vol. 43, no. 4, pp. 625-640, 1994, http://dx.doi.org/10.2307/2986261
- Willams E. and Hurst L., "The proteins of linked genes evolve at similar rates," Nature, vol. 407, pp. 900-903, 2000, http://dx.doi.org/10.1038/35038066
- Xu H., Wei C., Lin F., and Sung W., "An HMM approach to genome-wide identification of differential histone modification sites from ChIP-seq data," Bioinformatics, vol. 24, pp. 2344-2349, 2008, http://dx.doi.org/10.1093/bioinformatics/btn402
- Yao Y., "Estimating the number of change-points via Schwarz criterion," Statistics and Probability Letters, vol. 6, pp. 181-189, 1988, http://dx.doi.org/10.1016/0167-7152(88)90118-6
- Zeitouni B., Boeva V., Janoueix-Lerosey I., O. Delattre, A. Nicolas, and E. Barillot, "SVDetect - a bioinformatic tool to identify genomic structural variations from paired-end next-generation sequencing data," Bioinformatics, vol. 26, pp. 1895-1896, 2010, http://dx.doi.org/10.1093/bioinformatics/btq293
- Zhang N. and Siegmund D., "A modified bayes information criterion with applications to the analysis of comparative genomic hybridization data," Biometrics, vol. 3, pp. 22-32, 2007, http://dx.doi.org/10.1111/j.1541-0420.2006.00662.x