SMILE

Stochastic Models for the Inference of Life Evolution

A novel heuristic for local multiple alignment of interspersed DNA repeats

Treangen, T. J., Darling, A. E., Achaz, G., Ragan, M. A., Messeguer, X., Rocha, E. P. C.

IEEE/ACM Transactions on Computational Biology and Bioinformatics

2009

Pairwise local sequence alignment methods have been the prevailing technique to identify homologous nucleotides between related species. However, existing methods that identify and align all homologous nucleotides in one or more genomes have suffered from poor scalability and limited accuracy. We propose a novel method that couples a gapped extension heuristic with an efficient filtration method for identifying interspersed repeats in genome sequences. During gapped extension, we use the MUSCLE implementation of progressive global multiple alignment with iterative refinement. The resulting gapped extensions potentially contain alignments of unrelated sequence. We detect and remove such undesirable alignments using a hidden Markov model (HMM) to predict the posterior probability of homology. The HMM emission frequencies for nucleotide substitutions can be derived from any time-reversible nucleotide substitution matrix. We evaluate the performance of our method and previous approaches on a hybrid data set of real genomic DNA with simulated interspersed repeats. Our method outperforms a related method in terms of sensitivity, positive predictive value, and localizing boundaries of homology. The described methods have been implemented in freely available software, Repeatoire, available from: http://wwwabi.snv.jussieu.fr/public/Repeatoire.
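As a rough illustration of the HMM filtering step described in the abstract, the sketch below computes the posterior probability of homology for each column of a gapless pairwise alignment. It is not the Repeatoire implementation. The two-state layout (homologous vs. unrelated), the Jukes-Cantor substitution model (one example of a time-reversible model), the divergence `t`, the self-transition probability `stay`, and the toy sequences are all assumptions chosen for illustration, and gap columns are ignored entirely.

```python
import math

STATES = ("H", "U")  # H = homologous, U = unrelated (assumed two-state layout)

def jc_joint(a, b, t):
    """Joint probability of an aligned nucleotide pair under Jukes-Cantor,
    with t in expected substitutions per site (assumed model)."""
    e = math.exp(-4.0 * t / 3.0)
    cond = 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e
    return 0.25 * cond  # uniform base frequency times P(b | a, t)

def emission(state, a, b, t):
    # Unrelated columns are modelled as two independent uniform draws.
    return jc_joint(a, b, t) if state == "H" else 0.25 * 0.25

def posterior_homology(seq1, seq2, t=0.3, stay=0.95):
    """Scaled forward-backward; returns P(state = H | alignment) per column."""
    n = len(seq1)
    if n == 0 or n != len(seq2):
        raise ValueError("sequences must be non-empty and of equal length")
    trans = {s: {r: stay if s == r else 1.0 - stay for r in STATES}
             for s in STATES}
    init = {s: 0.5 for s in STATES}

    # Forward pass, normalising each column to avoid underflow.
    fwd, scales = [], []
    for i in range(n):
        row = {}
        for s in STATES:
            prior = init[s] if i == 0 else sum(
                fwd[i - 1][r] * trans[r][s] for r in STATES)
            row[s] = prior * emission(s, seq1[i], seq2[i], t)
        z = sum(row.values())
        fwd.append({s: row[s] / z for s in STATES})
        scales.append(z)

    # Backward pass, reusing the forward scaling factors.
    bwd = [None] * n
    bwd[n - 1] = {s: 1.0 for s in STATES}
    for i in range(n - 2, -1, -1):
        bwd[i] = {
            s: sum(trans[s][r]
                   * emission(r, seq1[i + 1], seq2[i + 1], t)
                   * bwd[i + 1][r] for r in STATES) / scales[i + 1]
            for s in STATES
        }

    # Combine and renormalise per column.
    post = []
    for i in range(n):
        h = fwd[i]["H"] * bwd[i]["H"]
        u = fwd[i]["U"] * bwd[i]["U"]
        post.append(h / (h + u))
    return post

# Toy input: 20 identical columns followed by 20 mismatching columns.
seq_a = "ACGT" * 10
seq_b = "ACGT" * 5 + "CATG" * 5
posteriors = posterior_homology(seq_a, seq_b)
```

In a filter of this kind, columns whose posterior falls below a chosen threshold would be trimmed from the extension. Any such threshold is a tuning choice and is not specified in the abstract.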

BibTeX

@article{treangen_novel_2009,
Author = {Treangen, Todd J. and Darling, Aaron E. and Achaz,
Guillaume and Ragan, Mark A. and Messeguer, Xavier and
Rocha, Eduardo P. C.},
Title = {A novel heuristic for local multiple alignment of
interspersed {DNA} repeats},
Journal = {{IEEE/ACM} Transactions on Computational Biology and
Bioinformatics},
Volume = {6},
Number = {2},
Pages = {180--189},
abstract = {Pairwise local sequence alignment methods have been
the prevailing technique to identify homologous
nucleotides between related species. However, existing
methods that identify and align all homologous
nucleotides in one or more genomes have suffered from
poor scalability and limited accuracy. We propose a
novel method that couples a gapped extension heuristic
with an efficient filtration method for identifying
interspersed repeats in genome sequences. During gapped
extension, we use the MUSCLE implementation of
progressive global multiple alignment with iterative
refinement. The resulting gapped extensions potentially
contain alignments of unrelated sequence. We detect and
remove such undesirable alignments using a hidden
Markov model (HMM) to predict the posterior probability
of homology. The HMM emission frequencies for
nucleotide substitutions can be derived from any
time-reversible nucleotide substitution matrix. We
evaluate the performance of our method and previous
approaches on a hybrid data set of real genomic DNA
with simulated interspersed repeats. Our method
outperforms a related method in terms of sensitivity,
positive predictive value, and localizing boundaries of
homology. The described methods have been implemented
in freely available software, Repeatoire, available
from: http://wwwabi.snv.jussieu.fr/public/Repeatoire.},
doi = {10.1109/TCBB.2009.9},
issn = {1557-9964},
language = {eng},
month = jun,
pmid = {19407343},
year = 2009
}

Link to the article

Access the article via its DOI.