SMILE

Stochastic Models for the Inference of Life Evolution

Bibtex

@article{raineri_snp_2012,
Author = {Raineri, Emanuele and Ferretti, Luca and
Esteve-Codina, Anna and Nevado, Bruno and Heath, Simon
and Pérez-Enciso, Miguel},
Title = {{SNP} calling by sequencing pooled samples},
Journal = {BMC Bioinformatics},
Volume = {13},
Pages = {239},
abstract = {BACKGROUND: Performing high throughput sequencing on
samples pooled from different individuals is a strategy
to characterize genetic variability at a small fraction
of the cost required for individual sequencing. In
certain circumstances some variability estimators have
even lower variance than those obtained with individual
sequencing. SNP calling and estimating the frequency of
the minor allele from pooled samples, though, is a
subtle exercise for at least three reasons. First,
sequencing errors may have a much larger relevance than
in individual SNP calling: while their impact in
individual sequencing can be reduced by setting a
restriction on a minimum number of reads per allele,
this would have a strong and undesired effect in pools
because it is unlikely that alleles at low frequency in
the pool will be read many times. Second, the prior
allele frequency for heterozygous sites in individuals
is usually 0.5 (assuming one is not analyzing sequences
coming from, e.g. cancer tissues), but this is not true
in pools: in fact, under the standard neutral model,
singletons (i.e. alleles of minimum frequency) are the
most common class of variants because P(f) ∝ 1/f and
they occur more often as the sample size increases.
Third, an allele appearing only once in the reads from
a pool does not necessarily correspond to a singleton
in the set of individuals making up the pool, and vice
versa, there can be more than one read - or, more
likely, none - from a true singleton. RESULTS: To
improve upon existing theory and software packages, we
have developed a Bayesian approach for minor allele
frequency (MAF) computation and SNP calling in pools
(and implemented it in a program called snape): the
approach takes into account sequencing errors and
allows users to choose different priors. We also set up
a pipeline which can simulate the coalescence process
giving rise to the SNPs, the pooling procedure and the
sequencing. We used it to compare the performance of
snape to that of other packages. CONCLUSIONS: We
present a software which helps in calling SNPs in
pooled samples: it has good power while retaining a low
false discovery rate (FDR). The method also provides
the posterior probability that a SNP is segregating and
the full posterior distribution of f for every SNP. In
order to test the behaviour of our software, we
generated (through simulated coalescence) artificial
genomes and computed the effect of a pooled sequencing
protocol, followed by SNP calling. In this setting,
snape has better power and False Discovery Rate (FDR)
than the comparable packages samtools, PoPoolation,
Varscan: for N = 50 chromosomes, snape has power ≈
35\% and FDR ≈ 2.5\%. snape is available at
http://code.google.com/p/snape-pooled/ (source code and
precompiled binaries).},
doi = {10.1186/1471-2105-13-239},
issn = {1471-2105},
language = {eng},
pmcid = {PMC3475117},
pmid = {22992255},
year = 2012
}