Joint haplotype phasing and genotype calling of multiple individuals using haplotype informative reads

Academic Article


  • Motivation: Hidden Markov model, based on Li and Stephens model that takes into account chromosome sharing of multiple individuals, results in mainstream haplotype phasing algorithms for genotyping arrays and next-generation sequencing (NGS) data. However, existing methods based on this model assume that the allele count data are independently observed at individual sites and do not consider haplotype informative reads, i.e. reads that cover multiple heterozygous sites, which carry useful haplotype information. In our previous work, we developed a new hidden Markov model to incorporate a two-site joint emission term that captures the haplotype information across two adjacent sites. Although ourmodel improves the accuracy of genotype calling and haplotype phasing, haplotype information in reads covering non-adjacent sites and/or more than two adjacent sites is not used because of the severe computational burden. Results: We develop a new probabilistic model for genotype calling and haplotype phasing from NGS data that incorporates haplotype information of multiple adjacent and/or non-adjacent sites covered by a read over an arbitrary distance. We develop a new hybrid Markov Chain Monte Carlo algorithm that combines the Gibbs sampling algorithm of HapSeq and Metropolis-Hastings algorithm and is computationally feasible. We show by simulation and real data from the 1000 Genomes Project that our model offers superior performance for haplotype phasing and genotype calling for population NGS data over existing methods. © The Author 2013.
  • Authors

    Published In

  • Bioinformatics  Journal
  • Digital Object Identifier (doi)

    Author List

  • Zhang K; Zhi D
  • Start Page

  • 2427
  • End Page

  • 2434
  • Volume

  • 29
  • Issue

  • 19