VI ALIO/EURO Workshop on Applied Combinatorial Optimization

December 15 - 17, 2008, Buenos Aires, Argentina

The pure parcimony haplotyping: overview and mixed integer linear programming models

Daniele Catanzaro and Martine Labbé

Département d’Informatique, Faculté des Sciences

Université Libre de Bruxelles

Most vegetal and animal cells are diploid, i.e. they have two similar (but not identical) copies of each chromosome. In general, individuals from the same species are genetically "very close", as for instance humans: the DNA between two random people is about 99.9 % identical. The individual uniqueness lies in the small differences of bases that can exist where single base DNA differences occur (SNPs, Single Nucleodite Polymorphisms).

The SNP data taken from a single chromosome’s copy is called haplotype and the mixed data on the two copies is called genotype. In general, it is not feasible to examine the two copies of a chromosome separately, and genotype data rather than haplotype data will be obtained, even though it is the haplotype data that will be of greatest use.

Haplotyping estimation from aligned Single Nucleotide Polymorphism (SNP) fragments has attracted more and more attention in the recent years due to its importance in analysis of many fine-scale genetic data. Its application fields range from mapping of complex disease genes to inferring population histories, passing through designing drugs, functional genomics and pharmacogenetics.

From a combinatorial perspective, a genotype can be abstractly expressed as a vector of length n and each value in the vector is either 0,1, or 2. Given a set of genotypes, the Haplotyping Problem consists of determining a pair of haplotypes which are represented as binary vectors such that for any genotype g, the associated binary vectors h, h’ must both have value 0 (or 1) at any position where g has value 0 (or 1), but for any position where g has value 2, exactly one of h, h’ must have value 0, while the other has value 1.

The literature proposes a number of estimation criteria to select a set of haplotypes among possible alternatives. One of the most important estimation criteria is the pure parsimony which states that the optimal set of haplotypes for a given set of genotypes is the one having minimal cardinality. Finding the minimal number of haplotypes necessary to explain a given set of genotypes involves solving an optimization problem, called the Pure Parsimony Haplotyping (PPH) estimation problem, which is notoriously NP-Hard. We present an overview of PPH and discuss the different approaches to solution that occur in the literature, with a particular emphasis on the integer programming models.