VG2 Site Clustering

Given two biallelic SNPs, A and B, with alleles A1/A2 and B1/B2 respectively, there are nine possible two-site genotypes (Table 1).

Table 1: Two-site genotypes (diplotypes)

                    Site B genotype
Site A genotype     B1B1        B1B2         B2B2
A1A1                A1A1B1B1    A1A1B1B2     A1A1B2B2
A1A2                A1A2B1B1    A1A2B1B2*    A1A2B2B2
A2A2                A2A2B1B1    A2A2B1B2     A2A2B2B2

For eight of the nine possible two-site genotypes, the underlying haplotypes are unambiguous. Only for the double heterozygote (marked * in Table 1) is the phase unknown: both A1B1/A2B2 and A1B2/A2B1 are consistent with diplotype A1A2B1B2. The haplotype frequencies estimated from the unambiguous genotypes can be used to infer the haplotypes of the double heterozygotes (1), allowing us to pursue analysis at the haplotype level.
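The inference step can be illustrated with a short sketch. This is an EM-style iteration written for illustration, not necessarily the exact procedure of (1) or the one VG2 uses. The function name, the 3x3 count layout (rows A1A1, A1A2, A2A2; columns B1B1, B1B2, B2B2, as in Table 1), and the dictionary keys are assumptions made for this example.

```python
def estimate_haplotype_freqs(counts, n_iter=100, tol=1e-10):
    """Estimate two-site haplotype frequencies from a 3x3 genotype count table.

    counts[i][j] is the number of samples with site A genotype i
    (0=A1A1, 1=A1A2, 2=A2A2) and site B genotype j (0=B1B1, 1=B1B2, 2=B2B2).
    This layout is an assumption of the sketch, mirroring Table 1.
    """
    (n00, n01, n02), (n10, n11, n12), (n20, n21, n22) = counts
    total = 2 * sum(sum(row) for row in counts)  # number of chromosomes
    if total == 0:
        raise ValueError("no genotypes supplied")

    # Haplotype counts contributed by the eight unambiguous genotypes.
    base = {
        "A1B1": 2 * n00 + n01 + n10,
        "A1B2": 2 * n02 + n01 + n12,
        "A2B1": 2 * n20 + n10 + n21,
        "A2B2": 2 * n22 + n12 + n21,
    }

    # Start by splitting the double heterozygotes evenly between the
    # two possible phases (A1B1/A2B2 and A1B2/A2B1).
    w = 0.5
    freqs = {
        "A1B1": (base["A1B1"] + n11 * w) / total,
        "A2B2": (base["A2B2"] + n11 * w) / total,
        "A1B2": (base["A1B2"] + n11 * (1 - w)) / total,
        "A2B1": (base["A2B1"] + n11 * (1 - w)) / total,
    }

    for _ in range(n_iter):
        # Expected share of double heterozygotes that are A1B1/A2B2.
        cis = freqs["A1B1"] * freqs["A2B2"]
        trans = freqs["A1B2"] * freqs["A2B1"]
        w = cis / (cis + trans) if (cis + trans) > 0 else 0.5

        # Re-estimate frequencies with the double heterozygotes reassigned.
        new = {
            "A1B1": (base["A1B1"] + n11 * w) / total,
            "A2B2": (base["A2B2"] + n11 * w) / total,
            "A1B2": (base["A1B2"] + n11 * (1 - w)) / total,
            "A2B1": (base["A2B1"] + n11 * (1 - w)) / total,
        }
        delta = max(abs(new[h] - freqs[h]) for h in new)
        freqs = new
        if delta < tol:
            break
    return freqs
```

When every unambiguous genotype carries only A1B1 or A2B2 haplotypes, the iteration assigns essentially all double heterozygotes to the A1B1/A2B2 phase, which matches the intuition behind using the unambiguous genotypes to resolve the ambiguous one.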

 

Table 2: Two-site haplotypes

                  Site B allele
Site A allele     B1        B2
A1                A1B1      A1B2
A2                A2B1      A2B2

Based on the inferred haplotype frequencies, the linkage disequilibrium statistic r² describes how similar the genotype patterns are for any pair of SNPs, ranging from 0 (no association) to 1 (complete association).
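As an illustration, r² can be computed from the four haplotype frequencies of Table 2 using the standard definition r² = D² / (pA1 pA2 pB1 pB2), where D = p(A1B1) - pA1 pB1. The function name and dictionary keys below are assumptions made for this sketch, as is the choice to return 0 when either site is monomorphic (r² is undefined in that case).

```python
def r_squared(freqs):
    """Compute r² for a pair of biallelic sites from haplotype frequencies.

    freqs maps the hypothetical keys "A1B1", "A1B2", "A2B1", "A2B2"
    to frequencies that sum to 1.
    """
    p_a1 = freqs["A1B1"] + freqs["A1B2"]  # allele frequency of A1
    p_b1 = freqs["A1B1"] + freqs["A2B1"]  # allele frequency of B1

    # D: deviation of the observed A1B1 frequency from independence.
    d = freqs["A1B1"] - p_a1 * p_b1

    denom = p_a1 * (1 - p_a1) * p_b1 * (1 - p_b1)
    if denom == 0:
        # A monomorphic site has no variation to correlate; returning 0.0
        # is a convention chosen for this sketch.
        return 0.0
    return d * d / denom
```

Two sites whose alleles always travel together (only A1B1 and A2B2 haplotypes present) give r² = 1, while four equally frequent haplotypes give r² = 0.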

 

To cluster sites, we calculate r² for all pairs of sites in a file and then order the sites so that those with similar genotype patterns (high r²) are shown near one another. We use an unweighted average linkage algorithm (UPGMA) to generate a hierarchical tree of cluster relationships (2). UPGMA first groups closely related pairs of sites into clusters, then groups closely related clusters into larger clusters, and continues until all sites are in a single cluster. This frequently produces a visual genotype that appears to separate rare sites from frequent sites. That separation is an artifact: most samples at rare sites are homozygous for the common allele, so rare sites are more similar to one another than to higher-frequency sites.
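The clustering step can be sketched as follows. This is a minimal pure-Python UPGMA written for illustration, not VG2's actual implementation. It assumes that the distance between two sites is 1 - r² (the text does not state the distance measure) and that the display order is obtained by concatenating the members of merged clusters. The function name and input layout are hypothetical.

```python
def upgma_order(r2):
    """Return a site ordering from UPGMA clustering of a pairwise r² matrix.

    r2 is a symmetric list of lists; r2[i][j] is r² between sites i and j.
    Distance is taken as 1 - r² (an assumption of this sketch).
    """
    n = len(r2)
    if n == 0:
        return []

    # Each cluster id maps to the ordered list of sites it contains.
    clusters = {i: [i] for i in range(n)}
    # Pairwise distances keyed by (smaller id, larger id).
    dist = {(i, j): 1.0 - r2[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n

    while len(clusters) > 1:
        # Merge the two closest clusters.
        a, b = min(dist, key=dist.get)
        size_a, size_b = len(clusters[a]), len(clusters[b])
        merged = clusters[a] + clusters[b]
        del clusters[a], clusters[b]

        # Distance from the merged cluster to every other cluster is the
        # size-weighted mean of the two old distances, which equals the
        # average over all member pairs (unweighted average linkage).
        new_dist = {}
        for c in clusters:
            d_ac = dist[(min(a, c), max(a, c))]
            d_bc = dist[(min(b, c), max(b, c))]
            new_dist[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)

        # Drop distances involving the two clusters that were merged.
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        dist.update(new_dist)
        clusters[next_id] = merged
        next_id += 1

    return next(iter(clusters.values()))
```

With four sites where sites 0 and 2 have high r² and sites 1 and 3 have high r², the returned order places 0 next to 2 and 1 next to 3, so sites with similar genotype patterns end up adjacent in the display.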

 

1. Hill, W.G. Estimation of linkage disequilibrium in randomly mating populations. Heredity 33: 229-239, 1974.

2. Sneath, P.H.A. & Sokal, R.R. Numerical Taxonomy. W.H. Freeman and Company, San Francisco, pp. 230-234, 1973.