Observe.Think.Touch Nature

December 29, 2011

Calculate Nucleotide Diversity — per base pair summation method

Filed under: Science Related,Technical notes — hebin @ 9:58 pm

The definition of (pairwise) nucleotide diversity is straightforward: the average distance between two randomly chosen DNA sequence from a population.

The calculation based on this definition is also simple: sum up the pairwise differences (in nucleotides) for all possible pairs, then divide by the total number of such pairs, which yields the desired quantity.

However, sometimes it is easier to perform the calculation on a per base pair basis, i.e. imagine you march along a multiple alignment, where your unit of calculation is a column of aligned sequences. This is particularly true when one deals with a long stretch of sequences, or when the coverage differs at different locations, as is true for today’s short reads generated from NGS.

I did some search on the internet and didn’t find a simple formula for this. Well, in fact, there is one, from this paper. However it takes a bit of time to digest it. Here is a digested version, or more importantly, a way to think of the formula:

Think of the very original definition as two levels of summation: first sum over the length of the sequence for any random pair of DNA from the population, calculate the total number of differences; then sum that over all pairs. The grand total is then divided by the total number of pairs. Imagine if we expand the summations: every unit is just one bp difference at position X between sequence x and y, which contributed ONE to the summation in the numerator. Now let’s ask how many “ONE”s does position X contribute in total to the numerator. That depends on the frequency of either the minor or major allele. Let’s say there are n sequences in total and the minor allele at X occurs j times out of n. Then the total number of “ONE”s is easy to calculate: it is just (j choose 1) * (n-j choose 1), which equals j*(n-j). The shared denominator is (n choose 2), equaling n(n-1)/2. Therefore the following formula:

sum ( 2j(n-j) / n(n-1) )

Note that this formula does a bp-by-bp calculation and thus it seems that one can incorporate the coverage differences into the calculation by varying n for different sites. Indeed, with some simple algebra, one can show that this way of calculation equals the coverage adjusted nucleotide diversity estimate proposed by Begun et al in the paper I mentioned above. This calculation is then fairly straightforward.

Theme: Rubric. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.