Observe.Think.Touch Nature

December 29, 2011

Calculate Nucleotide Diversity — per base pair summation method

Filed under: Science Related,Technical notes — hebin @ 9:58 pm

The definition of (pairwise) nucleotide diversity is straightforward: the average distance between two randomly chosen DNA sequence from a population.

The calculation based on this definition is also simple: sum up the pairwise differences (in nucleotides) for all possible pairs, then divide by the total number of such pairs, which yields the desired quantity.

However, sometimes it is easier to perform the calculation on a per base pair basis, i.e. imagine you march along a multiple alignment, where your unit of calculation is a column of aligned sequences. This is particularly true when one deals with a long stretch of sequences, or when the coverage differs at different locations, as is true for today’s short reads generated from NGS.

I did some search on the internet and didn’t find a simple formula for this. Well, in fact, there is one, from this paper. However it takes a bit of time to digest it. Here is a digested version, or more importantly, a way to think of the formula:

Think of the very original definition as two levels of summation: first sum over the length of the sequence for any random pair of DNA from the population, calculate the total number of differences; then sum that over all pairs. The grand total is then divided by the total number of pairs. Imagine if we expand the summations: every unit is just one bp difference at position X between sequence x and y, which contributed ONE to the summation in the numerator. Now let’s ask how many “ONE”s does position X contribute in total to the numerator. That depends on the frequency of either the minor or major allele. Let’s say there are n sequences in total and the minor allele at X occurs j times out of n. Then the total number of “ONE”s is easy to calculate: it is just (j choose 1) * (n-j choose 1), which equals j*(n-j). The shared denominator is (n choose 2), equaling n(n-1)/2. Therefore the following formula:

sum ( 2j(n-j) / n(n-1) )

Note that this formula does a bp-by-bp calculation and thus it seems that one can incorporate the coverage differences into the calculation by varying n for different sites. Indeed, with some simple algebra, one can show that this way of calculation equals the coverage adjusted nucleotide diversity estimate proposed by Begun et al in the paper I mentioned above. This calculation is then fairly straightforward.

December 21, 2011

Failure of Enhancer Prediction or failure of in-vivo validation?

Filed under: Papers — hebin @ 10:37 am

Kwon, A. T., Chou, A. Y., Arenillas, D. J. & Wasserman, W. W. Validation of skeletal muscle cis-Regulatory module predictions reveals nucleotide composition bias in functional enhancers. PLoS Comput Biol 7, e1002256+ (2011). URL http://dx.doi.org/10.1371/journal.pcbi.1002256.

The title is a bit deceiving, as what is really reported in this article is a failure of in-vivo validation of in-silico predicted muscle enhancers. OK. straight to what they have done:

The essence of the results is summarized (and hidden) in this supplementary table 2:

Region Set # Samples Allocated # Cloned and Tested # Viable Clones # Validated as Positive # Validated / # Viable Clones (%)
Background 192 88 55 4 7.3 %
Non-muscle 96 55 37 4 10.8 %
Muscle 384 198 186 11 5.9 %

One can see that (1) the testing process has fairly high failure rate, such that less than half of the samples could successfully make it to the final tests; (2) among those tested, because the method has high background noise, they setup a high stringency to call a “muscle specific enhancer”. As a result, a very small fraction (<10%) were called. Surprisingly, the positive hits is not at all enriched among the apriori “muscle specific gene set”. If anything, there is a deficit of positive hits in this set (although the difference is not significant according to Fisher’s Exact Test).

So how are these enhancer elements predicted? They used three programs, the Logistic Regression Analysis (LRA), which is the initial idea of using clusters of binding sites (1998), two later variations–MSCAN and ClusterBuster. Note that all of them were based on this idea of clustering of transcription factor binding sites. Five key TF believed to be involved in muscle development were included with their PWM taken from JASPAR. — Q: how good are these PWMs?

They first assembled three sets of genomic regions: muscle gene regulatory regions; non-muscle gene regulatory regions and background conserved sequences. For the first two, they took literature curated (based on microarray of gene expression profile or individual studies) muscle genes and a random set of the non-muscle genes. They conducted their searches effectively in a +/- 10kb region of a gene, that is, the gene locus plus 10kb upstream and downstream, all non-coding regions included. — Q: is 10kb enough to capture most of the enhancers? Will expand this range include more authentic ones and thus raise their heuristic threshold for calling a significant enrichment for enhancers?

One detail worth noticing is that the constructs they tested are very short, at most 400 bp long (1st. and 3rd. quartiles are 259bp and 365 bp). This is surprising to me as even in Drosophila where the enhancers are known to be quite compact, a 300bp enhancer is considered extremely short. — Q: is this short length a result of the prediction algorithms or artificially trimmed to make them amenable to cloning and testing? Critically, if we believe that minimal functional enhancers in human are on average much longer than 400bp (which is not unreasonable given our knowledge about minimal enhancer length in Drosophila and the comparison between human and flies), could this short length be a key contributor to the surprisingly low validation rate???

There is more details pertaining to their methodologies, such as how they determine whether an amplicon drives “muscle specific gene expression” or not, involving the use of three different cell lines. I’m not sure whether the cell line systems are good enough to capture the essential biologies. Naively, suppose we know the truth functional enhancers for muscle development in-vivo, how many of them can be detected in this cell line system? But since those technical details are well beyond my expertise, I’ll leave them to someone who is more qualified at to critique.

Overall, I find the short length of the inserts the most concerning, and to me that may explain the low success rate of validation. But I’m also not surprised if truly the predictions don’t work that well, which suggests additional rules necessary for predicting enhancer elements.

Theme: Rubric. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.