Kwon, A. T., Chou, A. Y., Arenillas, D. J. & Wasserman, W. W. Validation of skeletal muscle cis-Regulatory module predictions reveals nucleotide composition bias in functional enhancers. PLoS Comput Biol 7, e1002256+ (2011). URL http://dx.doi.org/10.1371/journal.pcbi.1002256.
The title is a bit deceiving, as what is really reported in this article is a failure of in-vivo validation of in-silico predicted muscle enhancers. OK. straight to what they have done:
The essence of the results is summarized (and hidden) in this supplementary table 2:
| Region Set | # Samples Allocated | # Cloned and Tested | # Viable Clones | # Validated as Positive | # Validated / # Viable Clones (%) |
| Background | 192 | 88 | 55 | 4 | 7.3 % |
| Non-muscle | 96 | 55 | 37 | 4 | 10.8 % |
| Muscle | 384 | 198 | 186 | 11 | 5.9 % |
One can see that (1) the testing process has fairly high failure rate, such that less than half of the samples could successfully make it to the final tests; (2) among those tested, because the method has high background noise, they setup a high stringency to call a “muscle specific enhancer”. As a result, a very small fraction (<10%) were called. Surprisingly, the positive hits is not at all enriched among the apriori “muscle specific gene set”. If anything, there is a deficit of positive hits in this set (although the difference is not significant according to Fisher’s Exact Test).
So how are these enhancer elements predicted? They used three programs, the Logistic Regression Analysis (LRA), which is the initial idea of using clusters of binding sites (1998), two later variations–MSCAN and ClusterBuster. Note that all of them were based on this idea of clustering of transcription factor binding sites. Five key TF believed to be involved in muscle development were included with their PWM taken from JASPAR. — Q: how good are these PWMs?
They first assembled three sets of genomic regions: muscle gene regulatory regions; non-muscle gene regulatory regions and background conserved sequences. For the first two, they took literature curated (based on microarray of gene expression profile or individual studies) muscle genes and a random set of the non-muscle genes. They conducted their searches effectively in a +/- 10kb region of a gene, that is, the gene locus plus 10kb upstream and downstream, all non-coding regions included. — Q: is 10kb enough to capture most of the enhancers? Will expand this range include more authentic ones and thus raise their heuristic threshold for calling a significant enrichment for enhancers?
One detail worth noticing is that the constructs they tested are very short, at most 400 bp long (1st. and 3rd. quartiles are 259bp and 365 bp). This is surprising to me as even in Drosophila where the enhancers are known to be quite compact, a 300bp enhancer is considered extremely short. — Q: is this short length a result of the prediction algorithms or artificially trimmed to make them amenable to cloning and testing? Critically, if we believe that minimal functional enhancers in human are on average much longer than 400bp (which is not unreasonable given our knowledge about minimal enhancer length in Drosophila and the comparison between human and flies), could this short length be a key contributor to the surprisingly low validation rate???
There is more details pertaining to their methodologies, such as how they determine whether an amplicon drives “muscle specific gene expression” or not, involving the use of three different cell lines. I’m not sure whether the cell line systems are good enough to capture the essential biologies. Naively, suppose we know the truth functional enhancers for muscle development in-vivo, how many of them can be detected in this cell line system? But since those technical details are well beyond my expertise, I’ll leave them to someone who is more qualified at to critique.
Overall, I find the short length of the inserts the most concerning, and to me that may explain the low success rate of validation. But I’m also not surprised if truly the predictions don’t work that well, which suggests additional rules necessary for predicting enhancer elements.