Chapter II: Chromatin topology
II.2: Results
II.2.1: Simplifying ChIA-PET data to elucidate the most reproducible connections
ChIA-PET poses a particular problem above and beyond the ordinary noisiness of genome-wide data in that certain areas of the genome have connectivity patterns that are extremely complex (Fig. II-2).
Other ChIA-PET studies have done little to elucidate what is occurring at these complex loci (Fullwood, Liu et al. 2009; Handoko, Xu et al. 2011; Li, Ruan et al. 2012; Chepelev, Wei et al. 2012). To simplify the raw ChIA-PET data, I took a graph-theoretical approach. First, I specified a set of candidate vertices from regions of the genome likely to be connected and removed all PETs that did not have both ends in a vertex (Fig. II-3).
I used an independent genome-wide assay, DNase-Seq, as the source of the candidate vertices, along with all annotated TSSs. This narrows the pool of connected regions of the genome to genes and occupied putative CRMs, which are more readily interpretable with current knowledge in the era of ENCODE. Second, for pol2 ChIA-PET, I reported as edges only the places where there were two individual occurrences of raw ChIA-PET paired tags between candidate vertices. To focus on the most reproducible, highest-confidence set of interactions, I performed two separate biological and technical ChIA-PET experiments for each condition and reported only the edges found in both experiments (Fig. II-1). This purposely sacrifices weak signal at the threshold of noise in favor of high-confidence, reproducible signal, so that I can be certain of the existence of the connections I report. A third dataset, myogenin at the myocyte timepoint, has no replicate.
It will be used to determine which aspects of the pol2 ChIA-PETs are factor-dependent.
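The filtering steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used for this work: the interval representation, the function names (`find_vertex`, `build_edges`, `reproducible_edges`), and the toy coordinates are all hypothetical. Only the requirement that both PET ends fall in a vertex, the two-PET threshold, and the both-replicates requirement come from the text. The sketch reads the threshold as "at least two" and drops PETs whose ends fall in the same vertex; both readings are assumptions.

```python
from collections import Counter

def find_vertex(vertices, chrom, pos):
    """Return the index of the candidate vertex containing (chrom, pos), or None."""
    for i, (v_chrom, start, end) in enumerate(vertices):
        if v_chrom == chrom and start <= pos < end:
            return i
    return None

def build_edges(pets, vertices, min_pets=2):
    """Count PETs linking two distinct vertices; keep pairs seen >= min_pets times."""
    counts = Counter()
    for (chrom_a, pos_a), (chrom_b, pos_b) in pets:
        a = find_vertex(vertices, chrom_a, pos_a)
        b = find_vertex(vertices, chrom_b, pos_b)
        # Discard PETs without both ends in a vertex (and, by assumption,
        # PETs whose two ends land in the same vertex).
        if a is None or b is None or a == b:
            continue
        counts[frozenset((a, b))] += 1
    return {edge for edge, n in counts.items() if n >= min_pets}

def reproducible_edges(rep1_pets, rep2_pets, vertices, min_pets=2):
    """Report only the edges that pass the threshold in both replicates."""
    return (build_edges(rep1_pets, vertices, min_pets)
            & build_edges(rep2_pets, vertices, min_pets))

# Toy data (hypothetical coordinates): three candidate vertices on one chromosome.
vertices = [("chr1", 100, 200), ("chr1", 500, 600), ("chr1", 900, 1000)]
rep1 = [
    (("chr1", 150), ("chr1", 550)),
    (("chr1", 160), ("chr1", 560)),
    (("chr1", 150), ("chr1", 950)),  # only one PET for this pair in replicate 1
    (("chr1", 150), ("chr1", 300)),  # second end is outside every vertex
]
rep2 = [
    (("chr1", 110), ("chr1", 510)),
    (("chr1", 190), ("chr1", 590)),
    (("chr1", 120), ("chr1", 910)),
    (("chr1", 130), ("chr1", 920)),
]
edges = reproducible_edges(rep1, rep2, vertices)
```

In the toy data, the pair of vertices 0 and 2 passes the threshold in the second replicate only, so it is not reported. A linear scan over vertices is used for brevity; a real genome-scale implementation would presumably use an interval index instead.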
Both the raw and processed data are shown for the CIG containing MyoD, one of the master regulators of myogenesis (Fig. II-4), since the MyoD locus is representative of a medium-sized, one-gene CIG.
A ChIA-PET edge means that there is evidence of a single physical complex that contains two regions of DNA and the factor for which the ChIP was done. Lack of a ChIA-PET edge suggests that either there is no physical connectivity between the regions, or that connectivity occurs without the presence of the ChIPped factor. A common misconception about ChIA-PET data is that it represents a complete physical connectivity map; it does not (Fig. II-5).
II.2.2: ChIA-PET general characteristics
ChIA-PET connectivity is particularly striking at the myogenic locus containing myogenin (Myog) and myosin binding protein H (Mybph) (Fig. II-6).
At the myoblast timepoint, when both myogenic genes are unexpressed, no connections are recovered. At the myocyte timepoint, however, they connect to each other as well as to many nearby myog+ and myog- DNase-hypersensitive vertices. This locus, with around 60 interconnected vertices, is in fact spectacularly large compared to most other loci in the genome. Most CIGs are small, though large, multi-genic CIGs like that of myogenin number in the hundreds (Fig. II-9).
Most CIGs contain at least one gene, but surprisingly, there are CIGs that have no annotated genes. This does not appear to be an artifact of data stringency (data not shown), so the most likely explanations are that some vertices are unannotated genes (though I used an extensive set of gene models), or that pol2 sometimes comes into contact with regions of the genome that do not have genes.
As for the edges themselves, most are local, and strength is inversely correlated with distance (Fig. II-7). However, there are some long edges over 50 kb, and even a rare few as long as the 1 Mb Shh-to-enhancer interaction. A related property of these local edges is that the ChIA-PET CIGs themselves are relatively localized (Fig. II-9). The elements that the edges connect, gene-vertices and distal-vertices, tend to be wider than unconnected candidate vertices, and gene-vertices are also wider than distal-vertices (data not shown). This is due to the merging algorithm used to create candidate vertices: some regions of the genome, particularly the bodies of active genes, have multiple DNase regions blanketing a small area. I have standardized edge weights to account for the differing vertex widths (and therefore edge capture likelihood) by normalizing on the basis of the connected vertex widths.
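To make the idea of width normalization concrete, the sketch below shows one possible scheme. The exact formula used for the standardized edge weights is not given in this section, so the choice here (PETs per kilobase of combined vertex width) and the function name `normalized_edge_weight` are assumptions made purely for illustration.

```python
def normalized_edge_weight(pet_count, width_a, width_b):
    """PETs per kilobase of combined vertex width.

    Assumed formula for illustration only; the normalization actually
    applied to the edge weights is not specified in the text.
    """
    combined_kb = (width_a + width_b) / 1000.0
    if combined_kb <= 0:
        raise ValueError("vertex widths must be positive")
    return pet_count / combined_kb

# Same raw PET count, different vertex widths (hypothetical values, in bp).
wide_pair = normalized_edge_weight(6, 2000, 1000)
narrow_pair = normalized_edge_weight(6, 500, 500)
```

Under this scheme, an edge between two narrow vertices scores higher than an edge with the same raw PET count between two wide vertices, which reflects the reasoning that wider vertices are more likely to capture a PET by chance.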
All of these properties are true for the myogenin ChIA-PET as well. However, there is one notable difference between myogenin and pol2 ChIA-PET edge strengths.
Pol2 edges are strongest when they involve genes (Fig. II-8, top), whereas myogenin edges are strongest when they involve non-genic elements (Fig. II-8, bottom). Though there is little relationship between ChIA-PET signal and the antecedent ChIP signal (data not shown), this likely means that ChIA-PET signal strength is partially influenced by factor occupancy: pol2 at genes and myog at enhancers.
Since ChIA-PET is an assay done in bulk on a large cell population, there is a major question to ask: when a vertex has connections to multiple other vertices, are the interactions simultaneous or sequential? Is there any evidence for the promoter factory hypothesis (Osborne, Chakalova et al. 2004) and if so, is this the exception or the rule? I chose to use the graph-theoretical concept of the clique as a way of determining the likelihood of having simultaneous interactions. A clique is a set of vertices where every vertex is connected to every other vertex (Fig. II-10A, middle; Fig. II-10B, purple). If there are simultaneous interactions captured by ChIA-PET, they would show up as cliques, though not all cliques need be simultaneous interactions (Fig. II-11A). However, because cliques and non-cliques are equally susceptible to the rigorous data treatment, it is not their absolute number but the ratio between their numbers that will tell us which type of interaction is most common. This ratio is 8% to 92% (cliques to non-cliques) regardless of the data set and data treatment (Fig. II-11B; some analyses not shown). There are indeed cliques in the ChIA-PET data, including a clique of the classic MRF myogenin connected to two other upregulated genes (Mybph and Ppfia4) and a few distal elements (Fig. II-10A, left).
However, there are surprisingly few cliques genome-wide, only a few hundred overall (Fig. II-10A, right). In fact, it appears to be a general principle of these data that there is a very narrow range of observed connectivity: most CIGs have about one extra edge per three vertices above the absolute minimum level of connectivity (Fig. II-10B, red). Taken all together, the most likely explanation for these phenomena is that most multiple interactions in the nucleus are sequential rather than simultaneous, and that instances such as the promoter factory are the exception rather than the rule.
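The two graph measures used in this argument can be illustrated with a short sketch. It assumes that the "absolute minimum level of connectivity" for a connected graph of n vertices means the n - 1 edges of a spanning tree. The function names and the toy graph are hypothetical and do not represent any locus in the data.

```python
from itertools import combinations

def is_clique(vertices, edges):
    """True if every pair of the given vertices is joined by an edge."""
    return all(frozenset(pair) in edges for pair in combinations(vertices, 2))

def extra_edges_per_vertex(n_vertices, n_edges):
    """Edges beyond the n - 1 needed to connect n vertices, per vertex.

    Assumes the minimum connectivity of a connected graph is a spanning tree.
    """
    minimum = n_vertices - 1
    return (n_edges - minimum) / n_vertices

# Toy graph (hypothetical labels): A, B, C form a triangle; D hangs off A.
toy_edges = {frozenset(p) for p in [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D")]}

triangle_is_clique = is_clique(["A", "B", "C"], toy_edges)
with_d_is_clique = is_clique(["A", "B", "D"], toy_edges)

# A graph with 6 vertices and 7 edges has 2 edges above the 5-edge minimum,
# i.e. one extra edge per three vertices.
excess = extra_edges_per_vertex(6, 7)
```

A fully connected graph of n vertices has n(n - 1)/2 edges, so an excess of roughly one edge per three vertices sits far closer to the tree-like minimum than to the clique maximum for all but the smallest graphs.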