• Tidak ada hasil yang ditemukan

Data availability

3.5.9 TFBS motif sequence specificity

We evaluated the sequence specificity of JASPAR core vertebrate non-redundant sequences with significant ChIP-seq TFBS enrichment in core or derived HepG2 or K562 enhancers. Specifically, we calculate the Kullback-Leibler divergence of the motif from genomic background nucleotide frequencies for A/T (0.3) and GC(0.2), similar to the previously described procedure (Li and Wunderlich (2017)). For all ChIP-seq TFBS motifs (regardless of significant enrichment), we assigned these motifs to core or derived regions if they were more often enriched in core over derived sequences and vice versa.

3.5.10 eQTL enrichment

The enrichment for GTEx eQTL from 46 tissues (last downloaded July 23rd, 2019) in core and derived enhancer sequences was tested against matched background sets. In this analysis we considered 500 matched sets. Median fold-change was calculated as the number of eQTLs overlapping enhancer sequence components (i.e., core or derived) compared with the appropriate random sets. Confidence intervals (CI = 95%) were generated by 10,000 bootstraps. P-values were corrected for multiple hypothesis testing by controlling the false discovery rate (FDR) at 5% using the Benjamini-Hochberg procedure.

3.6 Data availability

• Hg19 46-way vertebrate species multiz alignment - https://hgdownload.soe.ucsc.edu/gbdb/hg19/multiz46way/

• Hg38 100-way vertebrate species multiz alignment - https://hgdownload.soe.ucsc.edu/gbdb/hg38/multiz100way/

• LINSIGHT (Huang et al. (2017)) - http://compgen.cshl.edu/LINSIGHT/LINSIGHT.bw Source code is freely available at:

https://github.com/slifong08/enh ages

Supplemental: Function and constraint in enhancer sequences with multiple evolutionary origins

September 5, 2022

List of Figures

S1 Number of derived regions per complex enhancer . . . 2 S2 Sequence lengths of derived, core, and simple transcribed enhancers . . . 3 S3 Lengths of derived, core, and simple enhancers versus expectation, stratified by core age . . 4 S4 Sequence identity of core and derived regions is consistent with sequence identity of shuffled

core and derived sequences . . . 5 S5 Frequency and count of core-derived sequence age pairs in FANTOM . . . 6 S6 Core and derived evolutionary features in complex HepG2 cCREs recapitulate evolutionary

features in FANTOM eRNAs . . . 7 S7 Regions with no TFBS ChIP-seq binding are observed across ages . . . 8 S8 Transcription factor binding site density is similar across ages in HepG2 cCREs . . . 9 S9 Core and derived evolutionary features in complex K562 cCREs recapitulate evolutionary

features in FANTOM eRNAs . . . 10 S10 Derived regions have high transcription factor binding site densities and bind different tran-

scription factors compared to core regions in K562 cells . . . 11 S11 High TFBS density in core regions correlates with high TFBS density in derived regions within

the same HepG2 enhancer sequence . . . 12 S12 High TFBS density in core regions correlates with high TFBS density in derived regions within

the same K562 enhancer sequence . . . 13 S13 Information content of TFBS motifs in derived sequences is comparable with motifs in core

sequences in HepG2 and K562 enhancers . . . 14 S14 MPRA activity is similar across sequence ages and simple, core, or derived contexts . . . 15 S15 Derived regions experienced weaker purifying selection than cores and simple enhancers of

the same age. . . 16 S16 Derived regions experienced weaker purifying selection than cores and simple enhancers

with the same age as their corresponding core region. . . 17 S17 Derived regions have higher minor allele frequencies than core regions across human popu-

lations. . . 18 S18 Both complex enhancers with three or more sequence ages and simple enhancers have less

purifying selection pressures at sequence edges across ages. . . 19 S19 Derived regions have higher SNP densities than adjacent core regions . . . 20 S20 ChIP-seq TFBS binding frequency in core and derived regions of HepG2 and K562 complex

enhancers from ENCODE . . . 21 S21 GC density in FANTOM enhancer and promoter regions . . . 21

1 120

num derived regions per enh 0.0

0.2 0.4 0.6 0.8 1.0

Proportion

2 4 6 8

FANTOM eRNA

num derived regions per enh 0.0

0.2 0.4 0.6 0.8 1.0

Proportion

0 5 10 15 20

HepG2 cCRE

num derived regions per enh 0.0

0.2 0.4 0.6 0.8 1.0

Proportion

0 5 10 15 20

K562 cCRE

Figure S1: Number of derived regions per complex enhancer

Most complex enhancers have one derived region. Cumulative distribution plots show the number of derived regions as proportion of the total complex enhancer sequences for FANTOM5 eRNA (left, N = 10851), HepG2 cCREs from ENCODE (middle, N = 27289) and K562 cCREs from ENCODE (right, N = 24415). Complex enhancers have a median of one derived region across datasets.

WMQTPI

GSQTPI\ GSVI

GSQTPI\ HIVMZIH

W]RXIRMGPIRKXL

*%2831 7,9**0)

Figure S2: Sequence lengths of derived, core, and simple transcribed enhancers

Derived regions are also shorter than expectedfrom 100 sets of length-, chromosome-, and architecture- matched random non-coding regions (left; median 136 bp derived v. 157 bp shuffled, p = 1.4e-46). Core sequences in complex enhancers are longer than 100x non-coding, chromosome-matched shuffled back- ground cores (right; median 173 bp core v. 143 bp shuffle core, p = 2.4e-75). Sample size is annotated for each bar

3 122

LSQSTVMQ

IYEV

FSVI

IYXLXLIV

QEQEQRM

XIXVZIVX

W]RXIRMGPIRKXL

WMQTPI*%2831 WMQTPI7,9**0) GSQTPI\CGSVI*%2831 GSQTPI\CGSVI7,9**0) GSQTPI\CHIVMZIH*%2831 GSQTPI\CHIVMZIH7,9**0)

Figure S3: Lengths of derived, core, and simple enhancers versus expectation, stratified by core age

Derived, core and simple sequence lengths stratified by core age (x-axis) and compared with 100x shuffled sequences matched on core sequence age and architecture. Derived sequences are shorter than expected at every age except those with Vertebrate cores. Core sequences from the Eutherian ancestor and older are longer than expected.

Vertebrata (615)Tetrapoda (352)Amniota (312) Mammalia (177)

Theria (159) Eutheria (105)

Boreoeutheria (96) Euarchontoglires (90)

Primate (74)

sequence age 0.0

0.2 0.4 0.6 0.8 1.0

percent identity 80 676 65 428 152 1083 412 1780 485 3255 2014 36625 351 10094 44 1404 17 7526

core

Tetrapoda (352)Amniota (312) Mammalia (177)

Theria (159) Eutheria (105)

Boreoeutheria (96) Euarchontoglires (90)

Primate (74)

sequence age 0.0

0.2 0.4 0.6 0.8 1.0

percent identity 36 442 69 853 209 1199 534 2823 3421 23278 1590 26681 317 7884 1631 57952

derived

erna shuf

Figure S4: Sequence identity of core and derived regions is consistent with sequence identity of shuffled core and derived sequences

Core and derived sequence identities to their most distant detectable homolog are not significantly dif- ferent from expected. Core (left) and derived (right) FANTOM sequence identity was quantified as the number of nucleotide mismatches between hg19 and the most distant aligned species (Methods). Stratified by sequence age (x-axis) and compared with their expected sequence identities based on 100x shuffled sequences matched on sequence age and architecture, derived and core sequences do not show signif- icantly different sequence identities (Welch’s p-value ¿0.05). Therian and Eutherian cores have slightly lower sequence identity compared with the expectation (median 0.51 Therian core v. 0.52 expected The- rian core and 0.64 Eutherian core v. 0.65 expected Eutherian core, Welch’s p-value ¡ 0.05). Moreover, the sequence identities are well above the range at which detecting homology becomes challenging for all branches except Amniota, which only contains a very small number of derived regions on it (69) or adjacent branches (36 and 209). These results do not show any evidence of systematic mis-classification of the age of enhancer segments due to varying rates of sequence divergence. The number of elements in each cat- egory is annotated below the boxplot. Boxes show the median and interquartile range of sequence identity values. Whiskers reflect 1.5x the interquartile range.

5 124

Vert Tetr Amni Mam Ther Euth Bore Euar Prim Homo Core age

Vert Tetr Amni Mam Ther Euth Bore Euar Prim Homo

Derived age

0 0 0 0 0 0 0 0 0 0

193 0 0 0 0 0 0 0 0 0

171 195 0 0 0 0 0 0 0 0 226 160 455 0 0 0 0 0 0 0 190 114 4431144 0 0 0 0 0 0 486 201 44814392273 0 0 0 0 0 107 42 48 151 1282129 0 0 0 0 25 7 14 19 33 496 138 0 0 0 66 21 18 53 61 1149 478 100 0 0 4 0 0 2 1 11 5 2 24 0

N core-derived age pairing

0.25 0.5 1 2 4

OR fantom v. shuffle (log2-scaled)

Vert Tetr Amni Mam Ther Euth Bore Euar Prim Homo

Core age Vert

Tetr Amni Mam Ther Euth Bore Euar Prim Homo

Derived age

1 0 0 0 0 0 0 0 0 0

0.2 1 0 0 0 0 0 0 0 0

0.18 0.4 1 0 0 0 0 0 0 0

0.240.330.46 1 0 0 0 0 0 0 0.2 0.230.450.53 1 0 0 0 0 0 0.510.410.460.670.98 1 0 0 0 0 0.11 0.090.050.070.060.69 1 0 0 0 0.030.010.010.010.010.160.24 1 0 0 0.070.040.020.020.030.370.840.98 1 0 0 0 0 0 0 0 0.010.020.09 0

FANTOM % core-derived age pair

0.0 0.2 0.4 0.6 0.8 1.0

% core-derived age pair per core age

Figure S5: Frequency and count of core-derived sequence age pairs in FANTOM

Frequency (left) and count (right) of core-derived age pairs across complex FANTOM enhancers. Shad- ing in the frequency plot (left) reflects the percentage of age-pairs within a single core age. Cores may have more than one derived sequence of a different age, thus the sum of the columns can be greater than one.

Shading in the count plot (right) reflects the enrichment of the core-derived age pair compared with shuffled expectation shown in Figure 3.

simple core derived 0

50 100 150 200 250

syntenic length

HepG2 cCRE SHUFFLE

Vert Sarg Tetr Amni Mam Ther Euth Bore Euar Prim Homo

core age Vert

Sarg Tetr Amni Mam Ther Euth Bore Euar Prim Homo

derived age

1.4 1.4 3.5 2.8 3.3 1.9 1.4 1.2

0.8 1.3 1.2

0.5 0.6 0.6 0.9 5.8 0.2 0.4 0.3 0.5 0.6 0.5 1.4 0.3 0.4 0.3 0.4 0.5 0.4 0.8 0.1 0.2 0.2 0.2 0.4 0.3 0.6

0 0.2 0.1 0.4 0.2 0.3

HepG2 cCRE core-derived age pair

2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0

OR (log2-scaled) HomoPrimEuarBoreEuthTherMamAmniTetrSargVert

Sequence age 5

4 3 2 1 0 1 2

Fold-Change v. Bkgd (log2-scaled) 0714480470503729151811211

1429

689

11437 HomoPrimEuarBoreEuthTherMamAmniTetrSargVert

Sequence age 4

3 2 1 0 1 2 3

Fold-Change v. Bkgd (log2-scaled) 28341924637543132306740341795242116

15680

Core Derived

Figure S6: Core and derived evolutionary features in complex HepG2 cCREs recapitulate evolutionary features in FANTOM eRNAs

Derived regions constitute a sizeable portion of complex HepG2 cCREs (N = 27,789 cCREs), are shorter (top left) and older (top right) than expected compared to shuffled complex enhancer architectures (N = 1,047,557). Core sequences from the Mammalian ancestor and older are enriched for derived sequences from the Therian ancestor and older compared with shuffled expectation of core-derived age pairs. These core sequences are also depleted of sequences younger than the Therian ancestor. Core sequences are enriched for the nearest, younger phylogenetic neighbor. Odds ratio of significantly enriched age-pairs (FDR<0.05) are annotated.

7 126

Homo Prim Euar Bore Euth Ther Mam Amni Tetr Sarg Vert sequence age

0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35

fraction of arch w zero TFBS 0 2136 3 31 531 76 20 394 21 2 2010 24 43 189 1161 412 141 1301 227 72 163325 855 570 1650 2288 1027 473 1746 408 213 0

K562

simple complex_core complex_derived

Homo Prim Euar Bore Euth Ther Mam Amni Tetr Sarg Vert

sequence age 0.0

0.1 0.2 0.3 0.4 0.5

fraction of arch w zero TFBS 0 2152 11 53 1375 206 57 897 35 16 4440 16 53 222 1591 971 370 2829 388 169 279325 1233 737 2534 4664 2179 938 2952 643 370 0

HepG2

simple complex_core complex_derived

Figure S7: Regions with no TFBS ChIP-seq binding are observed across ages

Similar proportions of derived, core, and simple enhancer sequences have no evidence of TFBS binding within sequence ages in HepG2 and K562 cCREs. K562 cell models generally have fewer elements that do not overlap TFBS, likely because more TFBS ChIP-seq assays have been performed in K562 cells compared with HepG2 cells (249 v. 119 assays, respectively). Enhancer regions are binned according to their syntenic sequence ages. Frequency is calculated as the percent of regions that do not overlap TFBS ChIP-seq peaks within each sequence age and region category. HepG2 is shown above and K562 is shown below. Number of regions with zero TFBS overlap is annotated for each bar.

Homo Prim Euar Bore Euth Ther Mam Amni Tetr Sarg Vert core_mrca_2

0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14

tf_density

simple

complex_core complex_derived

Figure S8: Transcription factor binding site density is similar across ages in HepG2 cCREs

Simple, core, and derived sequences are stratified by core age on the x-axis. TFBS density per archi- tecture and age was measured and plotted on the y-axis.

9 128

10495

core derived

simple core derived

0 50 100 150 200 250

syntenic length

K562 cCRE SHUFFLE

Vert Sarg Tetr Amni Mam Ther Euth Bore Euar Prim Homocore age Vert

Sarg Tetr Amni Mam Ther Euth Bore Euar Prim Homo

derived age

1.4

3.7 2.4 3.3 2.2 1.3 1.1

0.8 1.2

0.5 0.6 0.6 3.8 0.3 0.3 0.4 0.6 0.5 0.5 1.3 0.4 0.2 0.5 0.5 0.6 0.6 0.9 0.1 0.2 0.3 0.4 0.4 0.4 0.8

0 0.2 0 0.4 0.3

K562 cCRE core-derived age pair

2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0

OR (log2-scaled) HomoPrimEuarBoreEuthTherMamAmniTetrSargVert

sequence age 3

2 1 0 1 2

Fold-Change v. Bkgd (log2-scaled) 016196935662526459709504137944411353 538128749073215614000 HomoPrimEuarBoreEuthTherMamAmniTetrSargVert

sequence age 3

2 1 0 1 2 3

Fold-Change v. Bkgd (log2-scaled) 31415226197251

Figure S9: Core and derived evolutionary features in complex K562 cCREs recapitulate evolutionary features in FANTOM eRNAs

Derived regions constitute a sizeable portion of complex K562 cCREs (N = 24,415 cCREs), are shorter (top left) and older (top right) than expected compared to shuffled complex enhancer architectures (N

= 473,387 cCREs). Core sequences from the Amniota ancestor and older are enriched for derived se- quences from the Mammalian ancestor and older compared with shuffled expectation of core-derived age pairs. These core sequences are also depleted of sequences younger than the Mammalian ancestor. Core sequences are enriched for the nearest, younger phylogenetic neighbor. Odds ratio of significantly enriched age-pairs (FDR<0.05) are annotated.

0.5 1.0 2.0

C

BoreoeutheriaEutheriaTheriaMammaliaAmniotaVertebrata

A

SAP30 SUZ12 WHSC1 PCBP1 PTBP1 SIN3A BRD4 ZBTB11 FIP1L1 SP1 UBTF KDM5B ETS1 SMAD1 ZSCAN29 XRCC5 ZNF282 ZBTB7A ZBTB40 ELK1 POLR2G HNRNPK MLLT1

*

*

*

*

*

*

*

*

*

*

*

*

*

*

*

*

*

*

*

*

*

*

*

*

*

* EGR1 *

IRF1 PRDM10

*

* ZNF184

ZMYM3

*

* SMARCA4 *

MAFF * MAFK TAF9B TBL1XR1 BRD9 ZNF316

*

*

*

*

* JUN GATA1 HDAC3 TAL1 SOX6 TRIM24 ARID1B CBFA2T2 NCOR1 TCF12 MTA2 EP300 CBFA2T3 EHMT2 ZNF384 ZBTB5 ZNF407 NCOA2 ZBTB8A

*

*

*

*

*

*

*

*

*

*

*

*

*

*

*

*

*

*

*

*

*

*

RBM14 *

BoreoeutheriaEutheriaTheriaMammaliaAmniotaVertebrata

K562 TFBS Core v. Derived

Derived CoreOdds Ratio (log2-scaled)

Core Derived

0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35

TFBS per bp

K562

*

B

Figure S10: Derived regions have high transcription factor binding site densities and bind different transcription factors compared to core regions in K562 cells

(A) Derived regions (N = 23868) have higher TFBS densities than core regions (N = 20997) (0.074 derived v. 0.062 core TFBS per base pair, Mann Whitney-U p = 3.5e-52). Simple enhancer TFBS density is lower than core and derived regions (0.05 TFBS per base pair)(B)TFBS density is positively correlated between core-derived sequence pairs within complex enhancers with evidence of TF binding in both core and derived regions (N = 14142). Color intensity represents the density of core-derived pairs, and the black line is a linear regression fit (slope=0.39, intercept=0.056, r=0.39, p<2.2e-238, stderr=0.008; outliers (>95th percentile) are not plotted for ease of visualization. (C)Derived and core regions of the same age are enriched for binding of different TFs and enrichment patterns are generally consistent across ages.

TFBS enrichment for each age was tested using Fisher’s exact test; only TFs with at least one significant enrichment (FDR < 0.1) are shown. Vertebrate, Sarcopterygii, and Tetrapod enhancer ancestors were grouped into “Vertebrata”. Boreotherian, Euarchontoglires, and Primate enhancer ancestors were grouped into “Boreoeutheria”.

11 130

C1 D2 D1

Summarized

Syntenic block HepG2 Actual

C1 D1 + D2

C1 D2

D1 C1

Including zeros Excluding zeros

slope=0.234, intercept=0.040, rvalue=0.238, pvalue=5.113e-140, stderr=0.009

slope=0.156, intercept=0.050, rvalue=0.177, pvalue=2.228e-84, stderr=0.008 slope=0.274,

intercept=0.026, rvalue=0.273, pvalue=0.0, stderr=0.007

slope=0.196, intercept=0.033, rvalue=0.210, pvalue=1.732e-218, stderr=0.007

Figure S11: High TFBS density in core regions correlates with high TFBS density in derived regions within the same HepG2 enhancer sequence

We evaluated TFBS density correlations between core and derived sequences of the same enhancer in HepG2 cCREs. TFBS density of core and matched-derived regions per enhancer are plotted on the X- and Y-axis, respectively. Actual enhancers can have more than one core or derived region, so we evaluated our data using two different approaches. In the first (upper) we summarized TFBS density across multiple core and derived regions by summing TFBS density and syntenic length into core and derived groups and quantifying TFBS density in summarized core and derived regions per enhancer sequence. In the second (lower), we quantified TFBS density for every core and derived syntenic region and compared all possible pairs of core and derived syntenic TFBS densities per enhancer. We applied two different thresholds for evaluating TFBS density in core versus matched derived regions; one threshold allowed for regions with no evidence of TFBS in core or derived sequence, but not both (left, “zeros included”), while the other threshold required that TFBS binding was detected in both core and derived sequences within an enhancer (right,

“zeros excluded”). Linear regression models were fit for each dataset and model features are annotated for each analysis. Histograms (right and above) display distributions of core and derived TFBS density per analysis.

C1 D2 D1

Summarized

Syntenic block K562 Actual

C1 D1 + D2

C1 D2

D1 C1

Including zeros Excluding zeros

slope=0.431, intercept=0.040, rvalue=0.430, pvalue=0.0, stderr=0.007

slope=0.393, intercept=0.056, rvalue=0.386, pvalue=0.0, stderr=0.008

slope=0.344, intercept=0.052, rvalue=0.364, pvalue=0.0,

stderr=0.006

slope=0.293, intercept=0.074, rvalue=0.315, pvalue=0.0, stderr=0.007

Figure S12: High TFBS density in core regions correlates with high TFBS density in derived regions within the same K562 enhancer sequence

We evaluated TFBS density correlations between core and derived sequences of the same enhancer in K562 cCREs. TFBS density of core and matched-derived regions per enhancer are plotted on the X- and Y-axis, respectively. Actual enhancers can have more than one core or derived region, so we evaluated our data using two different approaches. In the first (upper) we summarized TFBS density across multiple core and derived regions by summing TFBS density and syntenic length into core and derived groups and quantifying TFBS density in summarized core and derived regions per enhancer sequence. In the second (lower), we quantified TFBS density for every core and derived syntenic region and compared all possible pairs of core and derived syntenic TFBS densities per enhancer. We applied two different thresholds for evaluating TFBS density in core versus matched derived regions; one threshold allowed for regions with no evidence of TFBS in core or derived sequence, but not both (left, “zeros included”), while the other threshold required that TFBS binding was detected in both core and derived sequences within an enhancer (right,

“zeros excluded”). Linear regression models were fit for each dataset and model features are annotated for each analysis. Histograms (right and above) display distributions of core and derived TFBS density per analysis.

13 132

core der arch 10

12 14 16 18

IC

9

7 HepG2 TF motifs significantly enriched ChIP-seq

core der

arch 10

12 14 16 18 20 22 24

IC

16 16

HepG2 TF motifs all ChIP-seq

core der

arch 12

14 16 18 20 22

IC

4 7

K562 TF motifs significantly enriched ChIP-seq

core der

arch 7.5

10.0 12.5 15.0 17.5 20.0 22.5 25.0

IC

39 31

K562 TF motifs all ChIP-seq

Figure S13: Information content of TFBS motifs in derived sequences is comparable with motifs in core sequences in HepG2 and K562 enhancers

We quantified the information content (IC) of JASPAR core vertebrate non-redundant TF motifs corre- sponding to significantly enriched HepG2 ChIP-seq signal in core or derived regions. We observed higher IC in derived motifs than in core motifs (upper left panel; median 14.9 derived v. 12.6 core IC, Welch’s test p-value = 0.03). We performed a similar analysis in K562 ChIP-seq datasets and found derived TF motifs have higher information content than cores, but this was not significant (lower left panel; median 15.6 derived v. 15.2 core IC, Welch’s test p-value = 0.26). Relaxing our criteria, we also evaluated IC for all TF motifs with any enrichment for ChIP-seq binding in core/derived sequences. IC was similar for TF motifs in HepG2 elements (upper right; median 13.8 core and 14.6 derived information content, Welch’s test p-value

= 0.18) and K562 elements (lower right; median 14.6 core and 15.4 derived information content, Welch’s test p-value = 0.35). Together, these data support that derived TF motifs are just as robust to mutations as core motifs.

HomoPrimBoreEuth

TherMamAmni

TetrVert

core sequence age 0.0

0.5 1.0 1.5 2.0 2.5 3.0

Predicted activity per bp 137278216191870715301200822237638 0426745272119411553667211812 035191015091343898426147914

HEPG2

simple complex_core complex_derived

HomoPrimBoreEuth

TherMamAmni

TetrVert

core sequence age 0.0

0.5 1.0 1.5 2.0

Predicted activity per bp 4648901150168921328110932033890 04431006407517431106331205937 0915747163015096405594381130

K562

simple complex_core complex_derived

Figure S14: MPRA activity is similar across sequence ages and simple, core, or derived contexts

MPRA predicted activity per bp from Ernst 2016 is similar across ages in K562 and HepG2 cells. Here, predicted activity per bp scores are stratified by core sequence age and simple, core, or derived category.

Cell line models and N bp are annotated per bar.

15 134

Homo Prim Euar Bore Euth Ther Mam Amni Tetr Vert Core sequence age

0.0 0.2 0.4 0.6 0.8 1.0

linsight_score

LINSIGHT per bp

simple derived complex_core

Figure S15: Derived regions experienced weaker purifying selection than cores and simple enhancers of the same age.

Stratified by sequence age, derived regions of complex FANTOM enhancers have lower LINSIGHT scores than core regions of the same age for all sequences older than the Eutherian branch. Number of measurements is annotated per bar.

Homo Prim Euar Bore Euth Ther Mam Amni Tetr Vert Core sequence age

0.0 0.2 0.4 0.6 0.8 1.0

linsight_score

LINSIGHT per bp

simple derived complex_core

Figure S16: Derived regions experienced weaker purifying selection than cores and simple enhancers with the same age as their corresponding core region.

Stratified by core sequence age, derived regions of FANTOM enhancers have lower LINSIGHT scores than adjacent core regions for all core regions older than Boreotherian. Number of measurements is anno- tated per bar.

17 136

simple core derived 0.00

0.02 0.04 0.06 0.08

maf 71415 27691 26451 Homo Prim Euar Bore Euth Ther Mam Amni Tetr Vert0.00

0.02 0.04 0.06 0.08

355 4102 469 4260 49688 4714 3165 1386 428 28480 808 197 1197 8639 5726 5638 2237 1049 22000 1875 474 2095 7193 4878 4679 2056 1242 1959

1000G minor allele frequency

Core Sequence Age

*

Figure S17: Derived regions have higher minor allele frequencies than core regions across human populations.

Global minor allele frequencies from 1000 Genomes were intersected with FANTOM enhancer com- ponents. Singletons were removed. Derived region minor allele frequencies are slightly higher than core region minor allele frequencies (right, mean 0.053 derived v. 0.048 core, derived v. core p = 4.9e-12). Minor allele frequencies stratified by core age and architecture show that derived regions have consistently higher minor allele frequencies compared to core regions at every ancestral origin except Boreotherian. Number of SNPs is annotated per bar.