Aim 3: To conduct a gene-environment interaction study assessing potential interaction of predicted expression levels of estrogen related genes with adiposity [body mass index (BMI), waist
1. Methods
1.1. Statistical Analyses
We used the most updated transcriptome and SNP data from the whole-genome sequencing (WGS) (version 7) from the GTEx(97) to build expression prediction models for estrogen related genes. Models for all genes were built in breast, liver, ovaries, subcutaneous adipose, visceral adipose and cross tissues.
In GTEx, whole genome sequencing (WGS) was performed by the Broad Institute’s Genomics Platform on DNA samples from 652 GTEx donors at an average coverage of 30X, including 68 sequenced on Illumina HiSeq 2000 using 101-bp paired-end reads, and 584 samples on Illumina HiSeq X using 151-bp paired-end reads, all using blood sample DNA. Genotype data were processed according to the GTEx protocol (https://www.gtexportal.org/home/documentationPage). In brief, BAMs were processed using Picard pipeline. With BWA-MEM, the reads were aligned to the human reference genome build hg19/GRCh37. GATK’s HaplotypeCaller v3.4 was further used for joint variant calling across all subjects. Genotype posterior probabilities were then calculated for all calls based on allele frequency in 1000 Genomes Project Phase 3 version 1, and genotype quality (GQ) scores were updated using GATK’s CalculateGenotypePosteriors. Regarding subject QC, replicates, Klinefelter subjects, a chromosome 17 trisomy subject, and an individual from a related pair were
17
excluded, resulting 635 subjects remained, including 406 males and 229 females. Variant QC included Variant Quality Score Recalibration (VQSR) filtering; removal of variants in low-complexity regions or with inbreeding coefficient ≤ -0.3; assigning calls with allelic imbalance (AB) < 0.2 or > 0.8 and/or GQ < 20 to missing; excluding variants with genotyping call rate < 85%, with HWE P-value <
1×10-6 (using European samples only and females for chrX), associated with library construction batch or sequencing technology (P < 1×10-8), or with heterozygous haploid genotypes in non-
pseudoautosomal regions of sex chromosomes in males. These yielded overall 10,526,813 variants at MAF ≥ 1%.
GTEx genotype data were downloaded from dbGap in the format of a vcf file, which has gone through the standard quality control (QC) as described above. Only SNPs with minor allele frequency
≥ 5%, biallelic, and those that are not strand ambiguous were retained for use. Genotypes were encoded as a continuous variable ranging from 0 to 2, with the value representing the estimated count of the effect allele.
In GTEx, RNA-seq was performed using the Illumina TruSeq library construction protocol (non- stranded, polyA+ selection), generating 76-bp paired-end reads. Gene-level expression was
quantified as Transcripts Per Kilobase Million (TPM) using RNA-SeQC v1.1.8(98). RNA-seq expression outliers were identified and excluded using a multidimensional extension of the
statistic(99). Genes with a median expression level of 0 TPM across samples were removed. We normalized the TPM values to the average empirical distribution observed across samples. Then we quantile normalized the TPM values to a standard normal distribution for each gene across samples.
The gene expression data were adjusted for covariates including sex, top 3 genotyping principal components (PCs), sequencing platform and probabilistic estimation of expression residuals (PEER) factors.
The GTEx v7 data contains 620 participants with both genotype and gene expression data available from at least one of 48 different tissues. Among them, 34.2% are females. Considering the
18
recognized sex difference for estrogen, in our study we focused on data generated from females. The sample size for each specific tissue that were used in our study is presented in Table 1. Among them, most were with European ancestry.
All expression prediction models were built by using the elastic net method as implemented in the glmnet R package, with alpha=0.5. The genetically regulated expression for each gene was estimated by including variants within the 2 MB flanking region of each gene. Ten-fold cross- validation were used to validate the models internally. Prediction R2 values (the square of the correlation between predicted and observed expression) were generated to estimate the prediction performance of each of the gene prediction models established.
Table 1. Number of sample sizes for gene expression model building from different tissues.
Tissue Sample Size
European women Asian women
Breast 85 3
Liver 47 2
Ovaries 99 2
Visceral adipose 94 1
Subcutaneous adipose 121 2
Cross-tissue (include the above 5 tissues) 179 3
1.2. Model building
Our research group at Vanderbilt has generated whole genome RNA-seq data of adjacent normal breast tissue obtained from 200 subjects diagnosed with benign breast diseases or breast cancer recruited to the Shanghai Breast Cancer Study (all women of East Asian ancestry) using Illumina HiSeq. Of them, 151 subjects also had genetic data available (genotyped using MEGA or
19
GWAS arrays). The detailed information for genetic data QC and imputation is described in the next Chapter. The raw RNA-seq reads were processed using a previously described pipeline19. The sequencing reads were mapped to human genome (hg19) using the Bowtie2 tool20. The mapped reads were then used to determine expression for all coding genes and noncoding RNAs by Cufflinks (version 2.2.1)21. The Gencode (release 27) was used to annotate coding genes and noncoding RNAs22. The fragments per kilobase of transcript per million mapped reads (FPKM) for expression of each gene was then determined. We further performed a series of QC and normalization. Specifically, genes with a median expression level of 0 FPKM across samples were removed, and the FPKM values of each gene were log2 transformed. The quantile normalization was performed to bring the expression profile of each subject to the same scale, and inverse quantile normalization was
performed for each gene to map each set of expression values to standard normal. We adjusted for the top ten principal components derived from genotype data, the top 30 probabilistic estimation of expression residual factors to correct for batch effects and experimental confounders in model building, and status of subject (benign disease or breast cancer patients) for model building using data from all 151 subjects. Furthermore, we built another set of models using data focusing only on the 65 subjects with benign breast diseases, for which we adjust for the top 15 probabilistic estimation of expression residual factors to correct for batch effects and experimental confounders in model building.
1.3. Utility
We also evaluate performance of the models built in the European descents in Asians by comparing the predicted gene expression level for each gene with the expression level measured by RNAseq in the 151 Asian breast tissue sample from 151 subjects using the Spearman’s correlation.
Models showed a performance at R2≥0.01 in GTEx dataset and correlation R > 0.1 in the in Asian
20
were deemed reliable models for predicting expression in Asians and were used in association analyses with breast cancer risk in Asians.
1.4. Power calculation
We performed a simulation analysis estimating the model prediction performance that can be achieved according to the varying sample size of the reference dataset for model building and the cis- heritability (h2) which we aim to capture using gene expression prediction models (R2) (Figure 4).
Based on a simulation analysis, it seems the performance of prediction models does not change very much when the reference dataset achieves a sample size of 100 or higher, in each scenario of the cis-heritability (h2). Our study thus should have reasonable power for gene expression prediction model building for both Caucasian and Asians considering that the available reference datasets for the relevant tissues (except for liver) involve a sample size close in scale to the 100 threshold, if not higher (see Table 1 above).
Figure 4. The performance of gene expression prediction models based on different sample size in each scenario of cis-heritability
21 2. Results
The distributions of the gene expression levels were shown in supplementary figure 1-5 for ovary, liver, breast, subcutaneous adipose and visceral adipose tissues, respectively. Only genes with a median expression greater than 0 were included. Since quite a number of the distribution were skewed, normalization for the gene expression levels were conducted before the prediction model building.
Genes that showed a prediction performance (R2) of at least 0.01 (0.1 correlation between predicted and observed gene expression levels) were presented for models that were built for each tissue among only European descendants or both European and Asian descendants.
Of the ovary tissue prediction models built for 74 genes, 21 showed satisfactory prediction (R2
≥ 0.01) performance. Genes showed the best prediction performance were SULT1C2 (R2 = 0.38), SULT1A2 (R2 = 0.25), and HSD17B14 (R2 = 0.22). The first two SULT genes are in the estrogen metabolism pathway, and HSD17B14 is in the estrogen synthesis pathway. European-only model predicted 5 more genes with an R2 greater than 0.01 compared to model including both Europeans and Asians. (Table 2)
For models that were built for liver tissue, the number of genes that showed good prediction performance were 26 among Europeans. Gene ARSB (R2 = 0.38) in the estrogen metabolism pathway showed the best performance. (Table 3)
Regarding models built in breast tissue, estrogen metabolism pathway genes SULT1C2 (R2 = 0.51) and UGT2B7 (R2 = 0.33) showed the best prediction performance for either exclusive
Europeans or Europeans and Asians. 29 genes showed satisfactory prediction performance. Two additional genes were built for European-only model compared to the mixed-race model. (Table 4)
Genes showed very distinct prediction performance between the two sets of models for subcutaneous adipose. Thirteen extra genes were successfully built in the Europeans. Genes with best prediction performance were also very different between the two models. For European only
22
model, genes showed the best performance were HSD17B11 and SULT1A4, which were in estrogen synthesis- and metabolism- pathways, respectively. For models included Asians, two estrogen-
metabolism pathway genes, SULT1C2 and GSTM1 showed the best prediction performance. (Table 5)
Similar results were observed for the two models build in visceral adipose tissue, with 24 and 23 genes showing a prediction performance greater than 0.01 for the European only and Asian included models, respectively. The top 4 genes showing the best prediction performance included SULT1C2, SULT1A2, UGT2B4 and ARSF, which were all in the estrogen-metabolism pathway.
(Table 6)
For the cross-tissue models, information of the gene expression levels across all 38 tissues collected in GTEx was used. The prediction performances were similarly good between the two sets of models. The top two genes showed the best prediction performance were SULT1C2 and
SULT1A2, which were the same as the visceral adipose tissue model. (Table 7)
Table 2. Prediction models built with an R2 ≥ 0.01 for estrogen-related pathway genes in ovary tissue.
Ovary (only Europeans)
N=99
R2 Ovary
(Europeans and Asians) N=101
R2 SULT1C2 0.377 SULT1C2 0.358 SULT1A2 0.253 HSD17B14 0.214 HSD17B14 0.221 SULT1A2 0.190
ARSA 0.174 CYP11A1 0.189
CYP11A1 0.129 ARSA 0.177
CYP1A1 0.124 SULT1A1 0.154
SULT1A1 0.111 ARSJ 0.120
ARSJ 0.102 GSTM1 0.092
IDS 0.077 ARSD 0.063
CYP1B1 0.077 UGT2B7 0.054
HSD17B1 0.071 IDS 0.052
B3GAT2 0.061 ARSB 0.049
ARSD 0.045 HSD17B1 0.045
CYP3A4 0.045 STS 0.039
23
UGT2B7 0.045 ESR1 0.032
CYP19A1 0.045 CYP1A1 0.028
GNS 0.045 CYP3A4 0.025
GSTM1 0.045 GNS 0.019
ARSB 0.045 ARSE 0.014
ARSG 0.045 B3GAT2 0.012
ARSE 0.045 SHBG 0.011
Table 3. Prediction models built with an R2 ≥ 0.01 for estrogen-related pathway genes in liver tissue.
Liver
(only Europeans) N=47
R2 Liver
(Europeans and Asians) N=49
R2
ARSB 0.383 ARSB 0.559
HSD3B2 0.251 UGT2B17 0.413 UGT2B17 0.242 UGT1A6 0.203 CYP11A1 0.217 UGT2B4 0.168
UGT1A6 0.215 HSD3B2 0.154
ARSE 0.153 UGT2B28 0.131
SULF2 0.140 UGT1A4 0.112
UGT1A4 0.140 SULF2 0.092
AKR1B15 0.134 ARSG 0.079
UGT2B4 0.115 ARSA 0.078
CYP1B1 0.113 ARSJ 0.077
UGT2B28 0.094 ESR2 0.073
B3GAT1 0.086 ARSE 0.069
B3GAT2 0.086 IDS 0.066
SULT4A1 0.085 STAR 0.065
ESR2 0.081 HSD17B1 0.064
STAR 0.077 UGT1A7 0.063
CYP1A2 0.069 CYP1B1 0.059
ARSA 0.064 B3GAT2 0.050
SULT1A1 0.049 SULT4A1 0.042 SULT2B1 0.040 B3GAT3 0.038
GSTA1 0.040 SULT1A1 0.017
SGSH 0.040 CYP1A2 0.016
GSTM1 0.029 SULT1C2 0.011
COMT 0.028