target variable and the interval of time over which a PheCode is experienced. Its feature matrix contains the time in years between the first and last occurrences of each PheCode in a subject's record.
The second and third feature matrices are independent of aggregation type and are created as optional covariates for pyPhewasModel. The ICD age feature matrix contains the maximum age recorded for each PheCode in a subject's record; if the subject has no records of that PheCode, the subject's overall maximum recorded age is reported. The PheWAS covariate matrix allows researchers to use the presence/absence of a specified PheCode as a covariate in the regression. Across all columns, it records a one if the specified PheCode is present in a subject's record or zero if the specified PheCode is absent. All three feature matrices are saved as CSV files in preparation for the pyPhewasModel step.
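The aggregation step described above can be illustrated with a minimal sketch. It is not the pyPheWAS implementation: the record layout, the function name build_feature_matrices, and the example values are hypothetical. It also assumes that binary and count aggregation mean presence and number of occurrences, and that duration is zero when a PheCode is absent or occurs only once.

```python
from collections import defaultdict

# Hypothetical records: (subject_id, phecode, age_at_event_in_years).
records = [
    ("A26", "250.2", 51.0),
    ("A26", "250.2", 53.5),
    ("A26", "401.1", 52.0),
    ("A38", "401.1", 60.25),
]

def build_feature_matrices(records):
    """Return binary, count, duration, and ICD-age matrices as nested dicts."""
    ages = defaultdict(list)   # (subject, phecode) -> ages at each occurrence
    subject_max_age = {}       # subject -> overall maximum recorded age
    phecodes = set()
    for subject, phecode, age in records:
        ages[(subject, phecode)].append(age)
        subject_max_age[subject] = max(age, subject_max_age.get(subject, age))
        phecodes.add(phecode)

    binary, count, duration, icd_age = {}, {}, {}, {}
    for subject in subject_max_age:
        binary[subject], count[subject] = {}, {}
        duration[subject], icd_age[subject] = {}, {}
        for phecode in sorted(phecodes):
            seen = ages.get((subject, phecode), [])
            binary[subject][phecode] = 1 if seen else 0
            count[subject][phecode] = len(seen)
            # Years between first and last occurrence (assumed 0.0 if absent).
            duration[subject][phecode] = (max(seen) - min(seen)) if seen else 0.0
            # Maximum age for this PheCode, else the subject's overall maximum age.
            icd_age[subject][phecode] = max(seen) if seen else subject_max_age[subject]
    return binary, count, duration, icd_age

binary, count, duration, icd_age = build_feature_matrices(records)
```

Each returned dictionary maps a subject to one row of the corresponding matrix, so it could be written out as a CSV file with one column per PheCode.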
3.3.2. pyPhewasModel
The pyPhewasModel function performs the mass logistic regression which is the focal point of PheDAS analyses. It requires the feature matrix files generated by pyPhewasLookup in addition to the group file. For each PheCode, pyPhewasModel computes a univariate logistic regression of the form
\[ \Pr(\mathit{target}) \sim \operatorname{logit}(A_{phe} + \mathit{covariates}) \tag{3} \]
where the target variable and covariates are specified by the user, and $A_{phe}$ is the aggregate measure vector for a particular PheCode $phe$ taken from the aggregate measure matrix.
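Read as a model formula, Equation 3 corresponds to a standard logistic regression with a binary target. One way to write it out explicitly is given below; the coefficient symbols $\beta_0$, $\beta_{phe}$, and $\gamma_j$ and the covariate symbols $c_1, \dots, c_k$ are notation introduced here for illustration only.

\[
\log \frac{\Pr(\mathit{target} = 1)}{1 - \Pr(\mathit{target} = 1)} = \beta_0 + \beta_{phe} A_{phe} + \sum_{j=1}^{k} \gamma_j c_j
\]

Under this reading, $\beta_{phe}$ is the log odds ratio associated with a one-unit increase in the aggregate measure for PheCode $phe$, holding the covariates fixed.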
Figure IV-4 Detailed look at phenotype mapping, aggregation, and regression in pyPhewasLookup.
On the far left, excerpts from input phenotype and group files containing data from subjects A26 and A38 are shown. ICD codes from the phenotype file are mapped to corresponding PheCodes.
These codes are then aggregated via one of three possible methods for each subject; binary, count, and duration aggregations for subject A26 are shown. Finally, the aggregated EMR data is combined with group data (in this case, the target variable Target, and covariates Sex and MaxAgeAtICD), and univariate regressions are computed for each PheCode.
These regressions are only computed on PheCodes for which $A_{phe}$ is non-zero in at least X subjects, where X is a user-defined threshold that defaults to 5. This requirement cuts out PheCodes which lack sufficient statistical power. The model is fit to the data via regularized maximum likelihood optimization.
The Python library statsmodels is used to generate and fit the logit model to the PheCode data [188].
Regression results are again saved in a CSV file for the user to review and visualize. This file reports the log odds ratio, confidence interval, standard error, and uncorrected p-value estimated from $A_{phe}$ for each PheCode $phe$.
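The subject-count threshold that decides which PheCodes are regressed can be sketched as follows. The function name phecodes_to_test, the parameter min_subjects, and the example matrix are hypothetical and do not reflect the pyPheWAS interface; only the default of 5 is taken from the text.

```python
def phecodes_to_test(aggregate, min_subjects=5):
    """Return PheCodes whose aggregate value is non-zero in >= min_subjects subjects."""
    nonzero = {}
    for row in aggregate.values():
        for phecode, value in row.items():
            if value != 0:
                nonzero[phecode] = nonzero.get(phecode, 0) + 1
    return sorted(p for p, n in nonzero.items() if n >= min_subjects)

# Hypothetical count matrix for ten subjects: subject -> {PheCode: count}.
aggregate = {
    f"S{i}": {"250.2": 1 if i < 6 else 0, "401.1": 2 if i < 3 else 0}
    for i in range(10)
}

kept = phecodes_to_test(aggregate)  # default threshold of 5 subjects
```

Here PheCode 250.2 is non-zero in six subjects and is kept, while 401.1 is non-zero in only three and is skipped.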
3.3.3. pyPhewasPlot
Visualization of the PheDAS mass regression is performed by the pyPhewasPlot function. It requires the regression file produced by pyPhewasModel and the user's desired multiple comparisons correction method; both False Discovery Rate (FDR) and Bonferroni are available. From these inputs, it creates three complementary views of the PheDAS analysis using the Python matplotlib library [189]. The first is a Manhattan plot, a classic GWAS plot which compares statistical significance across PheCodes. This view presents PheCodes across the horizontal axis, with negative log10(p-value) along the vertical axis; PheCode markers on the plot are colored and sorted according to 18 general categories (mostly organ systems and disease groups, e.g., "circulatory system" and "mental disorders"), allowing users to distinguish related PheCodes. To enhance legibility, the plot only labels PheCodes which are significant after the chosen multiple comparisons correction is applied.
The second view is a Log Odds plot, which compares effect size across PheCodes. In this plot, the log odds of each PheCode and its confidence interval are plotted on the horizontal axis, with PheCodes plotted along the vertical axis. Similar to the Manhattan plot, PheCode markers are sorted and colored by category; only PheCodes which are significant after multiple comparisons correction are shown.
The final view is a Volcano plot. This view combines the previous two, presenting an overview of the entire experiment. In the Volcano plot, significance, negative log10(p-value), is represented by the vertical axis, and effect size, log odds, is represented by the horizontal axis. All PheCodes in the regression file are included on this plot, with marker color corresponding to each PheCode's level of significance (none, FDR, Bonferroni). To ensure legibility, only PheCodes that are significant after FDR or Bonferroni correction are labeled.
These three views together provide a comprehensive visualization of the PheWAS analysis. The Volcano plot allows the user to see an overview of the entire experiment, with the Manhattan and Log Odds plots then providing a detailed view for closer examination of significant results. The user has the option of either opening the plots in an interactive window or immediately saving them as image files.
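The significance levels used to color and label the plots can be illustrated with a short sketch. It assumes that the FDR correction is the Benjamini-Hochberg step-up procedure and that the significance level is 0.05; neither is stated in the text. The function name significance_levels and the p-values are hypothetical.

```python
import math

def significance_levels(p_values, alpha=0.05):
    """Label each p-value 'bonferroni', 'fdr', or 'none'.

    The FDR cutoff uses the Benjamini-Hochberg procedure (an assumption).
    """
    m = len(p_values)
    bonferroni_cutoff = alpha / m
    # Benjamini-Hochberg: largest sorted p-value p(k) with p(k) <= (k / m) * alpha.
    fdr_cutoff = 0.0
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= rank / m * alpha:
            fdr_cutoff = p
    labels = []
    for p in p_values:
        if p <= bonferroni_cutoff:
            labels.append("bonferroni")
        elif p <= fdr_cutoff:
            labels.append("fdr")
        else:
            labels.append("none")
    return labels

# Hypothetical uncorrected p-values keyed by PheCode.
p_by_phecode = {"250.2": 0.001, "401.1": 0.02, "296.2": 0.03, "530.1": 0.5}
labels = dict(zip(p_by_phecode, significance_levels(list(p_by_phecode.values()))))

# Vertical-axis value used by the Manhattan and Volcano plots.
neg_log10_p = {code: -math.log10(p) for code, p in p_by_phecode.items()}
```

Because any p-value that passes the Bonferroni cutoff also passes the Benjamini-Hochberg cutoff, the three labels form nested levels, which matches the none/FDR/Bonferroni coloring described for the Volcano plot.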
3.3.4. pyPhewasPipeline
pyPhewasPipeline is a streamlined combination of pyPhewasLookup, pyPhewasModel, and pyPhewasPlot created for convenience. Its required inputs are the phenotype file, group file, and the regression type. All intermediate results (feature matrices, regressions) are saved. In addition to the Volcano plot, Manhattan and Log Odds plots are created for both FDR and Bonferroni corrections by default.
Optional arguments allow users to modify every step of the pipeline (adding covariates, specifying significance level, etc.).