
The pyPheWAS package provides several useful data preparation functions so that users do not have to directly manipulate the very large data files often used for PheDAS studies.

3.2.1. Defining case and control groups

The first step in a PheDAS study is defining which subjects are cases and which are controls. In the absence of externally defined group assignments (such as genetic markers [172] or white blood cell count [178]), ICD codes themselves may be used as a proxy for diagnosis [168], [185] (although sources of error for this are well known [49]). The ICD-9 code 758.0 – Down’s syndrome, for example, may be used as a proxy for the actual clinical diagnosis of Down Syndrome. Due to the noisy nature of EMR, however, a minimum frequency threshold is applied to codes used for this proxy diagnosis based on the notion that the more frequently a subject is assigned a certain ICD code, the more likely it is that they legitimately have the target condition.

To address this need, the createPhenotypeFile function sorts subjects into case and control groups based on the presence or absence of ICD codes in subjects’ records. At a minimum, createPhenotypeFile requires a phenotype file, a list of ICD-9 and ICD-10 codes that define the case group, and the minimum frequency of those codes in a subject’s record to be considered part of the case group. Users may specify whether this frequency threshold is a daily threshold (code frequency is calculated based on the number of unique days over which a code is recorded; ignores multiple records of a code within a single day) or an absolute threshold (code frequency is calculated based on the absolute number of code events; includes multiple records of a code within a single day). All subjects listed in the phenotype file who have at least the minimum frequency of provided codes in their record are assigned to the case group (target=1). Subjects who have the provided codes in their record but fall below the specified frequency are considered ambiguous and, consequently, excluded. All remaining subjects are assigned to the control group (target=0).

These group assignments are saved to a comma-separated values (CSV) file containing A) only subject IDs and target variable assignments, or B) the target variable assignment added to an existing group file specified by the user.
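The assignment rule can be illustrated with a minimal sketch. This is not the createPhenotypeFile implementation; the function name `assign_groups`, the `(subject_id, icd_code, event_day)` event layout, and the returned dictionary are assumptions made only for illustration.

```python
from collections import defaultdict

def assign_groups(events, subjects, case_codes, min_freq, daily=True):
    """Hypothetical helper: map each non-ambiguous subject to target 1 or 0.

    events   -- iterable of (subject_id, icd_code, event_day) tuples (assumed layout)
    subjects -- all subject IDs present in the phenotype file
    daily    -- True: count unique days; False: count every code event
    """
    case_codes = set(case_codes)
    unique_days = defaultdict(set)   # subject -> days on which a case code appears
    event_counts = defaultdict(int)  # subject -> absolute number of case-code events
    for subject_id, code, day in events:
        if code in case_codes:
            unique_days[subject_id].add(day)
            event_counts[subject_id] += 1

    targets = {}
    for subject_id in subjects:
        if daily:
            freq = len(unique_days.get(subject_id, ()))
        else:
            freq = event_counts.get(subject_id, 0)
        if freq >= min_freq:
            targets[subject_id] = 1   # case
        elif freq == 0:
            targets[subject_id] = 0   # control: no case codes at all
        # otherwise ambiguous (codes present but below threshold): excluded
    return targets

events = [
    ("A", "758.0", 1), ("A", "758.0", 1), ("A", "758.0", 5),
    ("B", "758.0", 2),
    ("C", "250.0", 3),
]
daily_groups = assign_groups(events, ["A", "B", "C"], ["758.0"], min_freq=2, daily=True)
```

In this toy cohort, subject A has the code on two unique days and becomes a case, subject B has it on one day and is dropped as ambiguous, and subject C never has it and becomes a control.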

In the basic configuration described above, the control group consists of all non-case and non-ambiguous subjects. In some experiments, however, it may be desirable to enforce stricter control group inclusion criteria; createPhenotypeFile provides two commonly used practices for narrowing the scope of PheDAS control groups. The first method excludes subjects from the control group based on both the provided case codes and codes related to those case codes; this prevents the control group from becoming contaminated by conditions similar to the target condition. The list of related codes may be supplied by the user or pulled from the ICD phenotype map (see the pyPhewasLookup section for details on the ICD phenotype map used by pyPheWAS). The second method allows users to target a specific condition for the control group. For example, a PheDAS could be performed comparing Alzheimer’s disease patients (cases) to Vascular Dementia patients (controls). In this case, the user would supply createPhenotypeFile with lists of ICD-9 and ICD-10 codes for both the case group and the control group. The control group is then composed of subjects not in the case group that have at least the minimum frequency of provided control group codes in their record. Optionally, a second argument may be provided to the code frequency input; if this is specified, the second frequency value is applied to the control group.
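The second method (a targeted control group) might be sketched as follows. The function `assign_targeted_groups` and its arguments are hypothetical, absolute event counts are used for simplicity, and falling back to the case threshold when no second frequency is given is an assumption rather than documented pyPheWAS behavior.

```python
from collections import Counter

def assign_targeted_groups(events, case_codes, control_codes, case_min, control_min=None):
    """Hypothetical helper: cases and controls are both defined by code lists.

    events -- iterable of (subject_id, icd_code) pairs (assumed layout)
    """
    if control_min is None:
        control_min = case_min  # assumption: reuse the case threshold by default
    case_codes, control_codes = set(case_codes), set(control_codes)
    case_counts, control_counts = Counter(), Counter()
    for subject_id, code in events:
        if code in case_codes:
            case_counts[subject_id] += 1
        if code in control_codes:
            control_counts[subject_id] += 1

    targets = {}
    for subject_id in set(case_counts) | set(control_counts):
        if case_counts[subject_id] >= case_min:
            targets[subject_id] = 1   # case
        elif control_counts[subject_id] >= control_min:
            targets[subject_id] = 0   # control: not a case, enough control codes
    return targets

events = [
    ("A", "331.0"), ("A", "331.0"),
    ("B", "290.40"), ("B", "290.40"), ("B", "290.40"),
    ("C", "290.40"),
    ("D", "250.00"),
]
targeted = assign_targeted_groups(events, {"331.0"}, {"290.40"}, case_min=2, control_min=3)
```

Subjects who meet neither threshold (C and D here) do not appear in the output at all.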

3.2.2. Converting dates to ages

EMR event data are usually tagged with dates. In certain cases, a researcher may choose to study EMR records only within a specific period of time, or they may want to use age as a covariate. For convenience, the convertEventToAge script allows users to quickly convert the dates associated with CPT and ICD events to subject ages at those events. This script requires the phenotype file for which event dates are to be converted and a corresponding group file that contains each subject’s date of birth. Optionally, the user may specify the level of precision with which ages are saved in the output phenotype file.
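The conversion itself amounts to a date difference expressed in years. The sketch below is illustrative only: `age_at_event` is a hypothetical name, and dividing by 365.25 days per year is an assumed convention, not necessarily the one convertEventToAge uses.

```python
from datetime import date

def age_at_event(date_of_birth, event_date, precision=2):
    """Hypothetical helper: subject age in years at an event, rounded to `precision` decimals."""
    elapsed_days = (event_date - date_of_birth).days
    return round(elapsed_days / 365.25, precision)  # 365.25 days/year is an assumption

example_age = age_at_event(date(2000, 1, 1), date(2010, 1, 1))
```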

3.2.3. Censoring event data

A common aim of medical studies is to examine specific periods of time in patients’ lives. For example, one may be interested in the EMR signature for the five years leading up to an Alzheimer’s Disease diagnosis or for children ages 10 to 18 who have Autism/Autism Spectrum Disorder. Data censoring such as this is incorporated into the pyPheWAS toolkit with the censorData function. Similar to other tools, this function requires a phenotype file containing the events to be censored and a group file containing subject information, along with user-specified censoring start and/or end years. Censoring can be applied to the data in two distinct ways. The first method censors the absolute value of event ages (e.g. the age at CPT or ICD code events) to only those that fall within the user-defined start and end years, such that all preserved events fulfill the equation

$$\text{start} \le \text{eventAge} \le \text{end} \tag{1}$$

The second method instead censors event ages relative to an external event, such as subject age at diagnosis or surgery. In this case, the interval between the events is considered such that all preserved events fulfill the equation

$$\text{start} \le (\text{externalEventAge} - \text{eventAge}) \le \text{end} \tag{2}$$

The censored events are saved to a new phenotype file, and all subjects with event data remaining after censoring are written to a new group file.
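Both censoring modes reduce to a filter on event ages, sketched below under stated assumptions. The function `censor_events`, the `(subject_id, event_age)` layout, and the choice to drop subjects lacking an external event age are illustrative assumptions and do not describe censorData's actual interface.

```python
def censor_events(events, start=None, end=None, external_age=None):
    """Hypothetical helper: keep events whose age satisfies Eq. (1) or Eq. (2).

    events       -- list of (subject_id, event_age) pairs (assumed layout)
    start, end   -- censoring bounds in years; None leaves that side open
    external_age -- optional dict subject_id -> age at the external event;
                    when given, the relative rule of Eq. (2) is applied
    """
    kept = []
    for subject_id, event_age in events:
        if external_age is None:
            value = event_age                             # Eq. (1): absolute age
        elif subject_id in external_age:
            value = external_age[subject_id] - event_age  # Eq. (2): interval before event
        else:
            continue  # assumption: no external event recorded, so drop the subject
        if start is not None and value < start:
            continue
        if end is not None and value > end:
            continue
        kept.append((subject_id, event_age))
    return kept

events = [("A", 60), ("A", 68), ("A", 72), ("B", 12)]
ages_10_to_18 = censor_events(events, start=10, end=18)
five_years_before = censor_events(events, start=0, end=5, external_age={"A": 70})
```

With a diagnosis at age 70 for subject A, only the event at age 68 falls in the five years leading up to it; the event at age 72 gives a negative interval and is removed.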

3.2.4. Case-control matching

Another common practice in case-control studies such as PheDAS is matching a certain number of control subjects to each case subject based on specified group variables. The pyPheWAS toolkit includes case-control matching through its maximizeControls tool. This tool requires a group file containing group variables and case/control assignments, a list of variables to match on, tolerance intervals for each of those matching variables, and the desired ratio of controls to cases. It constructs a bipartite graph from the cohort in which subjects are the vertices, the case and control groups are two disjoint independent vertex sets, and edges are defined by the matching variables (a case and a control are joined when they agree on those variables within the specified tolerances). To find a first set of matches, it uses the Hopcroft-Karp algorithm [186] to find a mapping between the case and control sets that results in maximal cardinality (i.e., the largest possible number of matches). If the desired matching ratio is larger than 1:1, the first set of matched controls is removed from the graph, and the Hopcroft-Karp algorithm is applied again to find a second set; this repeats until either the desired matching ratio is satisfied or there are no more possible matches. A new group file is saved containing all matched subjects, along with a separate matched pairs file containing the explicit mapping between each individual case and its control(s).
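The repeated-matching procedure can be sketched in a few lines. Note that this sketch swaps in a simple augmenting-path search (Kuhn's algorithm) in place of Hopcroft-Karp: both return a maximum-cardinality bipartite matching, but Hopcroft-Karp is asymptotically faster. All names (`is_compatible`, `max_matching`, `match_controls`) are hypothetical, and matching variables are assumed to be numeric.

```python
def is_compatible(case, control, tolerances):
    """An edge exists when every matching variable agrees within its tolerance."""
    return all(abs(case[var] - control[var]) <= tol for var, tol in tolerances.items())

def max_matching(cases, controls, tolerances):
    """Return {case_id: control_id} of maximum cardinality (augmenting paths)."""
    edges = {
        case_id: [k for k in controls if is_compatible(cases[case_id], controls[k], tolerances)]
        for case_id in cases
    }
    owner = {}  # control_id -> case_id currently matched to it

    def try_assign(case_id, seen):
        for control_id in edges[case_id]:
            if control_id in seen:
                continue
            seen.add(control_id)
            # Take a free control, or re-route its current case to another control.
            if control_id not in owner or try_assign(owner[control_id], seen):
                owner[control_id] = case_id
                return True
        return False

    for case_id in cases:
        try_assign(case_id, set())
    return {case_id: control_id for control_id, case_id in owner.items()}

def match_controls(cases, controls, tolerances, ratio):
    """Repeat the matching up to `ratio` times, removing matched controls each round."""
    remaining = dict(controls)
    matches = {case_id: [] for case_id in cases}
    for _ in range(ratio):
        round_matches = max_matching(cases, remaining, tolerances)
        if not round_matches:
            break  # no more possible matches
        for case_id, control_id in round_matches.items():
            matches[case_id].append(control_id)
            del remaining[control_id]
    return matches

cases = {"c1": {"age": 50}, "c2": {"age": 60}}
controls = {"k1": {"age": 51}, "k2": {"age": 59}, "k3": {"age": 52}, "k4": {"age": 80}}
matched = match_controls(cases, controls, {"age": 2}, ratio=2)
```

Here case c1 receives two controls, while c2 receives only one because no second control lies within two years of its age; control k4 is never matched.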