5 Numerical Studies 5.1 Simulation Setup
5.3 Analysis of Mantle Cell Lymphoma Microarray Data
In this subsection, we apply the proposed method to study the mantle cell lymphoma microarray dataset analyzed by Rosenwald et al. (2003). The dataset contains survival times of 92 patients together with gene expression measurements of 8810 genes for each patient. As 6330 gene expressions contain missing values, we remove them and consider the subset of the remaining 2480 gene expressions. During the study period, 64 patients died of mantle cell lymphoma, and the remaining 28
Table 1 Simulation results of feature screening for Part 1
Columns 5 to 9: $\epsilon_i \sim$ Normal; columns 10 to 14: $\epsilon_i \sim t(v)$. In each block, X(1) to X(4) report Ps and the last column reports Pa.
Model ρ σ Method X(1) X(2) X(3) X(4) Pa X(1) X(2) X(3) X(4) Pa
M1 0.5 1.5 Naive 0.001 0.002 0.002 0.001 0.001 0.000 0.000 0.000 0.000 0.000
M1 0.5 1.5 DC 1.000 0.998 1.000 1.000 0.998 0.833 0.876 0.855 0.853 0.840
M1 0.5 1.5 Proposed 1.000 0.999 1.000 1.000 0.999 0.998 0.996 0.997 0.997 0.996
M1 0.5 2 Naive 0.000 0.002 0.000 0.001 0.000 0.000 0.000 0.000 0.000 0.000
M1 0.5 2 DC 1.000 0.997 0.998 1.000 0.997 0.830 0.869 0.853 0.852 0.837
M1 0.5 2 Proposed 1.000 0.999 1.000 1.000 0.999 0.998 0.996 0.997 0.997 0.996
M1 0.5 3 Naive 0.000 0.000 0.000 0.001 0.000 0.000 0.000 0.000 0.000 0.000
M1 0.5 3 DC 1.000 0.997 0.997 0.998 0.997 0.827 0.858 0.844 0.846 0.830
M1 0.5 3 Proposed 1.000 0.998 1.000 1.000 0.998 0.997 0.996 0.996 0.996 0.996
M1 0.8 1.5 Naive 0.000 0.000 0.000 0.001 0.001 0.000 0.000 0.000 0.000 0.000
M1 0.8 1.5 DC 1.000 0.998 0.998 1.000 0.998 0.845 0.867 0.860 0.855 0.846
M1 0.8 1.5 Proposed 1.000 0.999 1.000 1.000 0.999 1.000 0.998 0.998 0.997 0.997
M1 0.8 2 Naive 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
M1 0.8 2 DC 0.998 0.997 0.997 1.000 0.997 0.838 0.864 0.856 0.847 0.839
M1 0.8 2 Proposed 1.000 0.999 1.000 1.000 0.999 0.998 0.996 0.997 0.997 0.996
M1 0.8 3 Naive 0.000 0.000 0.000 0.001 0.000 0.000 0.000 0.000 0.000 0.000
M1 0.8 3 DC 0.998 0.997 0.997 0.996 0.996 0.833 0.845 0.836 0.841 0.835
M1 0.8 3 Proposed 1.000 0.998 0.998 1.000 0.998 0.998 0.997 0.996 0.997 0.996
M2 0.5 1.5 Naive 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
M2 0.5 1.5 DC 0.951 0.958 0.963 0.969 0.951 0.836 0.846 0.847 0.847 0.836
M2 0.5 1.5 Proposed 1.000 0.998 0.998 1.000 0.998 0.998 0.997 0.998 0.997 0.997
M2 0.5 2 Naive 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
M2 0.5 2 DC 0.950 0.955 0.962 0.963 0.950 0.834 0.844 0.845 0.844 0.834
M2 0.5 2 Proposed 1.000 0.997 0.997 0.998 0.997 0.996 0.996 0.995 0.996 0.995
M2 0.5 3 Naive 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
M2 0.5 3 DC 0.946 0.951 0.957 0.957 0.944 0.830 0.839 0.841 0.838 0.830
M2 0.5 3 Proposed 1.000 0.997 0.997 0.998 0.998 0.996 0.995 0.995 0.995 0.995
M2 0.8 1.5 Naive 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
M2 0.8 1.5 DC 0.953 0.956 0.960 0.966 0.953 0.831 0.844 0.844 0.843 0.831
M2 0.8 1.5 Proposed 1.000 0.999 0.998 0.999 0.998 0.998 0.998 0.998 0.997 0.997
M2 0.8 2 Naive 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
M2 0.8 2 DC 0.949 0.951 0.954 0.957 0.951 0.830 0.840 0.841 0.841 0.831
M2 0.8 2 Proposed 1.000 0.997 0.998 0.998 0.997 0.996 0.997 0.997 0.996 0.996
M2 0.8 3 Naive 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
M2 0.8 3 DC 0.945 0.947 0.950 0.953 0.943 0.826 0.831 0.833 0.833 0.831
M2 0.8 3 Proposed 1.000 0.996 0.997 0.997 0.996 0.996 0.996 0.997 0.996 0.996
t(v): the t distribution with v degrees of freedom
Table 2 Simulation results of feature screening for Part 2
Columns 5 to 9: $\epsilon_i \sim$ Normal; columns 10 to 14: $\epsilon_i \sim t(v)$. In each block, X(1) to X(4) report Ps and the last column reports Pa.
Model ρ σ Method X(1) X(2) X(3) X(4) Pa X(1) X(2) X(3) X(4) Pa
M3 0.5 1.5 Naive 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
M3 0.5 1.5 DC 1.000 1.000 1.000 1.000 1.000 0.874 0.878 0.870 0.872 0.870
M3 0.5 1.5 Proposed 1.000 1.000 1.000 1.000 1.000 1.000 0.998 0.998 1.000 0.997
M3 0.5 2 Naive 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
M3 0.5 2 DC 1.000 1.000 1.000 1.000 1.000 0.869 0.874 0.866 0.867 0.865
M3 0.5 2 Proposed 1.000 1.000 1.000 1.000 1.000 0.997 0.997 0.998 1.000 0.997
M3 0.5 3 Naive 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
M3 0.5 3 DC 1.000 1.000 1.000 1.000 1.000 0.866 0.870 0.863 0.863 0.863
M3 0.5 3 Proposed 1.000 1.000 1.000 1.000 1.000 0.996 0.997 0.997 1.000 0.996
M3 0.8 1.5 Naive 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
M3 0.8 1.5 DC 1.000 1.000 1.000 1.000 1.000 0.870 0.875 0.870 0.870 0.870
M3 0.8 1.5 Proposed 1.000 1.000 1.000 1.000 1.000 1.000 0.998 0.998 0.998 0.997
M3 0.8 2 Naive 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
M3 0.8 2 DC 1.000 1.000 1.000 1.000 1.000 0.866 0.870 0.865 0.865 0.865
M3 0.8 2 Proposed 1.000 1.000 1.000 1.000 1.000 0.997 0.997 0.998 0.998 0.996
M3 0.8 3 Naive 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
M3 0.8 3 DC 1.000 1.000 1.000 1.000 1.000 0.860 0.861 0.860 0.860 0.860
M3 0.8 3 Proposed 1.000 1.000 1.000 1.000 1.000 0.996 0.997 0.996 0.997 0.996
M4 0.5 1.5 Naive 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
M4 0.5 1.5 DC 1.000 1.000 1.000 1.000 1.000 0.879 0.881 0.878 0.878 0.878
M4 0.5 1.5 Proposed 1.000 1.000 1.000 1.000 1.000 1.000 0.999 0.998 1.000 0.998
M4 0.5 2 Naive 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
M4 0.5 2 DC 1.000 1.000 1.000 1.000 1.000 0.879 0.877 0.876 0.878 0.876
M4 0.5 2 Proposed 1.000 1.000 1.000 1.000 1.000 0.999 0.998 0.998 1.000 0.998
M4 0.5 3 Naive 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
M4 0.5 3 DC 1.000 1.000 1.000 1.000 1.000 0.876 0.877 0.876 0.875 0.875
M4 0.5 3 Proposed 1.000 1.000 1.000 1.000 1.000 0.997 0.997 0.998 0.999 0.997
M4 0.8 1.5 Naive 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
M4 0.8 1.5 DC 1.000 1.000 1.000 1.000 1.000 0.875 0.876 0.875 0.875 0.875
M4 0.8 1.5 Proposed 1.000 1.000 1.000 1.000 1.000 1.000 0.999 0.998 0.998 0.998
M4 0.8 2 Naive 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
M4 0.8 2 DC 1.000 1.000 1.000 1.000 1.000 0.870 0.872 0.871 0.871 0.870
M4 0.8 2 Proposed 1.000 1.000 1.000 1.000 1.000 0.998 0.997 0.997 0.996 0.996
M4 0.8 3 Naive 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
M4 0.8 3 DC 1.000 1.000 1.000 1.000 1.000 0.868 0.867 0.867 0.869 0.866
M4 0.8 3 Proposed 1.000 1.000 1.000 1.000 1.000 0.996 0.995 0.997 0.995 0.995
t(v): the t distribution with v degrees of freedom
patients were censored, yielding a censoring rate of about 30%. As noted by Chen & Yi (2021a), gene expressions are usually measured with error, and it is imperative to account for this feature in the data analysis.
Since the dataset has no additional information to characterize the degree of measurement error, to implement the proposed method, we conduct sensitivity analyses to address the measurement error effects, a commonly used strategy for exploring the impacts of different magnitudes of measurement error (e.g., Chen & Yi 2020, 2021a, b). Let $\Sigma_X$ and $\Sigma_{X^*}$ denote the covariance matrices of $X_i$ and $X_i^*$, respectively, and let $\sigma_{Xlk}$, $\sigma_{X^*lk}$, and $\sigma_{\epsilon lk}$ denote the $(l, k)$ entries of $\Sigma_X$, $\Sigma_{X^*}$, and $\Sigma_\epsilon$, respectively. Measurement error model (7) suggests that $\sigma_{Xlk}$ is smaller than $\sigma_{X^*lk}$ for all $l$ and $k$. To consider possible representative scenarios, we use the sample covariance $\hat{\Sigma}_{X^*}$ to estimate $\Sigma_{X^*}$ and take $\sigma_{Xlk} = 0.9\,\hat{\sigma}_{X^*lk}$, where $\hat{\sigma}_{X^*lk}$ is the $(l, k)$ entry of $\hat{\Sigma}_{X^*}$. To specify $\sigma_{\epsilon lk}$, we use the reliability ratio $R_{lk} = \sigma_{Xlk}/\sigma_{X^*lk} = \sigma_{Xlk}/(\sigma_{Xlk} + \sigma_{\epsilon lk})$ to guide us:

$$\sigma_{\epsilon lk} = \left(R_{lk}^{-1} - 1\right)\sigma_{X^*lk}. \tag{25}$$

For ease of exposition, we take $R_{lk}$ as a common constant for all $l$ and $k$ and let $R$ denote it. Then (25) gives $\Sigma_\epsilon = (R^{-1} - 1)\Sigma_{X^*}$. When $\Sigma_\epsilon$ is given, the distribution of the noise term $\epsilon^{(j)}$ for $j = 1, \dots, 2480$ is specified as a normal distribution or a $t$ distribution with degrees of freedom $\left\lfloor 2\sigma_{\epsilon jj}/(\sigma_{\epsilon jj} - 1) \right\rfloor$. Set $q = \lfloor 92/\log 92 \rfloor = 20$, indicating that we aim to retain the 20 variables with the largest $\omega_j$. For the feature screening method in Sect. 3.2, we directly choose $q$ gene expressions; for the iteration method in Sect. 4, we take $N = 2$, where we retain the first $q_1 = 15$ gene expressions in Step 1 and then select the remaining $q - q_1 = 5$ gene expressions in Step 2.
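The quantities in this sensitivity analysis can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the synthetic data matrix, and the placeholder utilities are all hypothetical. The degrees-of-freedom rule is consistent with matching the variance $v/(v-2)$ of a $t(v)$ variable to the noise variance, which gives $v = 2\sigma/(\sigma - 1)$ and is only meaningful when the variance exceeds 1; the sketch returns NaN otherwise.

```python
import math
import numpy as np


def noise_covariance(x_star, reliability):
    """Noise covariance (1/R - 1) * sample covariance of the observed covariates,
    following the matrix form of (25) with a common reliability ratio R."""
    sigma_x_star = np.cov(x_star, rowvar=False)  # rows are subjects, columns are genes
    return (1.0 / reliability - 1.0) * sigma_x_star


def t_degrees_of_freedom(sigma_eps):
    """floor(2 * s / (s - 1)) for each diagonal entry s of the noise covariance.
    Entries with s <= 1 get NaN, since the rule is undefined or non-positive there."""
    s = np.diag(sigma_eps).astype(float)
    df = np.full(s.shape, np.nan)
    ok = s > 1.0
    df[ok] = np.floor(2.0 * s[ok] / (s[ok] - 1.0))
    return df


def screening_size(n):
    """Number of retained features, floor(n / log n)."""
    return math.floor(n / math.log(n))


def top_q_indices(omega, q):
    """Indices of the q largest screening utilities, largest first."""
    return np.argsort(-np.asarray(omega, dtype=float))[:q]


# Synthetic stand-in for the 92 x 2480 expression matrix (hypothetical data).
rng = np.random.default_rng(0)
x_star = rng.normal(scale=4.0, size=(92, 6))

q = screening_size(92)
for R in (0.80, 0.85, 0.90):  # candidate reliability ratios for the sensitivity analysis
    sigma_eps = noise_covariance(x_star, R)
    print(R, t_degrees_of_freedom(sigma_eps))

omega = rng.uniform(size=6)  # placeholder utilities, not the paper's omega_j
print(q, top_q_indices(omega, min(q, omega.size)))
```

Looping over several values of R mirrors the sensitivity analysis: each value yields a different noise covariance, and the screening step is then rerun to see whether the retained set changes.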
Since the feature screening results are similar for different values of $R$, for ease of presentation we report only the ID numbers of the 20 selected genes for the case with $R = 0.85$ in Table 3, where “FS” represents the feature screening method in Sect. 3.2 and “IFS” stands for the iteration method in Sect. 4. For comparison, we also carry out feature screening with the naive method, which directly implements (13) rather than (15) as in the proposed method, and let “nFS” and “nIFS” denote the naive versions of the feature screening method in Sect. 3.2 and the iteration method in Sect. 4, respectively.
It is interesting that genes “29854,” “27116,” “24721,” and “22155” are retained regardless of the distribution specified for the noise term, while other genes may be retained or excluded, depending on the distributional form of the noise term. For example, “26050” is retained if a normal distribution is assumed for the noise term, but it is excluded if the noise term is assumed to follow a t distribution; conversely, “25230” is retained if the noise term is assumed to follow a t distribution, but it is excluded if a normal distribution is assumed. Regarding the iteration method, we observe that genes “23970,” “16835,” and “31992” are retained regardless of the distribution specified for the noise term. On the other hand, genes “26692”
Table 3 Sensitivity analyses of mantle cell lymphoma microarray data: the results of feature screening
Normal t distribution Naive
FS IFS FS IFS nFS nIFS
1 29854 29854 29854 29854 29598 29598
2 27116 27116 27116 27116 33027 33027
3 22155 22155 24721 24721 16787 16787
4 24721 24721 22155 22155 32583 32583
5 17230 17230 17545 17545 16121 16121
6 16528 16528 15936 15936 32049 32049
7 17545 17545 24819 24819 17548 17548
8 24819 24819 26050 26050 27361 27361
9 15936 15936 34524 34524 24610 24610
10 34524 34524 15936 15936 30282 30282
11 26050 26050 28726 28726 23970 23970
12 25230 25234 25230 25230 16835 16835
13 31935 31935 17176 17176 31992 31992
14 30575 30575 17927 17927 32259 32259
15 26692 26692 27019 27019 31895 31895
16 25234 31895 24850 31895 28726 29930
17 25055 23970 25055 23970 17176 24319
18 30157 16835 30157 16835 17927 31098
19 33570 31992 24545 31992 27019 27998
20 24734 32259 27998 31098 24850 24545
FS: the proposed feature screening method in Sect. 3.2
IFS: the proposed iteration method in Sect. 4
nFS: the naive feature screening method, implementing (13) in Sect. 3.2
nIFS: the naive iteration method, implementing (13) in Sect. 4
and “32259” are retained when the noise term is specified as a normal distribution but are excluded when the noise term is assumed to follow a t distribution, and genes “27019” and “31098” are retained when the noise term is assumed to follow a t distribution but are excluded if a normal distribution is assumed. Finally, we observe that without the error correction step, the genes retained by the naive methods have little in common with those obtained from the proposed methods.
To further explore the relationship between the response and the selected genes, we fit the Cox model to the data with the selected genes obtained, respectively, from the FS and IFS methods under the normal or t distribution specified previously for the measurement error model, with $R = 0.85$:

$$\lambda(t \mid X_I) = \lambda_0(t)\exp\left(X_I^\top \beta\right), \tag{26}$$

where $\lambda_0(t)$ is the unspecified baseline hazard function, $X_I$ is the vector of selected genes with $I$ representing the estimated active set of genes listed in Table 3, and $\beta$ is the vector of associated parameters.
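To make model (26) concrete, the sketch below evaluates the standard Cox partial log-likelihood for a given coefficient vector. It is an illustration under stated assumptions only: it treats the covariates as error-free, so it corresponds to a naive fit and does not implement any measurement error correction. The function name and the synthetic data are hypothetical, and the risk-set computation assumes there are no tied survival times.

```python
import numpy as np


def cox_partial_loglik(beta, x, time, event):
    """Cox partial log-likelihood for lambda(t | x) = lambda_0(t) * exp(x @ beta).

    x: (n, p) covariates; time: (n,) observed times; event: (n,) 1 = death, 0 = censored.
    Assumes no tied times, so each risk set is {j : time_j >= time_i}.
    """
    eta = x @ beta                 # linear predictors
    order = np.argsort(-time)      # sort subjects by decreasing observed time
    eta_sorted = eta[order]
    event_sorted = event[order]
    # After sorting, the risk set of subject i is everyone up to position i,
    # so a running log-sum-exp gives the log of each risk-set sum stably.
    log_risk = np.logaddexp.accumulate(eta_sorted)
    return float(np.sum((eta_sorted - log_risk)[event_sorted == 1]))


# Synthetic example (hypothetical data, unrelated to the lymphoma study).
rng = np.random.default_rng(1)
n, p = 92, 3
x = rng.normal(size=(n, p))
beta_true = np.array([0.5, -0.5, 0.0])
time = rng.exponential(scale=np.exp(-x @ beta_true))  # hazard proportional to exp(x @ beta)
event = (rng.uniform(size=n) < 0.7).astype(int)       # simplistic random censoring indicator

print(cox_partial_loglik(np.zeros(p), x, time, event))
print(cox_partial_loglik(beta_true, x, time, event))
```

Maximizing this function over `beta` with a generic optimizer would give the usual partial likelihood estimate; with error-prone covariates such an estimate is generally biased, which is why a correction step is needed before interpreting the coefficients.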
To correct for the measurement error effects, we apply the insertion method proposed by Chen & Yi (2021b) to estimate β. The analysis results, including estimates (EST), standard errors (S.E.), and p-values, are summarized in Table 4.
For the genes that are selected under both distributions for the measurement error term, the estimates are stable and significant at the 0.05 significance level. In contrast, genes “25055” and “33570,” which are retained by FS but excluded by IFS, are not significant. The last five genes retained by IFS are all significant, with small p-values.