Table 1 Selected model size and selection accuracy for Scenario I
                               p=100                          p=1000
n    Method       Result       ρ=0    ρ=0.1  ρ=0.5  ρ=0.9     ρ=0    ρ=0.1  ρ=0.5  ρ=0.9
100  cosine PoSI  E(|Ms|)      3.05   3.17   7.77   11.35     3.23   4.83   16.24  23.90
                  P(T⊂Ms)      1.00   1.00   1.00   1.00      1.00   1.00   1.00   1.00
     LARS-sI      E(|Ms|)      1.00   1.04   1.02   1.00      1.02   1.00   1.00   1.00
                  P(T⊂Ms)      0.00   0.00   0.01   0.00      0.00   0.00   0.00   0.00
200  cosine PoSI  E(|Ms|)      3.00   3.00   5.68   10.02     3.00   3.05   10.98  19.94
                  P(T⊂Ms)      1.00   1.00   1.00   1.00      1.00   1.00   1.00   1.00
     LARS-sI      E(|Ms|)      2.94   3.00   3.00   2.96      1.04   1.04   1.00   1.00
                  P(T⊂Ms)      0.93   0.99   0.98   0.98      0.01   0.00   0.00   0.00
0, 0.1, 0.5, or 0.9, respectively. This scenario follows Example I in Fan and Lv (2008) with a fixed σ² = 1.
Table 1 reports the average model sizes and the percentage of selected models that include all the true predictors. It shows that the proposed cosine PoSI method works perfectly for the cases of n = 200, p = 100, 1000, and ρ = 0 (independent predictor variables) and works very well for the cases of n = 100, p = 100, 1000, and ρ = 0. The selected model size increases as the value of ρ increases, but the increments are very small. The selection accuracy is perfect in all cases, which means the selected final model always contains the entire set of true covariates (those with non-zero coefficients). The competitor LARS-sI also works very well for the low-dimensional case of n > p (n = 200, p = 100), with a selection accuracy of over 90%, but for the high-dimensional cases p ≥ n, LARS-sI is very conservative, keeping only the “strongest” (the first one) variable in the model.
3.1.2 Scenario II: Auto-Regressive Correlation
In this scenario, we use model (1) with β = (3, 1.5, 0, 0, 2, 0, ..., 0)^T. The predictors X1, ..., Xp and the noise are generated in the same way as in the first scenario, but with a different covariance matrix for the predictors. The covariance matrix has entries σii = 1, i = 1, ..., p, and σij = ρ^{|i−j|}, i ≠ j. This example is modified from Example 1 of Tibshirani (1996), with ρ set to 0, 0.5, 0.7, or 0.9.
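As a minimal sketch of how data for this scenario could be simulated (not the authors' code), the Python snippet below builds the covariance matrix with entries ρ^{|i−j|} and draws predictors through a stationary AR(1) recursion, which yields exactly that covariance. It assumes that the predictors and the noise are Gaussian, that model (1) is the linear model Y = Xβ + noise, and that the noise standard deviation is 1; the function names `ar1_covariance` and `simulate_scenario_ii` are hypothetical.

```python
import math
import random


def ar1_covariance(p, rho):
    """Covariance matrix with sigma_ii = 1 and sigma_ij = rho**|i - j|."""
    return [[rho ** abs(i - j) for j in range(p)] for i in range(p)]


def simulate_scenario_ii(n, p, rho, sigma=1.0, seed=0):
    """Draw (X, y, beta) with AR(1)-correlated predictors (requires p >= 5).

    Assumptions (not stated in this section): Gaussian predictors and
    noise, and a linear model y = X beta + noise.
    """
    rng = random.Random(seed)
    beta = [3.0, 1.5, 0.0, 0.0, 2.0] + [0.0] * (p - 5)
    scale = math.sqrt(1.0 - rho ** 2)
    X, y = [], []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0)]
        for _ in range(1, p):
            # Stationary AR(1) step: keeps unit variance and gives
            # Cov(X_i, X_j) = rho**|i - j|.
            row.append(rho * row[-1] + scale * rng.gauss(0.0, 1.0))
        noise = rng.gauss(0.0, sigma)
        X.append(row)
        y.append(sum(b * x for b, x in zip(beta, row)) + noise)
    return X, y, beta
```

The recursion avoids factorizing a p × p matrix, which is convenient for p = 1000; a Cholesky factor of `ar1_covariance(p, rho)` would be an equivalent route.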
The average model sizes and the percentage of selected models that include all the true predictors are reported in Table 2. We see the same pattern as in the first scenario: the proposed post-selection method always retains all the true predictors, with an accuracy rate of 100%, even when the predictors are highly correlated (ρ = 0.9), at the cost of a small increase in model size. The selected model size increases as the value of ρ increases. The competitor LARS-sI works acceptably for the low-dimensional case (n > p), although with a somewhat low accuracy, between 46% and 82%. In the high-dimensional cases (n < p), it keeps only one predictor in the final model and therefore always misses some true predictors, so its accuracy is zero.
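The two quantities reported in Tables 1 and 2 can be computed from the selected index sets of the simulation replications. The sketch below is illustrative only: `summarize_selection` is a hypothetical helper, the replications in the example are made up, and the true set {0, 1, 4} is the zero-based position of the non-zero entries of β in Scenario II.

```python
def summarize_selection(selected_models, true_set):
    """Return (average model size, fraction of runs containing true_set).

    selected_models: one collection of selected column indices per
    replication. The two outputs correspond to E(|Ms|) and P(T ⊂ Ms).
    """
    true_set = set(true_set)
    sizes = [len(set(m)) for m in selected_models]
    covered = [true_set <= set(m) for m in selected_models]
    return sum(sizes) / len(sizes), sum(covered) / len(covered)


# Four made-up replications; the third one misses two true predictors.
runs = [{0, 1, 4}, {0, 1, 4, 7}, {0}, {0, 1, 4, 2, 9}]
mean_size, coverage = summarize_selection(runs, {0, 1, 4})
print(mean_size, coverage)  # prints: 3.25 0.75
```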
Table 2 Selected model size and selection accuracy for Scenario II
                               p=100                          p=1000
n    Method       Result       ρ=0    ρ=0.5  ρ=0.7  ρ=0.9     ρ=0    ρ=0.5  ρ=0.7  ρ=0.9
100  cosine PoSI  E(|Ms|)      3.03   3.15   3.51   4.24      3.23   3.33   3.79   4.71
                  P(T⊂Ms)      1.00   1.00   1.00   1.00      1.00   1.00   1.00   1.00
     LARS-sI      E(|Ms|)      1.12   1.04   1.00   1.00      1.02   1.00   1.00   1.00
                  P(T⊂Ms)      0.01   0.00   0.00   0.00      0.00   0.00   0.00   0.00
200  cosine PoSI  E(|Ms|)      3.00   3.04   3.39   4.10      3.00   3.04   3.39   4.01
                  P(T⊂Ms)      1.00   1.00   1.00   1.00      1.00   1.00   1.00   1.00
     LARS-sI      E(|Ms|)      2.74   2.73   2.77   2.37      1.00   1.00   1.00   1.00
                  P(T⊂Ms)      0.76   0.80   0.82   0.46      0.00   0.00   0.00   0.00
3.2 A Real-Data Application
We apply the proposed method to the data reported in Scheetz et al. (2006) to illustrate its usage. In that study, F1 rats were intercrossed, and eye tissues from 120 twelve-week-old male F2 offspring rats were obtained. Microarrays were then used to measure the expression of over 31,042 genes in those eye tissues. Among those genes, one labeled “TRIM32” was recently found to cause Bardet–Biedl syndrome, and it is believed to be linked with a small number of other genes. We are interested in finding the genes that are possibly linked to “TRIM32” using the gene expression data.
A subset of the microarray study can be found in the R package flare. The dataset contains two parts:
(1) The gene expression data of 200 genes for the 120 rats are recorded in X, a 120 × 200 matrix.
(2) The gene expression data of the gene “TRIM32” for the 120 rats are recorded in a vector Y.
The relationship between the gene expression of “TRIM32” and that of the other genes is postulated using model (1). Leave-one-out (LOO) cross-validation is used to estimate the model accuracy in the selection process, measured by the total square error Σ_{i=1}^{n} (Y_i − Ŷ_i)^2. For each training set, we run each post-selection inference procedure to select a model, refit the selected variables by ordinary least squares (OLS), and use the fitted model to predict the expression of the “TRIM32” gene for the rat that was left out. Table 3 reports the mean and the standard deviation (SD) of the total square errors for prediction and the mean and median of the model sizes from the n training sets. Both the mean total square error and its standard deviation from the proposed cosine PoSI are much smaller than those from LARS-sI, which indicates that the proposed cosine PoSI method keeps the useful genes in the post-selection inference procedure, while
Table 3 Data analysis of eye microarray data (LOOCV)
Method       Mean square errors  SD square errors  Model size (mean)  Model size (median)
cosine PoSI  0.3548              0.3523            27.6750            28.0000
LARS-sI      15.5772             1.5019            1.0000             1.0000
Table 4 Final models for eye microarray full data
Method       Model size  MSE     Adjusted R²
cosine PoSI  29          0.0041  0.8009
LARS-sI      1           0.0087  0.5776
LARS-sI is too conservative and most likely screens out many relevant genes. LARS-sI still retains only the very first variable to enter the model.
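The leave-one-out procedure described above (select on the training set, refit by OLS, predict the held-out response) can be sketched as follows. This is a hedged illustration, not the authors' implementation: `top_k_selector` is a hypothetical stand-in that ranks predictors by absolute marginal correlation and is neither cosine PoSI nor LARS-sI, the OLS fit is assumed to include an intercept, and all function names are invented for this example.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting.

    Assumes A is non-singular; a singular system raises ZeroDivisionError.
    """
    k = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(k):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]


def ols_fit(X, y, cols):
    """OLS with an intercept on the chosen columns (normal equations)."""
    Z = [[1.0] + [row[j] for j in cols] for row in X]
    k = len(Z[0])
    ZtZ = [[sum(z[a] * z[b] for z in Z) for b in range(k)] for a in range(k)]
    Zty = [sum(z[a] * yi for z, yi in zip(Z, y)) for a in range(k)]
    return solve_linear(ZtZ, Zty)


def loo_total_squared_error(X, y, select):
    """Return (sum_i (Y_i - Yhat_i)^2, list of selected model sizes)."""
    total, sizes = 0.0, []
    for i in range(len(y)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        cols = select(X_tr, y_tr)          # selection on training data only
        coef = ols_fit(X_tr, y_tr, cols)   # refit selected variables by OLS
        pred = coef[0] + sum(c * X[i][j] for c, j in zip(coef[1:], cols))
        total += (y[i] - pred) ** 2
        sizes.append(len(cols))
    return total, sizes


def top_k_selector(k):
    """Hypothetical placeholder selector: top-k absolute marginal correlation."""
    def select(X, y):
        n, p = len(y), len(X[0])
        y_bar = sum(y) / n
        scores = []
        for j in range(p):
            x_bar = sum(row[j] for row in X) / n
            sxy = sum((row[j] - x_bar) * (yi - y_bar) for row, yi in zip(X, y))
            sxx = sum((row[j] - x_bar) ** 2 for row in X)
            scores.append(abs(sxy) / (sxx ** 0.5 + 1e-12))
        return sorted(sorted(range(p), key=lambda j: -scores[j])[:k])
    return select
```

Running the selector inside the loop, on the training rows only, matters: selecting on the full data and cross-validating only the refit would understate the prediction error.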
We then apply both the cosine PoSI method and the LARS-sI approach to the full data: each method first selects the relevant genes, and the final model is then obtained by estimating the coefficients in a linear regression on the selected genes. Table 4 reports the selected final model size, the mean square error (MSE), and the adjusted R² for the two approaches. The proposed cosine PoSI method selects a larger model than the LARS-sI procedure.
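For reference, the two fit summaries in Table 4 can be computed from the observed responses, the fitted values, and the number k of selected predictors. The sketch below uses the standard adjusted R² formula, 1 − [RSS/(n − k − 1)] / [TSS/(n − 1)]. The MSE is taken here as RSS/n, which is an assumption, since the text does not state the divisor used; `fit_summary` is a hypothetical name.

```python
def fit_summary(y, y_hat, k):
    """Return (MSE, adjusted R^2) for a linear fit with k predictors.

    Assumption: MSE = RSS / n. The adjusted R^2 assumes the model
    includes an intercept, so the residual degrees of freedom are
    n - k - 1.
    """
    n = len(y)
    y_bar = sum(y) / n
    rss = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    tss = sum((a - y_bar) ** 2 for a in y)
    mse = rss / n
    adj_r2 = 1.0 - (rss / (n - k - 1)) / (tss / (n - 1))
    return mse, adj_r2
```

Because the adjustment penalizes each additional predictor, a 29-gene model does not obtain a higher adjusted R² than a 1-gene model merely by being larger.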
The final model from the cosine PoSI method keeps 29 genes (IDs in the package flare):
{153, 55, 99, 87, 42, 85, 180, 177, 109, 90, 199, 112, 36, 185, 62, 136, 200, 155, 187, 146, 188, 134, 141, 172, 127, 11, 54, 181, 164}, while the final model from the LARS-sI procedure is too conservative and includes only the first gene. The adjusted R² of the proposed model is 0.8009, while it is only 0.5776 for the LARS-sI model, which suggests that some useful genes are screened out by LARS-sI. We also note that our own LARS code generates the same solution path as the function “lar” in the package selectiveInference.