The rest of this chapter is organized as follows. Section 2 introduces the inference background and the new method, which can control the FDR and sensitivity simultaneously. Numerical studies are reported in Sect. 3 to show the superior performance of the proposed method in high-dimensional settings, even when strong multicollinearity exists among the predictors. The conclusion and discussion are given in Sect. 4.
2 Methodology
where $v_{M_k} = 1/A_{M_k}$. Hence, $X_{M_k}^T u_{M_k} = A_{M_k} 1_{M_k}$. The correlation vector between the equiangular direction and all predictors can be calculated by
$$a = X^T u_{M_k}. \tag{8}$$
Let $S_{M_k}^T X_{M_k}^T u_{M_k}$ be a subvector of $a$ for $|M_k| < n$. At the $k$th stage of selecting the entering predictor, let $\hat{C}_k$ be the largest absolute value of the correlation between the entering variables and the current residual $Z_k$. LARS finds the predictor that has the smallest angle with the current residual, and proceeds in the direction of $u_{M_k}$, which has the same angle with all $X_{j_k}$'s, $j_k \in M_k$, in a step size of $\hat{\gamma}$ until the next predictor earns its “most correlated” position. By the end of each stage, LARS updates the mean function, i.e.,
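$$\hat{\mu}_{M_{k+1}} = \hat{\mu}_{M_k} + \hat{\gamma}\, u_{M_k}, \tag{9}$$
where
$$\hat{\gamma} = \min_{l \notin M_k}{}^{+} \left\{ \frac{\hat{C}_k - \hat{c}_l}{A_{M_k} - a_l},\; \frac{\hat{C}_k + \hat{c}_l}{A_{M_k} + a_l} \right\}, \tag{10}$$
where $\hat{c}_l$ is the current correlation of the $l$th remaining predictor variable and $\min^{+}$ indicates the smallest positive value. The mean function $\hat{\mu}$ can be written as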
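$$\hat{\mu}_{M_k} = U_{M_k} \mathbf{M}_k, \tag{11}$$
where $U_{M_k} = (u_1, u_2, \cdots, u_k)$ and $\mathbf{M}_k = (\hat{\gamma}_1, \hat{\gamma}_2, \ldots, \hat{\gamma}_k)^T$. Denote $\hat{\beta}(\hat{C}_k)$ as the regression coefficients of the active predictors at stage $k$, $\hat{\beta}(\hat{C}_k) = (X_S^T X_S)^{-1} X_S^T U_{M_k} \mathbf{M}_k$. The current correlation can also be expressed as the score vector of the least squares criterion with entering predictor:
$$\hat{C}_k = -\frac{s_{j_k}}{2} \left. \frac{\partial}{\partial \beta_{j_k}} \sum_{i=1}^{n} (y_i - x_i^T \beta)^2 \right|_{\beta = \hat{\beta}(\hat{C}_k)}. \tag{12}$$
Define $\theta(X_{j_k}, Z_k)$ as the angle between the vectors $X_{j_k}$ and $Z_k$. Since $X_{j_k}$ is standardized, we have
$$\cos\{\theta(X_{j_k}, Z_k)\} = \frac{\|X^T Z_k\|_\infty}{\|Z_k\|_2} = \frac{\hat{C}_k}{\|Z_k\|_2}. \tag{13}$$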
In general, $|\cos\{\theta(X_{j_k}, Z_k)\}|$, $k = 1, 2, 3, \ldots$, diminish stochastically. The LARS solution path ends at a predetermined step or when the angle $\theta(X_{j_k}, Z_k)$ is very close to $\pi/2$, i.e., when the remaining variable is almost orthogonal to the current residual.
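The quantities in Eqs. (8), (10), and (13) can be illustrated with a minimal NumPy sketch of a single LARS step. It follows the standard construction of the equiangular direction in Efron et al. (2004) and assumes the columns of `X` are standardized to unit norm. The function name `lars_step`, the tie tolerance, and the toy data are hypothetical and are not part of the original text.

```python
import numpy as np

def lars_step(X, y, mu):
    """One LARS step from the current fit mu (hypothetical helper, unit-norm columns assumed)."""
    c = X.T @ (y - mu)                     # current correlations with the residual Z_k
    C = np.max(np.abs(c))                  # largest absolute correlation, C_k
    active = np.isclose(np.abs(c), C)      # predictors tied at the maximum (active set)
    s = np.sign(c[active])
    XA = X[:, active] * s                  # sign-adjusted active columns
    g_inv_one = np.linalg.solve(XA.T @ XA, np.ones(XA.shape[1]))
    A = 1.0 / np.sqrt(g_inv_one.sum())     # A_{M_k}
    u = XA @ (A * g_inv_one)               # equiangular unit vector u_{M_k}
    a = X.T @ u                            # Eq. (8)
    # Eq. (10): smallest positive candidate over the inactive predictors
    cand = np.concatenate([(C - c[~active]) / (A - a[~active]),
                           (C + c[~active]) / (A + a[~active])])
    cand = cand[cand > 1e-12]
    gamma = cand.min() if cand.size else C / A   # last step: move to the least squares fit
    cos_theta = C / np.linalg.norm(y - mu)       # Eq. (13)
    return mu + gamma * u, gamma, cos_theta

# Toy data (illustrative only): two orthonormal predictors in R^3
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
y = np.array([3.0, 1.0, 0.0])
mu1, gamma1, cos1 = lars_step(X, y, np.zeros(3))
```

In this toy example the first predictor enters, and the step stops when the second predictor ties with it in absolute correlation, which is the "most correlated" handover described above.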
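Lemma 1 For $A_{M_k} \geq 1$, the sequence $|\cos\{\theta(X_{j_k}, Z_k)\}|$, $k = 1, 2, \ldots, n-1$, is non-increasing along the LARS solution path.
Proof For simplicity, we use $\theta_k$ to denote $\theta(X_{j_k}, Z_k)$.
Note that $\hat{C}_k$ declines as $k$ increases (Efron et al. 2004). Showing
$$1 \geq \frac{\hat{C}_1}{\|Z_1\|_2} \geq \frac{\hat{C}_2}{\|Z_2\|_2} \geq \ldots$$
is equivalent to showing
$$\frac{\hat{C}_k}{\hat{C}_{k+1}} \geq \frac{\|Z_k\|_2}{\|Z_{k+1}\|_2} \geq 1, \quad \text{for } k = 1, 2, \ldots.$$
By Eq. (9), $Z_k - Z_{k+1} = \hat{\gamma}_k u_{M_k}$. Hence, $\hat{\gamma}_k^2 = (Z_k - Z_{k+1})^T (Z_k - Z_{k+1})$, for $k = 1, 2, \ldots$.
From Eqs. (5), (8), (9), and (12), we obtain
$$\hat{C}_k - \hat{C}_{k+1} = \hat{\gamma}_k A_{M_k} \geq \hat{\gamma}_k = \|Z_k - Z_{k+1}\|_2 \geq \|Z_k\|_2 - \|Z_{k+1}\|_2.$$
The last inequality follows from the triangle inequality. We can obtain
$$\frac{\hat{C}_k}{\hat{C}_{k+1}} \geq \frac{\|Z_k\|_2}{\|Z_{k+1}\|_2},$$
that is, $|\cos(\theta_k)| \geq |\cos(\theta_{k+1})|$, for $k = 1, 2, \ldots, n-1$.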
Note that in the traditional linear regression model with intercept, $(1/A_{M_k})^2$ is the first diagonal element of the hat matrix, which always lies between $1/n$ and 1.
Lemma 2 For $Z\,(\neq 0) \in \mathbb{R}^n$, the following events are equivalent:
$$\{\|Z_{k+1}\|_2 \cos\theta_{k+1} \leq \|Z_k\|_2 \cos\theta_k \leq \|Z_{k-1}\|_2 \cos\theta_{k-1}\} = \{\theta_{k-1} \leq \theta_k \leq \theta_{k+1}\}.$$
Proof The event on the left-hand side is equivalent to $\{\hat{C}_{k+1} \leq \hat{C}_k \leq \hat{C}_{k-1}\}$, for $k = 2, 3, \ldots$, which has the monotone property shown in Efron et al. (2004). The monotonicity of the $\theta$'s and the one-to-one correspondence between $\hat{C}_k$ and $\theta_k$, $k = 2, 3, \ldots$, have been verified in Lemma 1. Hence, the above events are equivalent.
Recall that in the linear regression model (1), a negligible or zero residual $e_i = y_i - x_i^T \hat{\beta}$ indicates a good prediction. In the LARS context, the absolute value of the corresponding angle at each knot is bounded by $\pi/2$, and no more predictors will enter the model once the angle is “big” enough. We consider an angle close to $\pi/2$ to be “big” enough.
We can make inference using the angles by assuming that they follow a (truncated) cosine distribution. We connect the angle $\theta_k$ of each LARS solution path to the incremental null hypothesis that measures whether $M_k$ statistically surpasses $M_{k-1}$ or not. The limiting distribution of the maximum angle can be used to carry out an efficient and robust significance test for each predictor variable.
We propose a truncated cosine distribution in a data-driven fashion. Let $\theta_{(1)}$ and $\theta_{(n)}$ be the minimum and maximum order statistics of the $\theta$'s, respectively. On the domain $[\theta_{(1)}, \theta_{(n)}]$, we define the following (truncated) cosine distribution with the density function:
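$$f(\theta) = \begin{cases} \dfrac{1}{2b} \cos\left(\dfrac{\theta - a}{b}\right) & \text{if } \theta_{(1)} \leq \theta \leq \theta_{(n)}, \\ 0 & \text{otherwise}, \end{cases} \tag{14}$$
where the location parameter is $a = (\theta_{(1)} + \theta_{(n)})/2$ and the scale parameter is $b = (\theta_{(n)} - \theta_{(1)})/\pi$.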
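Its cumulative distribution function (CDF) is given by
$$F(\theta) = \begin{cases} 0 & \text{if } \theta < \theta_{(1)}, \\ \sin^2\left(\dfrac{\theta - a}{2b} + \dfrac{\pi}{4}\right) & \text{if } \theta_{(1)} \leq \theta \leq \theta_{(n)}, \\ 1 & \text{if } \theta > \theta_{(n)}. \end{cases} \tag{15}$$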
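As a quick numerical illustration of Eqs. (14) and (15), the following sketch evaluates the truncated cosine density and CDF for a set of angles. The helper names (`cosine_params`, `cosine_pdf`, `cosine_cdf`) and the example angles are hypothetical choices made for illustration only.

```python
import numpy as np

def cosine_params(theta):
    """Data-driven location a, scale b, and support [lo, hi] from the observed angles."""
    lo, hi = float(np.min(theta)), float(np.max(theta))
    return (lo + hi) / 2.0, (hi - lo) / np.pi, lo, hi

def cosine_pdf(t, a, b, lo, hi):
    """Density of Eq. (14): cos((t - a) / b) / (2b) on [lo, hi], zero elsewhere."""
    t = np.asarray(t, dtype=float)
    inside = (t >= lo) & (t <= hi)
    return np.where(inside, np.cos((t - a) / b) / (2.0 * b), 0.0)

def cosine_cdf(t, a, b, lo, hi):
    """CDF of Eq. (15); clipping to [lo, hi] yields 0 below and 1 above the support."""
    t = np.clip(np.asarray(t, dtype=float), lo, hi)
    return np.sin((t - a) / (2.0 * b) + np.pi / 4.0) ** 2

# Example angles in radians (illustrative values only)
theta = np.array([0.3, 0.7, 1.1, 1.4])
a, b, lo, hi = cosine_params(theta)
grid = np.linspace(lo, hi, 20001)
vals = cosine_pdf(grid, a, b, lo, hi)
total_mass = np.sum((vals[1:] + vals[:-1]) / 2.0 * np.diff(grid))  # trapezoidal rule
```

The density integrates to one over the support, and the CDF equals one half at the location parameter, which gives a simple sanity check on the two formulas.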
This CDF, $F(\theta)$, of the cosine distribution can be used to test the hypothesis of whether “$M_k$ improves over $M_{k-1}$” by the following theorem.
Theorem 1 Assume that the covariate vectors $X_j$, $j = 1, \ldots, p$, are linearly independent in the LARS solution path. Let $\theta_{(j)}$, $j = 1, \ldots, n$, be the corresponding angle at each knot $\hat{C}_j$ in the first $n$ steps, and let $a$ and $b$ be defined as in Eq. (14). If Lemmas 1 and 2 hold, then
$$\frac{n}{2b^2} \left(\frac{\pi}{2} - \theta_{(n)}\right)^2 \xrightarrow{d} \chi_2^2 \quad \text{as } n \to \infty, \tag{16}$$
where $\chi_2^2$ denotes a chi-square random variable with $df = 2$.
Proof We know that the $\theta_{(j)}$'s, $j = 1, \ldots, n$, are monotone increasing. Hence, $\theta_{(1)}$ and $\theta_{(n)}$ can be considered as the minimum and maximum order statistics of the $\theta$'s. As the dimension increases, $\pi/2 - \theta_{(n)}$ will diminish stochastically.
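Let $\tilde{\theta}_n = \frac{n}{2b^2} \left(\frac{\pi}{2} - \theta_{(n)}\right)^2$. From the CDF of the cosine distribution in Eq. (15) and basic trigonometric identities, the distribution of $\tilde{\theta}_n$ can be derived as follows:
$$\begin{aligned}
P(\tilde{\theta}_n \leq g) &= P\left\{ \frac{n}{2b^2} \left(\frac{\pi}{2} - \theta_{(n)}\right)^2 \leq g \right\} \\
&= P\left\{ \theta_{(n)} \geq \frac{\pi}{2} - b \left(\frac{2g}{n}\right)^{1/2} \right\} \\
&= 1 - \sin^{2n}\left[ \frac{\frac{\pi}{2} - b \left(\frac{2g}{n}\right)^{1/2} - a}{2b} + \frac{\pi}{4} \right] \\
&= 1 - \cos^{2n}\left[ \frac{1}{2} \left(\frac{2g}{n}\right)^{1/2} + \frac{\pi}{4} - \frac{\pi}{4b} + \frac{a}{2b} \right],
\end{aligned}$$
over $0 \leq g \leq \frac{n\pi^2}{8b^2}$. Therefore, the limiting distribution of $\tilde{\theta}_n$ is obtained as
$$\lim_{n \to \infty} P(\tilde{\theta}_n \leq g) = 1 - \lim_{n \to \infty} \cos^{2n}\left[ \frac{1}{2} \left(\frac{2g}{n}\right)^{1/2} + \frac{\pi}{4} - \frac{\pi}{4b} + \frac{a}{2b} \right] \approx 1 - e^{-g/2}, \quad g \geq 0,$$
since
$$\cos^{2n}\left[ \frac{1}{2} \left(\frac{2g}{n}\right)^{1/2} + \frac{\pi}{4} - \frac{\pi}{4b} + \frac{a}{2b} \right] \approx \left(1 - \frac{g}{4n}\right)^{2n} \to e^{-g/2} \quad \text{as } n \to \infty.$$
Hence, $\tilde{\theta}_n \xrightarrow{d} \chi_2^2$.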
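The approximation in the last step can be spelled out with a second-order Taylor expansion of the cosine. The sketch below assumes that the constant offset $\frac{\pi}{4} - \frac{\pi}{4b} + \frac{a}{2b}$ vanishes; by the definitions of $a$ and $b$ this happens exactly when $\theta_{(n)} = \pi/2$. This condition is an assumption introduced here for illustration and is not stated in the proof itself.

```latex
\cos^{2n}\!\left[\frac{1}{2}\left(\frac{2g}{n}\right)^{1/2}\right]
  = \left[1 - \frac{1}{2}\cdot\frac{1}{4}\cdot\frac{2g}{n} + O\!\left(n^{-2}\right)\right]^{2n}
  = \left[1 - \frac{g}{4n} + O\!\left(n^{-2}\right)\right]^{2n}
  \longrightarrow e^{-g/2}, \qquad n \to \infty .
```

The limit $1 - e^{-g/2}$ is the CDF of a chi-square random variable with two degrees of freedom, which matches the statement of Theorem 1.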
The limiting distribution of $\tilde{\theta}_n$ determines whether the corresponding angle at knot $\hat{C}_k$ is “big” enough. A sequence of $p$-values can be obtained by using the above property, $P(\chi_2^2 > \tilde{\theta}_j)$, $j = 1, \ldots, n$.
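A minimal sketch of the $p$-value computation is given below. It assumes that $\tilde{\theta}_j$ is obtained by plugging the $j$th angle into the statistic of Theorem 1, which is an interpretation of the text rather than something it spells out, and it uses the closed-form survival function $P(\chi_2^2 > x) = e^{-x/2}$. The function name `angle_pvalues` and the example angles are hypothetical.

```python
import numpy as np

def angle_pvalues(theta):
    """p-values P(chi2_2 > stat_j) for angles theta_j in radians (hypothetical helper)."""
    theta = np.asarray(theta, dtype=float)
    n = theta.size
    b = (theta.max() - theta.min()) / np.pi           # scale parameter from Eq. (14)
    stat = n / (2.0 * b**2) * (np.pi / 2.0 - theta) ** 2
    return np.exp(-stat / 2.0)                        # chi-square(2) survival function

# Example: angles increasing toward pi/2 along the path (illustrative values only)
angles = np.array([0.4, 0.9, 1.3, np.pi / 2.0])
pvals = angle_pvalues(angles)
```

Angles far from $\pi/2$ give small $p$-values, while an angle equal to $\pi/2$ gives a $p$-value of one, in line with the idea that no further predictor should enter once the angle is "big" enough.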
2.2 Selection Criteria
Definition 1 (Family of “Accumulation Tests,” Li & Barber, 2017) Let $M_m$ be the model that includes the first $m$ entries. For an integer $k \in \{1, \ldots, m\}$, a sequence of null hypotheses, $H_j$, $j = 1, 2, \ldots, k$, measures whether model $M_j$ statistically surpasses $M_{j-1}$ or not. Suppose there is a sequence of uniformly distributed $p$-values, $p_1, p_2, \ldots, p_k \in [0, 1]$, corresponding to the hypotheses $H_j$. For a given function $\phi : [0, 1] \to [0, \infty)$ satisfying $\int_{0}^{1} \phi(t)\, dt = 1$, where $\phi$ is termed the “accumulation function,” the “accumulation tests” determine the stopping point $\hat{k}$ to control the FDR at level $\alpha$:
$$\hat{k}_\phi = \max\left\{ k \in \{1, \ldots, m\} : \frac{1}{k} \sum_{j=1}^{k} \phi(p_j) \leq \alpha \right\}. \tag{17}$$
We suggest using $\phi(x) = \frac{x}{\sqrt{1 - x^2}}$ to choose the stopping point $\hat{k}_\phi$. Testing the hypothesis $H_0$: “the $j$th angle is the maximum one” is equivalent to testing whether the current model is adequate along the LARS solution path. We reject all hypotheses up to $\hat{k}_\phi$ to obtain the final model.
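The stopping rule in Eq. (17) with the suggested accumulation function can be sketched as follows. The function names `phi` and `accumulation_stop`, the convention of returning 0 when no $k$ satisfies the bound, and the example $p$-values are assumptions made for illustration.

```python
import numpy as np

def phi(x):
    """Accumulation function phi(x) = x / sqrt(1 - x^2); it integrates to 1 on [0, 1]."""
    x = np.asarray(x, dtype=float)
    with np.errstate(divide="ignore"):
        return x / np.sqrt(1.0 - x**2)     # phi(1) evaluates to +inf

def accumulation_stop(pvals, alpha):
    """Largest k with (1/k) * sum_{j<=k} phi(p_j) <= alpha, or 0 if no such k exists."""
    vals = phi(pvals)
    running_mean = np.cumsum(vals) / np.arange(1, len(vals) + 1)
    ok = np.nonzero(running_mean <= alpha)[0]
    return int(ok[-1]) + 1 if ok.size else 0

# Illustrative p-values ordered along the solution path
p_example = [0.01, 0.02, 0.03, 0.90, 0.95]
k_hat = accumulation_stop(p_example, alpha=0.2)
```

With these example values the running mean of $\phi(p_j)$ stays below 0.2 for the first three hypotheses and exceeds it afterwards, so the rule keeps the first three entries.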