• Tidak ada hasil yang ditemukan

Characterization of the Performance of Regularized Logistic Regression

Chapter VIII: Precise Performance Analysis of Regularized Logistic Regression in High

8.3 Characterization of the Performance of Regularized Logistic Regression

In this section, we present the main results of this chapter, that is the characterization of the asymptotic performance of regularized logistic regression (RLR). When the estimation performance is measured via a locally-Lipschitz function (e.g. mean-squared error), Theorem 11 precisely predicts the asymptotic behavior of the error. The derived expression captures the role of the regularizer, 𝑓(Β·), and the particular distribution ofwβ˜…, through a set of scalars derived by solving a system of nonlinear equations. We first present this system of nonlinear equations along with some insights on how to numerically compute its solution. Afterwards, we formally state our result in Theorem 11. The result of this theorem will then be used to predict the general behavior of Λ†w. In particular, we compute its correlation with the true signal as well as its mean-squared error.

A nonlinear system of equations

As we will see in Theorem 11, given the signal strengthπœ…and the ratio𝛿, the asymptotic performance of RLR is characterized by the solution to the following system of nonlinear equations with six unknowns(𝛼, 𝜎, 𝛾 , πœƒ , 𝜏, π‘Ÿ).



















ο£²



















ο£³

πœ…2𝛼 = E

π‘Š Proxπœ†πœŽ πœπ‘“Λœ(Β·) 𝜎 𝜏(πœƒπ‘Š +π‘Ÿ

√

𝛿 𝑍) , 𝛾 =

√ 𝛿 π‘Ÿ E

𝑍Proxπœ†πœŽ πœπ‘“Λœ(Β·) 𝜎 𝜏(πœƒπ‘Š +π‘Ÿ

√

𝛿 𝑍) , πœ…2𝛼2+𝜎2 = E

Proxπœ†πœŽ πœπ‘“Λœ(Β·) 𝜎 𝜏(πœƒπ‘Š +π‘Ÿ

√ 𝛿 𝑍)2

, 𝛾2 = 2

π‘Ÿ2 E

𝜌0(βˆ’πœ… 𝑍

1) πœ… 𝛼 𝑍

1+𝜎 𝑍

2βˆ’Prox𝛾 𝜌(Β·)(πœ… 𝛼 𝑍

1+𝜎 𝑍

2)2 , πœƒ 𝛾 =βˆ’2E

𝜌00(βˆ’πœ… 𝑍

1)Prox𝛾 𝜌(Β·) πœ… 𝛼 𝑍

1+𝜎 𝑍

2 , 1βˆ’ 𝛾

𝜎 𝜏

= E 2𝜌0(βˆ’πœ… 𝑍

1) 1+𝛾 𝜌00 Prox𝛾 𝜌(Β·)(πœ… 𝛼 𝑍

1+𝜎 𝑍

2) .

(𝑁 𝐿 𝑆)

Here𝑍 , 𝑍

1, 𝑍

2are standard normal variables, andπ‘Š ∼ Ξ , whereΞ denotes the distribution on the entries ofwβ˜…. The following remarks provide some insights on solving this nonlinear system.

Remark 9 (Proximal Operators). It is worth noting that the equations in (𝑁 𝐿 𝑆) include the expectation of functionals of two proximal operators. The first three equations are in terms

99

of Proxπ‘“Λœ(Β·), which can be computed explicitly for most widely used regularizers. For instance, in β„“

1-regularization, the proximal operator is the well-known shrinkage function defined as πœ‚(π‘₯ , 𝑑) := |π‘₯π‘₯|(|π‘₯| βˆ’π‘‘)+. The remaining equations depend on computing the proximal operator of the link function 𝜌(Β·). Forπ‘₯ ∈R, Prox𝑑 𝜌(Β·)(π‘₯)is the unique solution of𝑧+𝑑 𝜌0(𝑧) =π‘₯.

Remark 10(Numerical Evaluation). Definev:= [𝛼, 𝜎, 𝛾 , πœƒ , 𝜏, π‘Ÿ]𝑇 as the vector of unknowns. The nonlinear system (𝑁 𝐿 𝑆) can be reformulated asv = 𝑆(v) for a properly defined 𝑆 : R6 β†’ R6. We have empirically observed in our numerical simulations that a fixed-point iterative method, v𝑑+1 =𝑆(v𝑑), converges tovβˆ—, such thatvβˆ— =𝑆(vβˆ—).

Asymptotic performance of regularized logistic regression

We are now able to present our main result. Theorem 11 below describes the average behavior of the entries of Λ†w, the solution of the RLR. The derived expression is in terms of the solution of the nonlinear system (𝑁 𝐿 𝑆), denoted by(𝛼,Β― 𝜎,Β― 𝛾 ,Β― πœƒ ,Β― 𝜏,Β― π‘ŸΒ―). An informal statement of our result is that as 𝑛→ ∞, the entries of Λ†wconverge as follows,

wˆ𝑗

→𝑑 Ξ“(wβ˜…π‘—, 𝑍), for 𝑗 =1,2, . . . , 𝑝 , (8.4) where𝑍 is a standard normal random variable, andΞ“:R2β†’Ris defined as,

Ξ“(𝑐, 𝑑) :=Proxπœ†πœŽΒ―πœΒ―π‘“Λœ(Β·) 𝜎¯𝜏¯(πœƒ 𝑐¯ +π‘ŸΒ―

√ 𝛿 𝑑)

. (8.5)

In other words, the RLR solution has the same behavior as applying the proximal operator on the

"perturbed signal", i.e., the true signal added with a Gaussian noise.

Theorem 11. Consider the optimization program (8.3), where for 𝑖 = 1,2, . . . , 𝑛, x𝑖 has the multivariate Gaussian distributionN (0, 1

𝑝I𝑝), and𝑦𝑖 =RAD 𝜌(x𝑇𝑖wβ˜…)

, and the entries ofwβ˜…are drawn independently from a distributionΞ . Assume that the parameters𝛿,πœ…, andπœ†are such that the nonlinear system (𝑁 𝐿 𝑆) has a unique solution (𝛼,Β― 𝜎,Β― 𝛾 ,Β― πœƒ ,Β― 𝜏,Β― π‘ŸΒ―). Then, as 𝑝 β†’ ∞, for any locally-Lipschitz2functionΞ¨:RΓ—Rβ†’ R, we have,

1 𝑝

𝑝

Γ•

𝑗=1

Ξ¨(wˆ𝑗,wβ˜…π‘—) βˆ’β†’P E

Ξ¨ Ξ“(π‘Š , 𝑍), π‘Š , (8.6)

where𝑍 ∼ N (0,1),π‘Š ∼Πis independent of𝑍, and the functionΞ“(Β·,Β·)is defined in(8.5).

2A functionΞ¨:R𝑑→Ris said to belocally-Lipschitzif,

βˆ€π‘€ >0, βˆƒπΏπ‘€ β‰₯0, such that βˆ€x,y∈

βˆ’π‘€ ,+𝑀𝑑

: |Ξ¨(x) βˆ’Ξ¨(y) | ≀ 𝐿𝑀||xβˆ’y||.

We defer the detailed proof to Section 8.7. In short, to show this result we first represent the optimization as a bilinear form,u𝑇Xv, whereXis the measurement matrix. Applying the CGMT to derive an equivalent optimization, we then simplify this optimization to obtain an unconstrained optimization with six scalar variables. The nonlinear system (𝑁 𝐿 𝑆) represents the first-order optimality condition of the resulting scalar optimization.

Before stating the consequences of this result, a few remarks are in order.

Remark 11(Assumptions). The assumptions in Theorem 11 are chosen in a conservative manner.

In particular, we could relax the separability condition on 𝑓(Β·)to some milder condition in terms of asymptotic convergence of its proximal operator. Furthermore, one can relax the assumption on the entries ofwβ˜…being i.i.d. to a weaker assumption on the empirical distribution of its entries.

However, for the applications discussed in this chapter, the theorem in its current form is adequate.

In Chapter 9, we will perform a more general analysis with less restrictive assumptions.

Remark 12 (Choosing Ξ¨). The performance measure in Theorem 11 is computed in terms of evaluation of a locally-Lipschitz function,Ξ¨(Β·,Β·). As an example,Ξ¨(𝑒, 𝑣)= (π‘’βˆ’π‘£)2can be used to compute the mean-squared error. In the next section, we will appeal to this theorem with various choices ofΞ¨to evaluate different performance measures onwΛ†.

Correlation and variance of the RLR estimate

As the first application of Theorem 11, we compute common descriptive statistics of the estimate wΛ†. In the following corollaries, we establish that the parameters ¯𝛼, and ¯𝜎in (𝑁 𝐿 𝑆), respectively, correspond to the correlation and the mean-squared error of the resulting estimate.

Corollary 6. As 𝑝→ ∞, ||w1β˜…||2 wˆ𝑇wβ˜…βˆ’β†’π‘ƒ 𝛼¯ .

Proof. Recall that||wβ˜…||2 = 𝑝 πœ…2. Applying Theorem 11 withΞ¨(𝑒, 𝑣)=𝑒𝑣gives, 1

||wβ˜…||2wˆ𝑇wβ˜…= 1 πœ…2𝑝

𝑝

Γ•

𝑗=1

wˆ𝑗wβ˜…π‘—

βˆ’β†’π‘ƒ 1 πœ…2E

π‘Š Proxπœ†πœŽΒ―πœΒ―π‘“Λœ(Β·) 𝜎¯𝜏¯(πœƒπ‘ŠΒ― +π‘ŸΒ―

√

𝛿 𝑍) =𝛼 ,Β― (8.7) where the last equality is derived from the first equation in the nonlinear system (𝑁 𝐿 𝑆), along with the fact that (𝛼,Β― 𝜎,Β― 𝛾 ,Β― πœƒ ,Β― 𝜏,Β― π‘ŸΒ―)is the solution to this nonlinear system.

Corollary 6 states that upon centering Λ†waround ¯𝛼wβ˜…, it becomes uncorrelated fromwβ˜…. Therefore, we define ˜w := wΛ†

Β―

𝛼 to be an unbiased estimator of wβ˜…. The following corollary computes the mean-squared error of ˜w.

101

Corollary 7. As 𝑝→ ∞, 1𝑝||w˜ βˆ’wβ˜…||2 βˆ’β†’π‘ƒ 𝜎¯2

Β― 𝛼2 .

Proof. We appeal to Theorem 11 withΞ¨(𝑒, 𝑣)= (π‘’βˆ’π›Όπ‘£Β― )2, 1

𝑝

||w˜ βˆ’wβ˜…||2= 1

Β― 𝛼2

1 𝑝

||wΛ† βˆ’π›ΌΒ―wβ˜…||2 𝑃

βˆ’β†’ 1

Β― 𝛼2E

Proxπœ†πœŽΒ―πœΒ―π‘“Λœ(Β·) 𝜎¯𝜏¯(πœƒπ‘ŠΒ― +π‘ŸΒ―

√ 𝛿 𝑍)

βˆ’π›Όπ‘ŠΒ― 2

= 𝜎¯2

Β― 𝛼2

, (8.8) where the last equality is derived from the third equation in the nonlinear system (𝑁 𝐿 𝑆) together

with the result of Corollary 6.

In the next two sections, we investigate other properties of the estimate Λ†w under β„“

1 and β„“2

2

regularization.