Chapter VIII: Precise Performance Analysis of Regularized Logistic Regression in High
8.3 Characterization of the Performance of Regularized Logistic Regression
In this section, we present the main results of this chapter, that is the characterization of the asymptotic performance of regularized logistic regression (RLR). When the estimation performance is measured via a locally-Lipschitz function (e.g. mean-squared error), Theorem 11 precisely predicts the asymptotic behavior of the error. The derived expression captures the role of the regularizer, π(Β·), and the particular distribution ofwβ , through a set of scalars derived by solving a system of nonlinear equations. We first present this system of nonlinear equations along with some insights on how to numerically compute its solution. Afterwards, we formally state our result in Theorem 11. The result of this theorem will then be used to predict the general behavior of Λw. In particular, we compute its correlation with the true signal as well as its mean-squared error.
A nonlinear system of equations
As we will see in Theorem 11, given the signal strengthπ and the ratioπΏ, the asymptotic performance of RLR is characterized by the solution to the following system of nonlinear equations with six unknowns(πΌ, π, πΎ , π , π, π).









ο£²









ο£³
π 2πΌ = E
π Proxππ ππΛ(Β·) π π(ππ +π
β
πΏ π) , πΎ =
β πΏ π E
πProxππ ππΛ(Β·) π π(ππ +π
β
πΏ π) , π 2πΌ2+π2 = E
Proxππ ππΛ(Β·) π π(ππ +π
β πΏ π)2
, πΎ2 = 2
π2 E
π0(βπ π
1) π πΌ π
1+π π
2βProxπΎ π(Β·)(π πΌ π
1+π π
2)2 , π πΎ =β2E
π00(βπ π
1)ProxπΎ π(Β·) π πΌ π
1+π π
2 , 1β πΎ
π π
= E 2π0(βπ π
1) 1+πΎ π00 ProxπΎ π(Β·)(π πΌ π
1+π π
2) .
(π πΏ π)
Hereπ , π
1, π
2are standard normal variables, andπ βΌ Ξ , whereΞ denotes the distribution on the entries ofwβ . The following remarks provide some insights on solving this nonlinear system.
Remark 9 (Proximal Operators). It is worth noting that the equations in (π πΏ π) include the expectation of functionals of two proximal operators. The first three equations are in terms
99
of ProxπΛ(Β·), which can be computed explicitly for most widely used regularizers. For instance, in β
1-regularization, the proximal operator is the well-known shrinkage function defined as π(π₯ , π‘) := |π₯π₯|(|π₯| βπ‘)+. The remaining equations depend on computing the proximal operator of the link function π(Β·). Forπ₯ βR, Proxπ‘ π(Β·)(π₯)is the unique solution ofπ§+π‘ π0(π§) =π₯.
Remark 10(Numerical Evaluation). Definev:= [πΌ, π, πΎ , π , π, π]π as the vector of unknowns. The nonlinear system (π πΏ π) can be reformulated asv = π(v) for a properly defined π : R6 β R6. We have empirically observed in our numerical simulations that a fixed-point iterative method, vπ‘+1 =π(vπ‘), converges tovβ, such thatvβ =π(vβ).
Asymptotic performance of regularized logistic regression
We are now able to present our main result. Theorem 11 below describes the average behavior of the entries of Λw, the solution of the RLR. The derived expression is in terms of the solution of the nonlinear system (π πΏ π), denoted by(πΌ,Β― π,Β― πΎ ,Β― π ,Β― π,Β― πΒ―). An informal statement of our result is that as πβ β, the entries of Λwconverge as follows,
wΛπ
βπ Ξ(wβ π, π), for π =1,2, . . . , π , (8.4) whereπ is a standard normal random variable, andΞ:R2βRis defined as,
Ξ(π, π) :=ProxππΒ―πΒ―πΛ(Β·) πΒ―πΒ―(π πΒ― +πΒ―
β πΏ π)
. (8.5)
In other words, the RLR solution has the same behavior as applying the proximal operator on the
"perturbed signal", i.e., the true signal added with a Gaussian noise.
Theorem 11. Consider the optimization program (8.3), where for π = 1,2, . . . , π, xπ has the multivariate Gaussian distributionN (0, 1
πIπ), andπ¦π =RAD π(xππwβ )
, and the entries ofwβ are drawn independently from a distributionΞ . Assume that the parametersπΏ,π , andπare such that the nonlinear system (π πΏ π) has a unique solution (πΌ,Β― π,Β― πΎ ,Β― π ,Β― π,Β― πΒ―). Then, as π β β, for any locally-Lipschitz2functionΞ¨:RΓRβ R, we have,
1 π
π
Γ
π=1
Ξ¨(wΛπ,wβ π) ββP E
Ξ¨ Ξ(π , π), π , (8.6)
whereπ βΌ N (0,1),π βΌΞ is independent ofπ, and the functionΞ(Β·,Β·)is defined in(8.5).
2A functionΞ¨:RπβRis said to belocally-Lipschitzif,
βπ >0, βπΏπ β₯0, such that βx,yβ
βπ ,+ππ
: |Ξ¨(x) βΞ¨(y) | β€ πΏπ||xβy||.
We defer the detailed proof to Section 8.7. In short, to show this result we first represent the optimization as a bilinear form,uπXv, whereXis the measurement matrix. Applying the CGMT to derive an equivalent optimization, we then simplify this optimization to obtain an unconstrained optimization with six scalar variables. The nonlinear system (π πΏ π) represents the first-order optimality condition of the resulting scalar optimization.
Before stating the consequences of this result, a few remarks are in order.
Remark 11(Assumptions). The assumptions in Theorem 11 are chosen in a conservative manner.
In particular, we could relax the separability condition on π(Β·)to some milder condition in terms of asymptotic convergence of its proximal operator. Furthermore, one can relax the assumption on the entries ofwβ being i.i.d. to a weaker assumption on the empirical distribution of its entries.
However, for the applications discussed in this chapter, the theorem in its current form is adequate.
In Chapter 9, we will perform a more general analysis with less restrictive assumptions.
Remark 12 (Choosing Ξ¨). The performance measure in Theorem 11 is computed in terms of evaluation of a locally-Lipschitz function,Ξ¨(Β·,Β·). As an example,Ξ¨(π’, π£)= (π’βπ£)2can be used to compute the mean-squared error. In the next section, we will appeal to this theorem with various choices ofΞ¨to evaluate different performance measures onwΛ.
Correlation and variance of the RLR estimate
As the first application of Theorem 11, we compute common descriptive statistics of the estimate wΛ. In the following corollaries, we establish that the parameters Β―πΌ, and Β―πin (π πΏ π), respectively, correspond to the correlation and the mean-squared error of the resulting estimate.
Corollary 6. As πβ β, ||w1β ||2 wΛπwβ ββπ πΌΒ― .
Proof. Recall that||wβ ||2 = π π 2. Applying Theorem 11 withΞ¨(π’, π£)=π’π£gives, 1
||wβ ||2wΛπwβ = 1 π 2π
π
Γ
π=1
wΛπwβ π
ββπ 1 π 2E
π ProxππΒ―πΒ―πΛ(Β·) πΒ―πΒ―(ππΒ― +πΒ―
β
πΏ π) =πΌ ,Β― (8.7) where the last equality is derived from the first equation in the nonlinear system (π πΏ π), along with the fact that (πΌ,Β― π,Β― πΎ ,Β― π ,Β― π,Β― πΒ―)is the solution to this nonlinear system.
Corollary 6 states that upon centering Λwaround Β―πΌwβ , it becomes uncorrelated fromwβ . Therefore, we define Λw := wΛ
Β―
πΌ to be an unbiased estimator of wβ . The following corollary computes the mean-squared error of Λw.
101
Corollary 7. As πβ β, 1π||wΛ βwβ ||2 ββπ πΒ―2
Β― πΌ2 .
Proof. We appeal to Theorem 11 withΞ¨(π’, π£)= (π’βπΌπ£Β― )2, 1
π
||wΛ βwβ ||2= 1
Β― πΌ2
1 π
||wΛ βπΌΒ―wβ ||2 π
ββ 1
Β― πΌ2E
ProxππΒ―πΒ―πΛ(Β·) πΒ―πΒ―(ππΒ― +πΒ―
β πΏ π)
βπΌπΒ― 2
= πΒ―2
Β― πΌ2
, (8.8) where the last equality is derived from the third equation in the nonlinear system (π πΏ π) together
with the result of Corollary 6.
In the next two sections, we investigate other properties of the estimate Λw under β
1 and β2
2
regularization.