

V.3.4 Theoretical Proof

In this section, we first analyze the convergence properties of our QuickLogit proposal in terms of convergence guarantees and rate. We then prove that, in practice, our proposal achieves much faster convergence and better model quality.

V.3.4.1 Same Theoretical Convergence as Newton

Our QuickLogit can essentially be regarded as the Newton method with a carefully chosen (instead of random) initialization. It therefore shares at least the same theoretical convergence properties as the Newton method, whose convergence is well established in the literature. We briefly summarize the convergence of QuickLogit below, mirroring the Newton method; interested readers are referred to classical numerical optimization textbooks for the related proofs [19, 104]:

(a) QuickLogit can often converge to the global optimum.

(b) QuickLogit has a quadratic convergence rate.

V.3.4.2 Better Practical Convergence than Newton

The aforementioned equivalence of convergence between QuickLogit and Newton holds in the sense of Big-O notation (i.e., order of magnitude). In practice, our proposal tends to be much better than the Newton method both in terms of model quality and the amortized rate of convergence. This is because our carefully chosen initialization provably has bounded error and is typically much closer to the optimum than a random initialization (the latter being common practice for the standard Newton method), which is precisely the condition Newton requires to reach its promised convergence. Our approach can therefore address two well-known limitations of the standard Newton method: 1) when faced with a poor initialization (e.g., a random guess that is arbitrarily far away), the Newton method may diverge severely and never reach the optimum; 2) with a poor initialization, Newton does not guarantee its fast quadratic convergence rate (at least not in the first many iterations).

Below, we provide theoretical evidence to support our claim of superior convergence for QuickLogit. Briefly, the proof proceeds in two steps:

(a) We prove that our carefully chosen initialization has bounded error with respect to the optimum, and is thus more likely to fall in the vicinity of the optimum than a random initialization.

(b) We prove that the Newton method with an initialization nearer to the optimum leads to accelerated convergence.

In order to present the statistical properties of our QuickLogit algorithm, we first state the standard regularity assumptions on the parameter space and loss functions, which are commonly encountered in the context of asymptotic statistics [146, 169].

Assumption 1. The parameter space is a compact convex set.

Assumption 2. In the neighborhood of the optimal value β, the objective function is locally strongly concave and the Hessian matrix at β is negative definite.

Assumption 3. The first, second, and third partial derivatives of the objective function exist and are bounded. More specifically, the moments of these derivatives are bounded by G, L, and M, respectively.

Assumption 4. The samples are evenly distributed among S local nodes.

Assumptions 1 to 3 are clearly satisfied by our distributed logistic regression model.

Assumption 4 is made mainly to simplify the notation, without loss of generality. In practice, multiparty datasets often have comparable sample sizes, rendering this assumption valid.

Under these regularity assumptions, recent work [169, 133] examines the statistical properties of the averaged estimator and derives a bound on its mean square error. It is shown that the mean square error of the averaged estimator β̄ is bounded above by

\[
E\big[\|\bar{\beta} - \beta\|_2^2\big] \;\le\; O\!\left( \frac{G^2}{\lambda^2 N} + \frac{G^4 M^2 S^2}{\lambda^6 N^2} + \frac{L^2 G^2 \log(d)\, S^2}{\lambda^4 N^2} \right) \tag{V.24}
\]

where N is the total number of samples and S is the number of nodes [137]. In addition, since the objective function is Lipschitz continuous with Lipschitz constant L, Equation V.24 implies that the suboptimality is bounded by

\[
E\big[l_2(\bar{\beta})\big] - l_2(\beta) \;\le\; O\!\left( \frac{L G^2}{\lambda^2 N} + \frac{L G^4 M^2 S^2}{\lambda^6 N^2} + \frac{L^3 G^2 \log(d)\, S^2}{\lambda^4 N^2} \right) \tag{V.25}
\]

Therefore, compared to a random initial value, β̄ has the advantage of being more likely to be close to the optimum β. In particular, by Markov's inequality, for any ε > 0,

\[
P\big(\|\bar{\beta} - \beta\|_2^2 \ge \varepsilon\big)
\;\le\; \frac{E\big[\|\bar{\beta} - \beta\|_2^2\big]}{\varepsilon}
\;\le\; O\!\left( \frac{G^2}{\varepsilon\,\lambda^2 N} + \frac{G^4 M^2 S^2}{\varepsilon\,\lambda^6 N^2} + \frac{L^2 G^2 \log(d)\, S^2}{\varepsilon\,\lambda^4 N^2} \right)
\]

When N goes to ∞, the probability of β̄ being far away from β will converge to zero.

Moreover, if the number of nodes S is of the order O(√N), then β̄ is a √N-consistent estimator of β, since the mean square error of β̄ is bounded above by O(N⁻¹) from Equation V.24.
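To make this last step explicit, substituting S² = O(N) into Equation V.24 gives

\[
E\big[\|\bar{\beta} - \beta\|_2^2\big]
\;\le\; O\!\left( \frac{G^2}{\lambda^2 N}
      + \frac{G^4 M^2}{\lambda^6}\cdot\frac{S^2}{N^2}
      + \frac{L^2 G^2 \log(d)}{\lambda^4}\cdot\frac{S^2}{N^2} \right)
\;=\; O\!\left(\frac{1}{N}\right),
\]

since S²/N² = O(1/N) in this regime.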

For the Newton method, it is critical to find an initial value close to the optimal solution. The convergence of the Newton iterates usually falls into two stages [19]. The first stage is the damped Newton phase, characterized by the condition ‖∇l₂(β)‖₂ ≥ η for some constant η. During this phase, there exists a number γ > 0 such that the objective function value increases by at least γ per iteration:

\[
l_2(\beta^{k+1}) - l_2(\beta^{k}) \;\ge\; \gamma.
\]

When the estimate is close enough to the optimal solution, namely ‖∇l₂(β)‖₂ ≤ η, the optimization process enters the second stage, the quadratically convergent phase, in which the estimate β^k converges to the optimal solution β quadratically:

\[
l_2(\beta^{k+1}) - l_2(\beta) \;\le\; C\big[\,l_2(\beta^{k}) - l_2(\beta)\,\big]^2
\]

for some constant C. In conclusion, the number of iterations until l₂(β) − l₂(β^k) ≤ ε is bounded above by

\[
\frac{l_2(\beta) - l_2(\beta^{0})}{\gamma} + \log\big(\log(\varepsilon_0/\varepsilon)\big), \tag{V.26}
\]

where β^0 is the starting value, and γ and ε₀ are constants depending on the objective function.

The second term is usually small due to the quadratic convergence speed. The dominating term is thus the first one: the closer the starting value is to the optimal solution, the fewer iterations this phase requires. Equation V.25 shows that the suboptimality of the averaged estimator is bounded, which implies that the first term in Equation V.26 is also bounded when β̄ is used as the starting value, in contrast to a random starting value β^0. Moreover, this bound converges to zero as more samples are collected at each institution.
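To see where the log(log(ε₀/ε)) term comes from, one can unroll the quadratic-phase recursion; writing δ_k for the suboptimality gap at iteration k, a short sketch is

\[
\delta_{k+1} \le C\,\delta_k^2
\;\;\Longrightarrow\;\;
C\,\delta_{k+1} \le (C\,\delta_k)^2
\;\;\Longrightarrow\;\;
C\,\delta_{k+m} \le (C\,\delta_k)^{2^m},
\]

so once C·δ_k ≤ 1/2 at the start of the quadratic phase, driving the gap below ε takes only on the order of log(log(ε₀/ε)) additional iterations, with ε₀ absorbing the constants.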

V.3.4.3 Computational Complexity

Regarding computational complexity, we observe that QuickLogit has similar Big-O complexity to the standard (distributed) Newton method. However, our proposal has a drastically lower amortized cost, governed by a large constant factor, which directly translates into our speedup. Since secure computation involving cryptography is orders of magnitude more expensive than its non-secure counterpart, the overall computational cost is dominated by the center-based cryptographic operations, which therefore become the focus of our analysis.

Table V.5: Computational complexity of secure subprotocols.

Operation                Complexity   Note
Cholesky decomposition   O(p³)        Hessian inversion
Back substitution        O(p²)        Hessian inversion
Newton method            O(p³)        Per-iteration
QuickLogit               O(p³)        Per-iteration
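For concreteness, here is a minimal, non-secure sketch of a single Newton update for logistic regression that mirrors the rows of Table V.5: the p × p Newton system is solved by a Cholesky factorization (O(p³)) followed by back substitution (O(p²)) instead of forming an explicit Hessian inverse. The sketch uses plain Python with NumPy/SciPy purely for illustration; in the actual protocol the corresponding operations run as secure subprotocols at the center.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def newton_step(beta, X, y, ridge=1e-8):
    """One Newton update for the logistic log-likelihood l2(beta).

    The dominant costs mirror Table V.5: factorizing the p x p system is
    O(p^3) (Cholesky decomposition) and solving for the step is O(p^2)
    (back substitution).
    """
    p_hat = 1.0 / (1.0 + np.exp(-(X @ beta)))   # predicted probabilities
    grad = X.T @ (y - p_hat)                    # gradient of the log-likelihood
    W = p_hat * (1.0 - p_hat)                   # per-sample Newton weights
    H = (X * W[:, None]).T @ X                  # X' W X (negative of the Hessian)
    H += ridge * np.eye(X.shape[1])             # tiny ridge for numerical stability
    c, low = cho_factor(H)                      # O(p^3) Cholesky decomposition
    step = cho_solve((c, low), grad)            # O(p^2) back substitution
    return beta + step                          # Newton update (maximization)

# Toy usage on synthetic data; QuickLogit would start this loop from the
# averaged estimator beta_bar instead of the zero vector used here.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 10))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ rng.normal(size=10)))))
beta = np.zeros(10)
for _ in range(10):
    beta = newton_step(beta, X, y)
```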

Specifically, the second stage of our proposal still relies on traditional Newton updating iterations, which have a per-iteration complexity of O(p³) (note that computation at the distributed nodes raises no privacy issues for either method, so the O(np²) term of standard Newton is not counted here). For QuickLogit, taking into account the one-time preprocessing that centrally aggregates the local models, which has complexity O(p) and is negligible, the total complexity is roughly O(p³ × number of QuickLogit iterations). For traditional Newton, the complexity is O(p³ × number of Newton iterations). The complexity difference is thus mainly due to the number of iterations needed for the model to converge. As proven theoretically above and shown empirically later, QuickLogit requires substantially fewer iterations than Newton, which directly translates into significant time savings.
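Finally, the one-time preprocessing that centrally aggregates the local models can be pictured with a similar non-secure sketch: each of the S nodes fits a logistic regression on its own samples, and only the resulting coefficient vectors are averaged to form β̄, the warm start analyzed in Section V.3.4.2. The helper names (local_fit, averaged_estimator) and the use of scikit-learn are illustrative assumptions, not the protocol itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def local_fit(X_s, y_s):
    """Fit a logistic regression on one node's local samples and return its
    coefficient vector (intercept first)."""
    clf = LogisticRegression(C=1e6, max_iter=1000)  # large C ~ weak regularization
    clf.fit(X_s, y_s)
    return np.concatenate([clf.intercept_, clf.coef_.ravel()])

def averaged_estimator(local_datasets):
    """Average the S local coefficient vectors to obtain beta_bar, the
    bounded-error initialization analyzed in Equation V.24."""
    betas = [local_fit(X_s, y_s) for X_s, y_s in local_datasets]
    return np.mean(betas, axis=0)

# Toy usage: S = 4 nodes, each holding 500 samples of dimension p = 10.
rng = np.random.default_rng(0)
true_beta = rng.normal(size=11)                 # intercept + 10 coefficients
datasets = []
for _ in range(4):
    X = rng.normal(size=(500, 10))
    logits = true_beta[0] + X @ true_beta[1:]
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))
    datasets.append((X, y))

beta_bar = averaged_estimator(datasets)         # Newton warm start for the center
```

Only the S local coefficient vectors are combined at the center, which is why this preprocessing cost is negligible next to the O(p³) Newton iterations that follow.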