In Part I of this thesis, we study one of the most important families of network dynamics, namely epidemics, or spreading processes. In Part IV of this thesis, we characterize several properties, such as minimax optimality and implicit regularization, of SGD and, more generally, of the family of stochastic mirror descent (SMD) algorithms.
LIST OF ILLUSTRATIONS
5.3 The profit under the normal (no-curtailment) condition and under the (optimal) strategic curtailment, as a function of the size of the aggregator, in the IEEE test case networks: a) IEEE 14-Bus Case, b) IEEE 30-Bus Case, and c) IEEE 57-Bus Case.
Each of the four histograms corresponds to an 11 × 10⁶-dimensional weight vector that perfectly interpolates the data.
LIST OF TABLES
INTRODUCTION
- Major Challenges
- Synopsis of Part I: Network Dynamics
- Synopsis of Part II: Incentives and Markets
- Synopsis of Part III: Distributed Computation
- Synopsis of Part IV: Learning from Data
Distributed computing: The massive computational needs due to the scale of the system (and the fact that the data can be distributed across multiple entities) make it virtually impossible to perform the computations in a central unit. Moreover, none of the existing schemes is accompanied by a computationally efficient algorithm for general non-convex costs.
Network Dynamics
EPIDEMICS OVER COMPLEX NETWORKS: ANALYSIS OF EXACT AND APPROXIMATE MODELS
Introduction
A healthy node has a chance of becoming infected if it has infected neighbors in the network. Because the analysis of this Markov chain is intractable in general, various 𝑛-dimensional linear and nonlinear approximations have been proposed in the literature.
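To make the discussion concrete, the following is a minimal sketch of one common 𝑛-dimensional nonlinear (mean-field-type) approximation of the SIS dynamics; the update rule, the infection and recovery parameters, and the random test graph are illustrative assumptions and may differ from the exact models defined below.

```python
import numpy as np

# Sketch of an n-dimensional nonlinear mean-field SIS update (illustrative;
# the models analyzed in this chapter may use a different parameterization).
# p[i] approximates the marginal probability that node i is infected, beta is
# the per-contact infection probability, delta the recovery probability, and
# A is the adjacency matrix of the contact network.

def sis_mean_field_step(p, A, beta, delta):
    # Probability that node i is NOT infected by any of its infected neighbors.
    no_infection = np.prod(1.0 - beta * A * p[None, :], axis=1)
    # Stay infected and not recover, or be healthy and become infected.
    return (1.0 - delta) * p + (1.0 - p) * (1.0 - no_infection)

rng = np.random.default_rng(1)
n = 100
A = (rng.random((n, n)) < 0.05).astype(float)
A = np.triu(A, 1)
A = A + A.T                                   # undirected graph, no self-loops
p = np.full(n, 1.0)                           # start from the all-infected state
for _ in range(200):
    p = sis_mean_field_step(p, A, beta=0.05, delta=0.4)
print("average marginal infection probability:", p.mean())
```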
Models
- Susceptible-Infected-Susceptible (SIS)
- Exact Markov Chain Model
- Nonlinear Model
- Linear Model
- Susceptible-Infected-Recovered-Susceptible (SIRS)
- Exact Markov Chain Model
- Nonlinear Model
- Linear Model
- Susceptible-Infected-Vaccinated (SIV)
- Exact Markov Chain Model
- Nonlinear Model
- Linear Model
However, the transition matrix of the discrete-time Markov chain model can have non-zero entries anywhere (except in the row of the absorbing state). The all-healthy state (𝑃1(𝑡) = · · · = 𝑃𝑛(𝑡) = 0) is a trivial fixed point of the above model, which is consistent with the absorbing state of the Markov chain model.
Results on the Nonlinear MFA Model
- SIRS/SEIRS/Immune-Admitting-SIS
- SIV/SEIV (Infection-Dominant)
- SIV/SEIV (Vaccination-Dominant)
The disease-free fixed point of the nonlinear model for the infection-dominant SIV and the infection-dominant SEIV epidemics is . The disease-free fixed point of the nonlinear model for the vaccination-dominant SIV and the vaccination-dominant SEIV epidemics is .
Results on the Exact Markov Chain Model
- Connection to the Linear Model
- Connection to the Nonlinear Model
However, it turns out that it provides an upper bound on the probability that the chain is not in the all-healthy state (i.e., the existence of infection) [4], if one initializes the nonlinear model from the all-infected state. Finally, we should note that the reason why it is possible for the nonlinear map to converge to a unique non-origin fixed point when 𝛽𝜆max(𝐴)/𝛿 > 1, even though the original Markov chain model always converges to the all-healthy state, is that this is only an upper bound on P(𝜉(𝑡) ≠ 0̄ | 𝜉(0) = 𝑋).
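As a quick numerical illustration of the spectral condition 𝛽𝜆max(𝐴)/𝛿 < 1 discussed above, one can compute 𝜆max(𝐴) directly from the adjacency matrix; the small example graph and the parameter values below are made up for illustration.

```python
import numpy as np

# Check the (illustrative) spectral condition beta * lambda_max(A) / delta < 1.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)    # a small example graph
beta, delta = 0.1, 0.4
lam_max = np.max(np.linalg.eigvalsh(A))      # A is symmetric, so eigvalsh applies
ratio = beta * lam_max / delta
print("beta * lam_max / delta =", round(ratio, 3))
print("below threshold: fast extinction" if ratio < 1
      else "above threshold: the bound is uninformative")
```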
Heterogeneous Network Models
If the largest eigenvalue of 𝑀 is less than 1, the origin is the unique fixed point and it is globally stable. If the largest eigenvalue of 𝑀 is greater than 1, the model also has a unique nontrivial fixed point, which is globally stable.
Pairwise and Higher-Order Approximate Models
Similar to the previous examples, 𝜆max(𝑀) < 1 ensures that the mixing time of the Markov chain defined by the transition matrix 𝑆(𝑀) is 𝑂(log𝑛).
Summary and Conclusion
Finally, we should note that characterizing the exact epidemic threshold of the Markov chain model is still an open problem. The nonlinear model has the same Jacobian matrix as that of the previous section.
IMPROVED BOUNDS ON THE EPIDEMIC THRESHOLD OF THE EXACT MODELS
- Introduction
- The Markov Chain and Marginal Probabilities of Infection
- An Alternative Bounding Technique
- Connection to Mixing Time of the Markov Chain
- A Lower Bound on the 𝑝𝑖𝑗's
- An Alternative Pairwise Probability (𝑞𝑖𝑗)
- Conclusion and Future Work
If 𝜌(𝑀′) < 1, then the mixing time of the Markov chain whose transition matrix 𝑆 is described by Eq. (…) is 𝑂(log 𝑛). If 𝜌(𝑀′′) < 1 and 1 − 𝛿 − 𝛽 ≥ 0, then the mixing time of the Markov chain whose transition matrix 𝑆 is described by Eq. (…) is 𝑂(log 𝑛).
Incentives and Markets
OPTIMAL PRICING IN MARKETS WITH NON-CONVEX COSTS
Introduction
We formalize the properties desired in the literature in Section 4.2 and discuss which of these properties the existing schemes satisfy in Section 4.5. For example, most of the existing schemes mentioned above are proposed for specific classes of non-convex cost functions and cannot handle more general non-convex costs.
Market Description and Pricing Objectives
- Market Model
- Pricing Objectives
Although in certain cases (e.g., IP pricing [159] for start-up plus linear costs) the total cost of the suppliers, ∑𝑖 𝑐𝑖(𝑞𝑖), coincides with the total payment, ultimately the quantity that determines the cost of meeting the demand is the total payment to the suppliers, not their total production cost.
Proposed Scheme: Equilibrium-Constrained Pricing
- Pricing Formulation
- Linear+Uplift Pricing
- An Efficient Approximation Algorithm
Constraint (4.1d) can be equivalently expressed as (4.2). The key difference between EC pricing and the existing pricing methods for non-convex markets is that it directly minimizes the total payment and seeks both the optimal allocations 𝑞∗ and the corresponding prices. Therefore, minimizing the total payment also limits the total production cost, while the opposite is generally not true (minimizing the total production cost can result in very high payments, as can be seen, e.g., in the case studies in Figures 4.4a and 4.5a).
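To illustrate what it means to minimize the total payment directly, the following is a toy brute-force sketch for a market with start-up plus linear costs; it enforces only simple make-whole (revenue-adequacy) uplifts rather than the full equilibrium constraints of the EC formulation, and the supplier parameters and demand are made-up numbers.

```python
import itertools

# Toy brute-force illustration (not the EC formulation of this chapter): each
# supplier has a start-up cost s_i, a marginal cost a_i, and a capacity qmax_i.
# We search over commitments, dispatch the committed suppliers in merit order,
# and choose a uniform price plus make-whole uplifts so as to minimize the
# total payment.

suppliers = [  # (s_i, a_i, qmax_i) -- made-up numbers
    (100.0, 10.0, 50.0),
    (10.0, 30.0, 50.0),
    (0.0, 45.0, 20.0),
]
demand = 60.0

def dispatch(committed):
    """Merit-order dispatch of the committed suppliers; None if infeasible."""
    q = [0.0] * len(suppliers)
    remaining = demand
    for i in sorted(committed, key=lambda i: suppliers[i][1]):
        q[i] = min(suppliers[i][2], remaining)
        remaining -= q[i]
    return q if remaining <= 1e-9 else None

best = None
for r in range(1, len(suppliers) + 1):
    for committed in itertools.combinations(range(len(suppliers)), r):
        q = dispatch(committed)
        if q is None:
            continue
        for lam in sorted({suppliers[i][1] for i in committed}):  # candidate prices
            # Make-whole uplift: cover any loss of a dispatched supplier.
            uplift = [max(0.0, s + a * q[i] - lam * q[i]) if q[i] > 0 else 0.0
                      for i, (s, a, _) in enumerate(suppliers)]
            payment = lam * demand + sum(uplift)
            if best is None or payment < best[0]:
                best = (payment, q, lam, uplift)

payment, q, lam, uplift = best
print("minimum total payment:", payment, "at uniform price", lam)
print("allocation:", q, "uplifts:", uplift)
```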
Equilibrium-Constrained Pricing for Networked Markets
- Pricing Formulation
When the capacity constraints (4.14c) are relaxed ( 𝑓𝑒 = −∞, 𝑓𝑒 = ∞, ∀𝑒 ∈ 𝐸), the networked market problem reduces to the original (single) market problem. As in the case of the single market, we define an approximate solution to this problem.
Existing Pricing Schemes
- Pricing in Convex Markets
- Pricing in Non-Convex Markets
However, IP pricing assumes knowledge of the optimal solution to the unit commitment problem and is therefore not intended as a practical approach for finding the optimal allocation. The scheme is based on formulating and solving the semi-Lagrangian relaxation (SLR) of the mixed-integer program, in which the market-clearing constraint is semi-relaxed with the usual Lagrange multiplier 𝜆.
Experimental Results
- Case 1: Linear plus startup cost
- Case 2: Quadratic plus startup cost
- A Networked Market with Capacity Constraints
As can be seen in Figure 4.5a, EC1, EC2, EC3, and EC4 achieve the minimum possible total payment, which is equal to the total cost. However, it helps reduce the overall increase in the payments, as we can see in Figure 4.7b.
a) Total payments as a function of power capacity.
Concluding Remarks
In the optimization problem (4.6), the order of the variables in the minimizations does not matter, and further, for any fixed 𝑞1, … . Intermediate nodes: at each new level, there are at most half as many nodes (plus one) as in the previous level.
MANAGING AGGREGATORS IN THE SMART GRID
Introduction
- Summary of Contributions
- Related Work
- Quantifying Market Power in Electricity Markets
- Cyber-Attacks in the Grid
- Algorithms for Managing Distributed Energy Resources
- Algorithms for Bilevel Programs
On the aggregator side, managing a geographically diverse fleet of distributed energy resources is a difficult algorithmic challenge. On the operator's side, the participation of aggregators in electricity markets poses unique challenges in terms of monitoring and limiting the potential of exercising market power.
System Model
- Preliminaries
In contrast, our work presents a polynomial-time algorithm that provably maximizes the profit of the aggregator. The ex-post LMPs are announced as a function of the optimal Lagrange multipliers of this optimization.
The Market Behavior of the Aggregator
- A Profit-Maximizing Aggregator
Since the LMPs are themselves the solution to an optimization problem, the aggregator's problem is a bilevel optimization problem. An important note about this problem is that we have assumed that the aggregator has complete knowledge of the network topology (G) and of the state estimates (𝑝 and 𝑑).
The Impact of Strategic Curtailment
- An Illustrative Example
- Case Studies
Assume that the aggregator's total generation at each bus is 10 MW and that it is able to curtail 1% of it (0.1 MW). As the size of the aggregator (its number of buses) increases, not only does the profit increase (which is expected), but also …
Optimizing Curtailment Profit
- An Exact Algorithm for Single-Node Aggregators in Arbitrary Networks
Even in the simplest case, when the aggregator has only a single node, i.e., its entire generation is located at a single bus, …
- An Approximation Algorithm for Multi-Bus Aggregators in Radial Networks
- Evaluation of the Approximation Algorithm
The other important question is what the impact of strategic curtailment is on the prices at the buses of the network (not necessarily just the aggregator's buses). In particular, we show that an 𝜖-approximation of the optimal curtailment profit can be obtained by an algorithm whose running time is linear in the size of the network and polynomial in 1/𝜖.
Concluding Remarks
In this definition, the value of 𝜂𝑖 captures the ability of the generator/aggregator to exercise market power. Focusing on the expressions involving 𝛼𝑖 for a given 𝑖, the objective of the above optimization problem is of the form (Δ𝑝 …
Distributed Computation
DISTRIBUTED SOLUTION OF LARGE-SCALE SYSTEMS OF EQUATIONS
Introduction
In addition to the optimization-based methods, there are some distributed algorithms specially designed for solving systems of linear equations. We provide a full analysis of the convergence rate of APC (Section 6.3), as well as a detailed comparison with all other distributed methods mentioned above (Section 6.4).
The Setup
In our methodology, the taskmaster assigns a subset of the equations to each of the machines and invokes a distributed consensus-based algorithm to obtain the solution to the original problem in an iterative manner. At each iteration, each machine updates its local solution by adding a scaled version of the projection of an error signal onto the null space of its subsystem of equations, and the taskmaster averages the local solutions with momentum.
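The following is an illustrative sketch of a projection-based consensus iteration of the kind described above, for a square system 𝐴𝑥 = 𝑏 whose rows are split across 𝑚 machines; the step size 𝛾, the momentum parameter 𝜂, the well-conditioned test matrix, and the exact form of the accelerated master update are assumptions for this sketch and need not match the APC updates and tuning given in this chapter.

```python
import numpy as np

# Each machine holds a block (A_i, b_i) and keeps a local solution x_i with
# A_i x_i = b_i at all times: it moves x_i toward the master's estimate only
# along the null space of A_i. The master averages the local solutions and
# adds a momentum term. (Illustrative sketch, not the chapter's exact APC.)

rng = np.random.default_rng(0)
n, m = 60, 4                                          # n unknowns, m machines
A = np.eye(n) + 0.02 * rng.standard_normal((n, n))    # well-conditioned test system
b = rng.standard_normal(n)
blocks = np.array_split(np.arange(n), m)

machines = []
for idx in blocks:
    Ai, bi = A[idx], b[idx]
    x_i = np.linalg.pinv(Ai) @ bi                     # some solution of A_i x = b_i
    P_i = np.eye(n) - np.linalg.pinv(Ai) @ Ai         # projector onto null(A_i)
    machines.append([x_i, P_i])

gamma, eta = 1.0, 0.8                                 # step size and momentum (illustrative)
xbar = np.mean([x for x, _ in machines], axis=0)
xbar_prev = xbar.copy()

for t in range(400):
    for mach in machines:                             # local updates (parallel in practice)
        x_i, P_i = mach
        mach[0] = x_i + gamma * P_i @ (xbar - x_i)
    avg = np.mean([x for x, _ in machines], axis=0)
    xbar, xbar_prev = avg + eta * (xbar - xbar_prev), xbar   # averaging with momentum

print("final residual ||Ax - b|| =", np.linalg.norm(A @ xbar - b))
```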
Accelerated Projection-Based Consensus
- The Algorithm
- Convergence Analysis
- Computation and Communication Complexity
We analyze the convergence of the proposed algorithm and prove that it has linear convergence (i.e., the error decays exponentially), with no additional assumption imposed. The convergence rate of the algorithm is determined by the spectral radius (largest eigenvalue in magnitude) of the (𝑚+1)𝑛 × (𝑚+1)𝑛 block matrix in (6.10).
Comparison with Related Methods
- Distributed Gradient Descent (DGD)
- Distributed Nesterov’s Accelerated Gradient Descent (D-NAG)
- Distributed Heavy-Ball Method (D-HBM)
- Alternating Direction Method of Multipliers (ADMM)
- Block Cimmino Method
- Consensus Algorithm of Mou et al
We should also note that the computational complexity of ADMM is 𝑂(𝑝𝑛) per iteration (the inverse is computed using the matrix inversion lemma), which is again the same as that of the gradient-type methods and APC. It is not difficult to show that the optimal convergence rate of the block Cimmino method is …
Underdetermined System
Note that this is exactly the same as in the proof of Theorem 31, with 𝑋 replaced by 𝑌. This implies that the nullity of the matrix in (6.26) is 𝑛, and every steady-state solution must be a consensus solution, which completes the proof.
Experimental Results
To further verify the performance of the proposed algorithm, we also run all the algorithms on multiple problems, and observe the actual decay of the error. We should also note that initialization does not seem to affect the convergence behavior of our algorithm.
A Distributed Preconditioning to Improve Gradient-Based Methods
The noticeable similarity between the optimal convergence rate of APC and that of the gradient-based methods suggests a simple distributed preconditioning that improves the convergence of the gradient-based methods.
Again, to make the comparison fair, all methods are tuned with their optimal parameters. As can be seen, APC far outperforms the other methods, which is consistent with the orders-of-magnitude differences in the convergence times reported in Table 6.3.
Conclusion
CODED COMPUTATION FOR DISTRIBUTED GRADIENT DESCENT
- Introduction
- Related Work
- Statement of Contributions
- Preliminaries
- Problem Setup
- Computational Trade-offs
- Code Construction
- Balanced Mask Matrices
- Correctness of Algorithm 6
- General construction
- Column-balanced Mask Matrix 𝑀
- Correctness of Algorithm 7
- Building the Encoding Matrix from the Mask Matrix
- Efficient Online Decoding
- Analysis of Total Computation Time
- Numerical Results
- Conclusion
Here, the quantity 𝑇 consists of the time it takes for a machine to compute its part of the gradient. For this value of 𝛼, the number of machines required to successfully recover the gradient is given by …
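As a concrete illustration of how coded computation lets the master recover the full gradient from a subset of machines, the following sketch uses the classic three-machine gradient-coding example from the literature (any one straggler tolerated); this is not the balanced mask-matrix construction of this chapter, and the encoding and decoding coefficients below are the standard ones for that toy example.

```python
import numpy as np

# The data is split into 3 parts with partial gradients g1, g2, g3; each machine
# sends one coded combination, and the full gradient g1 + g2 + g3 can be
# recovered from ANY 2 of the 3 machines.

d = 5
rng = np.random.default_rng(0)
g = rng.standard_normal((3, d))             # the three partial gradients

# Encoding matrix B: machine k sends B[k] @ g.
B = np.array([[0.5, 1.0,  0.0],
              [0.0, 1.0, -1.0],
              [0.5, 0.0,  1.0]])

# Decoding vectors a_S for each surviving pair S, chosen so that a_S @ B[S] = [1, 1, 1].
decoders = {(0, 1): np.array([2.0, -1.0]),
            (0, 2): np.array([1.0,  1.0]),
            (1, 2): np.array([1.0,  2.0])}

full = g.sum(axis=0)                        # the gradient we want to recover
for survivors, a in decoders.items():
    received = B[list(survivors)] @ g       # what the two surviving machines sent
    recovered = a @ received
    print(survivors, "max recovery error:", np.max(np.abs(recovered - full)))
```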
Learning from Data
MINIMAX OPTIMALITY AND IMPLICIT REGULARIZATION OF STOCHASTIC GRADIENT/MIRROR DESCENT
Introduction
- Our Contribution
Therefore, a general characterization of the behavior of stochastic descent algorithms for more general models would be of great interest. The theory also allows us to establish new results, such as the convergence (in a deterministic sense) of SMD in the over-parameterized linear case.
Preliminaries
Furthermore, we show that many properties recently proven in the learning/optimization literature, such as the implicit regularization of SMD in the overparameterized linear case (when convergence occurs) [85], follow naturally from this theory. We also use the theory developed in this chapter to provide some speculative arguments as to why SMD (and SGD) may have similar convergence and implicit-regularization properties in the so-called "highly overparameterized" nonlinear settings.
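For reference, the standard SMD update with a strictly convex, differentiable potential function 𝜓 and step size 𝜂 can be written as follows (this is the textbook form of the update; the chapter's precise notation may differ):

```latex
\nabla\psi(w_t) \;=\; \nabla\psi(w_{t-1}) \;-\; \eta\,\nabla L_{i_t}(w_{t-1}),
\qquad\text{equivalently}\qquad
w_t \;=\; \operatorname*{arg\,min}_{w}\; \eta\,\nabla L_{i_t}(w_{t-1})^{\!\top} w \;+\; D_{\psi}(w, w_{t-1}),
```

where 𝐷𝜓(·,·) denotes the Bregman divergence of 𝜓; for 𝜓(𝑤) = ½‖𝑤‖², the update reduces to ordinary SGD.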
Warm-up: Revisiting SGD on Square Loss of Linear Models
- Conservation of Uncertainty
- Minimax Optimality of SGD
We assume that the loss 𝐿𝑖(·) depends only on the residual, i.e., the difference between the prediction and the true label. However, in this case the convergence is not surprising, since, effectively, after a while the weights are no longer updated, and the more interesting question is what the recursion converges to.
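The following is a small sketch of this warm-up setting: single-sample SGD on the square loss of an over-parameterized linear model, initialized at zero, drives the residuals to zero and ends up essentially at the minimum-norm interpolating solution; the dimensions, step size, and iteration counts are arbitrary illustrative choices.

```python
import numpy as np

# SGD on the square loss of an overparameterized linear model. Starting from
# zero, the iterates stay in the row space of X, so the interpolating solution
# they approach is the minimum-norm one (compared against the pseudoinverse).

rng = np.random.default_rng(0)
n, d = 20, 200                              # far more parameters than data points
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w = np.zeros(d)
eta = 1e-3
for _ in range(2000):                       # epochs of single-sample SGD
    for i in rng.permutation(n):
        residual = X[i] @ w - y[i]          # prediction minus true label
        w -= eta * residual * X[i]          # SGD step on the square loss

w_min_norm = np.linalg.pinv(X) @ y          # minimum-norm interpolating solution
print("training residual:", np.linalg.norm(X @ w - y))
print("distance to min-norm solution:", np.linalg.norm(w - w_min_norm))
```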