The work in this thesis addresses the problems of cost-aware capacity provisioning and cost-aware load balancing in large-scale fault-tolerant GDCs, i.e., GDCs designed to mask single-site failures.
Introduction
Motivation of the Research Work
In this thesis we consider two models for using renewable energy for green data center design. Each data center is characterized by spatiotemporal variation in electricity price, renewable energy availability and failure rate.
Contributions of the Thesis
- Cost-aware Provisioning of Spare Capacity for Fault-tolerant GDCs
- Capacity Planning in Fault-tolerant GDCs Collocated with Renewable Energy Sources
- Optimizing Energy Cost in Fault-tolerant GDCs Satisfying Green Energy Bound
- Game-theoretic Model for Load Balancing in Fault-tolerant GDCs
- Distributed Failure Detection and Efficient Load Balancing in Fault-tolerant GDCs
- Organisation of the Thesis
Most of the work in the literature on load balancing in distributed data centers solves the proposed formulation centrally. In Chapter 7, we propose a data center-initiated distributed load balancing algorithm that ensures QoS after an outage while minimizing operating costs.
Architecture of a GDC
We discuss the popular model to ensure high availability in GDCs and the efforts to minimize operating costs. We also show how demand multiplexing can be leveraged to minimize the operating costs of GDCs.
High Availability Requirement
We provide spare capacity across the sites to mask the unavailability of the data center at any single site. We assume that the failure of a data center at a site is an independent process, that is, simultaneous failure of data centers at more than one location is rare.
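As a minimal sketch of this spare-capacity idea (with hypothetical site capacities and demand, not values from the thesis), the shortfall the surviving sites must cover can be computed per single-site failure scenario:

```python
# Sketch (hypothetical numbers): spare capacity so that the surviving
# sites can absorb total demand after any single-site failure.

def spare_needed(capacity, total_demand):
    """Extra capacity required at the surviving sites for each
    single-site failure scenario; assumes at most one site fails."""
    shortfall = {}
    total_cap = sum(capacity)
    for failed, cap_f in enumerate(capacity):
        surviving = total_cap - cap_f
        shortfall[failed] = max(0, total_demand - surviving)
    return shortfall

# Three sites with 400 servers each; total demand needs 900 servers.
print(spare_needed([400, 400, 400], 900))  # {0: 100, 1: 100, 2: 100}
```

The provisioning model in the thesis optimizes where this slack is placed, trading off TCO against the worst-case shortfall above.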
Energy Cost Components and their Dynamics
- Brown Energy Pricing
- Renewable Energy Sources and Cost Model
- Demand Multiplexing for Improving Utilization
- Geographical Load Balancing
On-site renewable energy generation: In this model, renewable energy sources such as solar panels and wind turbines are connected to the data center as shown in the figure. Renewable Energy Certificate (REC): Also known as a green certificate, it is a market-based instrument to promote renewable energy and facilitate the fulfillment of renewable energy purchase obligations (RPO).
Related Work
- Data Center Placement and Capacity Provisioning
- Geo-Distributed Load Balancing Approaches
The authors of [18] proposed an optimization framework that jointly handles data center placement/capacity provisioning and demand flow control/resource allocation. Several studies have taken additional factors into account, such as powering the data centers (at least partially) from renewable energy sources.
Summary
Therefore, the work in this thesis proposes optimization models and algorithms for the cost-aware design of fault-tolerant GDCs and their load balancing. In the next chapter, we discuss the cost-aware spare capacity provisioning problem and the optimization model for the same.
Introduction
- Motivation
However, multi-site replication involves large replication costs, as data center operators are typically charged for the number of bytes transferred [63] and/or the bandwidth between the replication sites [16]. Although the basic problem is similar to earlier work, we use minimization of the TCO as the objective when provisioning spare capacity, while explicitly accounting for the replication costs under different data replication models.
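A rough sketch of the per-byte replication charge mentioned above; the $0.02/GB tariff is an illustrative assumption, not a value from the thesis or the cited providers:

```python
# Replication cost charged per byte (here per GB) transferred between
# replication sites; the tariff is an illustrative assumption.

def replication_cost(gb_replicated, price_per_gb=0.02):
    """Cost in dollars for the replicated volume at the given tariff."""
    return gb_replicated * price_per_gb

print(replication_cost(5000))  # dollars per period at the assumed tariff
```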
MILP Model Formulation
- Assumptions
- System Model
- Cost Models
- CACP Model
- Example for Working of the CACP Model
- Complexity Analysis
In this case, the data is replicated from a primary data center to the nearest data center. Power consumption: let θ_hs denote the electricity price at data center location s in hour h of the day. With the MS model, the same number of servers is allocated at every data center location.
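The power-consumption cost term built on θ_hs can be sketched as below; the price matrix, per-site server counts, and per-server power draw are illustrative assumptions:

```python
# Hourly energy cost: sum over hours h and sites s of
# theta[h][s] * servers[s] * p_server, with theta in $/kWh and
# p_server the assumed kW drawn per active server.

def energy_cost(theta, servers, p_server):
    """Total cost = sum_h sum_s theta[h][s] * servers[s] * p_server."""
    return sum(theta[h][s] * servers[s] * p_server
               for h in range(len(theta))
               for s in range(len(servers)))

theta = [[0.10, 0.12], [0.08, 0.15]]      # 2 hours x 2 sites, $/kWh
print(energy_cost(theta, [100, 50], 0.3))  # 0.3 kW per server (assumed)
```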
Numerical Results
- System Parameters
- Results
We also studied how the choice of data center locations affects the TCO under the CACP model. The CACP model without replication costs lowers the TCO by about 20% compared to the model with replication. In the plot, CACP-w/ ft and CACP-w/o ft indicate the TCO achieved by the CACP model with and without fault tolerance, respectively.
Conclusion
In this chapter we have assumed that a data center at a site fails completely and that the data centers are powered only by brown energy sources. In the next chapter, we consider a general failure model (partial and complete failure at a site) and data centers co-located with renewable energy sources. For such GDCs, we aim to determine the optimal server distribution that minimizes the total cost while maximizing the use of renewable energy.
Introduction
An interesting aspect of capacity provisioning in green data centers is that the cost of powering servers depends strongly on the spatio-temporal variation in electricity price, green energy availability, and user demand (which varies over the day). Although renewable energy production is highly intermittent, it becomes more predictable as the number of renewable energy sources connected to the grid across multiple locations increases [74]. This is due to geographical diversity and the law of large numbers: aggregating a large number of geographically distributed renewable sources makes total renewable availability predictable with a certain degree of accuracy [27, 28].
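The geographic-diversity argument can be illustrated with a small synthetic simulation (the uniform output model is purely an assumption for illustration): the relative variability of aggregate output falls as more independent sites are pooled.

```python
# Coefficient of variation (std/mean) of aggregate renewable output
# over synthetic hours, as a function of the number of pooled sites.
import random
import statistics

def aggregate_cv(n_sites, n_hours=10_000, seed=1):
    """CV of total output when n_sites independent sources are pooled."""
    rng = random.Random(seed)
    totals = [sum(rng.uniform(0, 100) for _ in range(n_sites))
              for _ in range(n_hours)]
    return statistics.pstdev(totals) / statistics.fmean(totals)

# Variability shrinks roughly as 1/sqrt(n_sites).
print(aggregate_cv(1) > aggregate_cv(4) > aggregate_cv(16))  # True
```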
MILP Framework
- System Setup and Assumptions
- Definitions and Cost Model
- Optimization Problem Formulation
G_hs is the actual renewable energy generated (both wind and solar) at data center s during hour h. For each data center s, hour h, and failed data center f, we denote by GU_sfh the renewable energy used, by GS_sfh the renewable energy sold (net metering), and by Z_sfh the energy supplied to the battery (when Z_sfh > 0) or drawn from it (when Z_sfh < 0). The battery level at any data center s during hour h, after data center f has failed, is given by B_sfh = B_sf(h-1) + Z_sfh.
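The battery-level recursion implied by these definitions can be sketched directly (scenario indices s and f are dropped for brevity, and the initial level is an assumption):

```python
# Battery level B[h] = B[h-1] + Z[h]: Z > 0 charges the battery,
# Z < 0 discharges it; b0 is the assumed initial level.

def battery_levels(z, b0=0):
    """Hour-by-hour battery levels for a sequence of net flows z."""
    levels, b = [], b0
    for z_h in z:
        b += z_h  # charge (z_h > 0) or discharge (z_h < 0)
        assert b >= 0, "battery cannot go below empty"
        levels.append(b)
    return levels

print(battery_levels([5, -2, 3]))  # [5, 3, 6]
```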
Numerical Results
- Experimental Setup
- Results
We see that the TCO with the GCACP model decreases as the number of data centers increases, due to demand multiplexing and the variation in electricity prices and renewable energy availability. In contrast, as seen in Fig. 4.2, the cost increases slightly with the number of data centers under the MS model. We conclude that the GCACP model is beneficial when not all data centers operate at peak utilization.
Conclusion
However, even if a data center fails completely, the GCACP solution yields about 19% and 46% lower cost than the MS and CDN models, respectively. We conclude that under partial failure the GCACP model is advantageous, owing to the lower spare capacity requirement and the lower brown energy consumption (since the available renewable energy remains the same). Considering that data center operators try to gradually increase their use of renewable energy, we next model the problem of provisioning spare capacity to meet a target green energy usage at minimum cost, when the data centers are powered by a combination of brown and green energy sources.
Introduction
We observe that partial data center failures are frequent, while complete data center failures are rare (perhaps once every two years) [80]. We provision spare capacity in the data centers so that demand is met even after a site failure (partial or complete) while minimizing the TCO. The model's solution gives the optimal distribution of servers across sites and the demand distribution that minimizes the TCO.
Optimization Model
- System Architecture
- System Model
- Cost Model
Each data center consolidates its workload to keep power consumption proportional to the load. Simultaneous failures at more than one data center are assumed to be avoided through the choice of locations. Let λ_afhsu denote the number of requests mapped from client region u to data center s (s ∈ S) at hour h for application type a, given that data center f has failed.
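The consolidation idea can be sketched as follows: only enough servers to carry the current load stay active, so power tracks demand. The per-server capacity and power figures are illustrative assumptions:

```python
# Workload consolidation: pack the load onto the fewest servers so
# that power consumption scales with demand.
import math

def active_power(load_rps, cap_per_server=100, p_server=300):
    """Watts drawn when load_rps requests/s run on the minimum
    number of servers (assumed capacity and wattage per server)."""
    active = math.ceil(load_rps / cap_per_server)
    return active * p_server

print(active_power(250))   # 3 active servers -> 900 W
print(active_power(1000))  # 10 active servers -> 3000 W
```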
Numerical Results
- TCO Comparison
- Impact of Failure Percentage
- Impact of Demand
- Impact of Latency
- Sensitivity Analysis
In this experiment, we investigated the impact of varying the failure rate on the TCO while meeting the green energy usage bound. The TCO with the GACED model increases with demand due to the green energy consumption constraint. In a further experiment, we investigated the effect of relaxing the latency bound on the TCO.
Conclusion
The proposed model optimally plans the demand taking into account the availability of green energy and its price change to minimize TCO. We conclude that with an appropriate model, green energy integration lowers the design cost of fault-tolerant GDCs with reduced carbon footprints. Therefore, we expect our work to help data center operators make informed capacity planning decisions in the presence of the green energy usage objective and variation in electricity prices, demand, and failure rates.
Introduction
Load balancing and resource management in distributed systems have been addressed by non-cooperative game theory in previous works such as [81] and [82]. The problem is formulated as a non-cooperative game between users who try to minimize the expected response time of their own tasks. For the first time, we model load balancing in GDCs as a non-cooperative game between front-end brokers.
System Model
Data center energy consumption includes three components: the power consumed by idle servers, given by m_s·P_idle; the power consumed by servers operating at utilization η, given by m_s(P_peak − P_idle)η; and a third term captured in Eq. (6.7). Given the load λ_su and the electricity price per unit ρ_s, the cost arising from the consumed energy in the data center (also called the operating cost in this chapter) is given by ρ_s[m_s·P_idle + m_s(P_peak − P_idle)η]. The cost incurred due to the delay experienced by the requests λ_su of client region u at data center s is denoted by ∆_su.
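The two power terms named in the text, and the resulting operating cost, can be sketched directly; the P_idle/P_peak figures and the price ρ are illustrative assumptions:

```python
# Idle term m_s * P_idle plus dynamic term m_s * (P_peak - P_idle) * eta,
# then operating cost = electricity price * energy over one hour.

def dc_power_kw(m_s, eta, p_idle=0.15, p_peak=0.30):
    """Power (kW) of m_s servers at utilization eta (assumed per-server
    idle/peak draws of 150 W and 300 W)."""
    return m_s * p_idle + m_s * (p_peak - p_idle) * eta

def operating_cost(m_s, eta, rho):
    """rho: electricity price in $/kWh; one-hour horizon."""
    return rho * dc_power_kw(m_s, eta)

print(round(operating_cost(1000, 0.5, 0.10), 2))
```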
Load Balancing as a Non-cooperative Game
The Nash equilibrium of the above load-balancing game is a load-balancing strategy λ such that no front-end proxy u can lower its own cost by unilaterally changing its strategy. The following statement defines a player's best-response strategy, i.e., the solution to the best-response problem. Feasibility: among the three constraints of the optimization framework, the stability constraint (Eq. 6.25) is always satisfied by the Nash equilibrium solution, due to Eq. (6.21) and the fact that the total computing capacity of the data centers exceeds the cumulative client demand.
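A toy best-response iteration in the spirit of the game described above: each proxy in turn re-routes its demand to minimize its own cost given the others' current choices, stopping when no proxy changes. The linear congestion cost below is an illustrative assumption, not the thesis model:

```python
# Best-response dynamics for a toy load-balancing game: cost of a site
# is its price plus the load already routed there.

def best_response(my_load, other_load, prices):
    """Index of the site with the lowest marginal cost for this proxy."""
    costs = [p + other_load[s] + my_load for s, p in enumerate(prices)]
    return min(range(len(prices)), key=lambda s: costs[s])

def nash_iterate(loads, prices, rounds=20):
    choice = [0] * len(loads)          # every proxy starts at site 0
    for _ in range(rounds):
        changed = False
        for u, lu in enumerate(loads):
            other = [sum(loads[v] for v in range(len(loads))
                         if v != u and choice[v] == s)
                     for s in range(len(prices))]
            new = best_response(lu, other, prices)
            changed |= (new != choice[u])
            choice[u] = new
        if not changed:                # no profitable deviation: Nash
            break
    return choice

# Two equal proxies, two identical sites: they split, one per site.
print(nash_iterate([10, 10], [0, 0]))  # [1, 0]
```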
A Distributed Load Balancing Algorithm
Numerical Results
- System Setup
- Results
- Impact of Demand
- Impact of β
- Client Latency
- Convergence of NCG Algorithm
Demand from the different client locations is proportional to the number of Internet users in each region [67]. We vary the number of data centers in the system from 6 to 10 and investigate the impact on cost and fairness. To determine the number of iterations required for NCG convergence, we consider two scenarios.
Summary
Introduction
A cost-aware load balancing strategy operating in the presence of failures should select a new data center for request rerouting considering the renewable energy consumption requirements, electricity costs, and QoS requirements. For a scalable load balancing system, we propose a data center-initiated, distributed load balancing strategy that satisfies post-failover QoS requirements while minimizing the operating cost. We model load balancing in fault-tolerant data centers as a linear program (LP) that minimizes both the energy consumption cost and the client latency, even after a failure.
Problem Formulation
- System Model
- Optimization Model
Decision variables: we define λ_ufs as the number of requests mapped from client region u to data center s. We assume that among the m_s servers in a data center, m_fs are activated after data center f fails. Φ_ufsi indicates the workload of client region u served at data center s by consuming energy of type i, after data center f has failed.
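A greedy simplification of this formulation (with hypothetical numbers): after data center f fails, route the demand to the cheapest surviving sites subject to their capacities. With site-only costs this greedy fill is optimal; the full LP additionally handles per-pair latency terms:

```python
# Cheapest-first rerouting after a site failure: fill surviving sites
# in order of increasing cost until all demand is placed.

def reroute(demand, capacity, cost, failed):
    """Requests assigned to each surviving site (0 for the failed one)."""
    alloc = [0] * len(capacity)
    order = sorted((c, s) for s, c in enumerate(cost) if s != failed)
    left = demand
    for c, s in order:
        take = min(left, capacity[s])
        alloc[s], left = take, left - take
        if left == 0:
            break
    assert left == 0, "insufficient surviving capacity"
    return alloc

# Site 0 fails; 150 requests go to the cheaper of the two survivors first.
print(reroute(150, [100, 100, 100], [0.12, 0.08, 0.10], failed=0))
```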
Distributed Load Balancing Algorithms
- Shift Workload Algorithm
- Request Re-routing Algorithm
- Time Complexity Analysis
The workload to be migrated to other data centers is then estimated. Next, the data centers that can absorb the load of the failed data center are selected based on their power-consumption cost. All demand previously served by the failed data center is thus reassigned to servers in the remaining data centers.
Numerical Results
- Experimental Setup
- Results
The performance metric considered is the normalized average energy cost relative to the peak energy cost in both solutions for a given scenario. For each hour, we distribute the demand among different customer locations in proportion to the number of Internet users at each location [67]. It can be seen that as the number of data centers increases, energy costs decrease due to more options available to take advantage of demand multiplexing and power price variation.
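The normalization behind this metric can be sketched in a few lines: each hourly energy cost is divided by the peak hourly cost in the same scenario (the cost values below are illustrative):

```python
# Normalized average energy cost: divide each cost by the scenario's
# peak cost, so the metric lies in (0, 1].

def normalized_costs(costs):
    """Costs scaled so that the peak hour maps to 1.0."""
    peak = max(costs)
    return [c / peak for c in costs]

print(normalized_costs([45.0, 90.0, 60.0]))  # peak hour -> 1.0
```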
Conclusion