As any output-oriented person will know, losing vital services that can hinder core business processes is costly, to both the top line and the bottom line. Imagine not being able to process hundreds of transactions simply because there is no power; millions in earnings would be lost in a short period. That affects the top line, and with costs remaining the same or even growing, the bottom line is heavily affected as well.
IT has become so ingrained into modern business that all industries—financial, banking, telecommunications, ISP, cloud services, and others—have relied on data centers, making them evolve into a monetized commodity and no longer just a support system for core business functions. The data center has become essential; many companies no longer use it to simply support the internal organization but have customers who pay premiums for access to a variety of IT applications and services.
Because of this reliance on IT systems, the connection between TCO and data center availability has become stronger than ever. A single downtime event now has the potential to severely impact the profitability of a company or, even worse, its viability in the eyes of future customers. Availability and uptime have become such a given and so essential to customers that downtime is now seen as almost unforgivable. This is because, as mentioned earlier, customers do not understand the complexity and work involved in keeping data centers online and available. Unfortunately, it seems that a lot of executive-level personnel do not understand this either. They do not understand the frequency and cost of data center downtime, which skews the way they view optimization efforts: initiatives that could save money in the long run are passed over simply because of their capital costs.
A study conducted by Emerson Network Power and the Ponemon Institute in 2011 showed the differences in perception between C-suite or C-level executives and the rank-and-file IT staff. The study found that 71 percent of senior executives believe that their company’s business model depends heavily on its data centers in order to generate revenue, while only 58 percent of the rank-and-file IT staff do. This shows a huge disconnect between how executives and the IT staff view the value of the data center. Because of this, IT staff might view their responsibilities in a lesser light and might not perform as well as they could.
It is often a good idea to follow industry white papers and research just to keep up with technology trends. Who knows what precious little gems you might find.
The same study showed that 62 percent of senior-level respondents believed that unplanned outages happened only sparingly, while 41 percent of the rank and file agreed.
This means that most IT staff members know that unplanned outages are happening frequently, even if those outages are not severe, total ones. This data shows that C-level executives understand the economic importance of their company’s data center operations, which is not surprising given that it is their core responsibility to understand all the resources and economic facets of their organization and put them to good use. However, it also indicates that these same executives are not as aware of the day-to-day operations of the data centers as those who are assigned to maintain the organization’s IT infrastructure. As such, they are not aware of the frequency of downtime and the other vulnerabilities of their IT infrastructure that may be contributing to these events. And because the rank-and-file IT staff may be unaware of the huge role the data center plays in the organization’s economic standing, they are probably not proactive enough to report problems and recommend improvements in the infrastructure.
Downtime has associated costs, no matter how short it is. Outages don’t happen because someone just flipped the off switch for a second or two. Downtime always has a cause, and more often than not, this cause is a hole that needs to be patched up, which translates to spending money.
Quantifying Downtime Costs
According to the study, the average cost of downtime is $5,600 per minute, and based on an average reported downtime length of 90 minutes, the total is a whopping $505,500 per downtime event. The calculated costs are a combination of many factors:
■ Corruption and data loss
■ Equipment damage
■ Productivity loss
■ Root-cause identification and recovery
■ Legal repercussions
■ Revenue loss due to failed transactions
■ Even long-term effects on reputation
Direct costs accounted for only about a third of the perceived costs; indirect costs and opportunity loss made up 62 percent of the reported costs. What’s surprising is that only about $9,000 per downtime event is attributed to equipment costs, which means that the intangible costs down the road far outweigh the cost of maintenance after a downtime event.
A typical data center in the United States experiences an average of two downtime events over the course of two years, so the costs incurred by downtime events can easily go beyond $1 million in less than two years. These downtime events are not even limited to total outages but include low-level failures and per-rack outages. For enterprises that depend solely on their data centers to deliver IT and networking services for revenue, such as cloud service providers, telecommunications providers, and e-commerce organizations, a single downtime event can be even more costly. The study showed it to be over $1 million per event, which is more than $11,000 per minute. The study also mentions that the most recent downtime events of the 41 participating data centers cost a combined $20,735,602. That is a substantial loss for events that each lasted only around an hour.
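To make the arithmetic concrete, the per-event figure is essentially the per-minute rate multiplied by the outage duration. The following back-of-the-envelope sketch, written in Python with the study’s rounded figures, simply illustrates that multiplication; the small gap between the product and the reported $505,500 average presumably comes from rounding in the per-minute rate.

# Back-of-the-envelope downtime cost arithmetic using the study's rounded figures.
AVG_COST_PER_MINUTE = 5600         # average cost of downtime, dollars per minute
PROVIDER_COST_PER_MINUTE = 11000   # approximate rate for revenue-critical providers
AVG_DURATION_MINUTES = 90          # average reported outage length

avg_event_cost = AVG_COST_PER_MINUTE * AVG_DURATION_MINUTES            # ~$504,000
provider_event_cost = PROVIDER_COST_PER_MINUTE * AVG_DURATION_MINUTES  # ~$990,000

# A typical U.S. data center averages two such events over two years.
two_year_cost = 2 * avg_event_cost                                     # ~$1,008,000
print(avg_event_cost, provider_event_cost, two_year_cost)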
Infrastructure Vulnerability Cost
Downtime events occur for a reason, and most of them can be prevented. Catastrophic events are the least of a data center’s worries; they are something that no one can really prepare for, only mitigate. The largest cause of data center downtime is related to power. In fact, about 39 percent of the time the power infrastructure will be the cause of a costly downtime event. That is why it is important to nail down the power infrastructure during the planning process, which includes deciding on the location of the data center itself and determining whether there are adequately reliable power utility companies serving the area.
Of the data center failures related to power, UPS failures are the costliest at around $687,700 per event, with generator failure second at $463,890. Power failure is thus the most costly cause of data center downtime. A power failure, especially in the UPS and backup generators, would mean a total and catastrophic data center outage. Lights out!
It would cause indirect costs (such as lost opportunities) to pile up on top of direct costs related to maintenance and repairs. The opportunity loss is a big one because it affects all stakeholders and not just a portion of the data center’s customer base.
Most tier I and II data centers do not implement the technology to isolate a power system failure, so making investments to minimize the impact of a power failure on your data center is the most relevant course of action. There are no lean process improvements or cuts that can be made here, only capital expenditure on solutions like power redundancy, meaning backups, backups, and more backups. Implementing redundancy means that operators have the capacity to eliminate single points of failure in the facility’s power infrastructure.
Equipment fails over time, but with a reasonable amount of redundancy, there should always be a backup that would kick in in times of need. The data center can run on its backups while the failed main module is being replaced, thus preventing a total catastrophic failure. Direct costs would still be incurred, of course, for repair and replacement of failed power modules or systems. One form of redundancy is having multiple power paths, but this would also mean multiple grid providers, which is a commodity not available in all areas.
Adding UPS systems for redundancy should be at the top of the list of solutions. The long-term reliability of the UPS system should be the key element in determining what sort of UPS system to implement. With input current tolerances, battery monitoring, and adequate service maintenance, the mean time between failures (MTBF) of a UPS unit can be maximized. Finally, integrating a comprehensive infrastructure monitoring tool can help operators isolate and rectify power infrastructure issues and prevent a total catastrophic failure of the data center due to a power outage.
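For reference, MTBF is simply accumulated operating time divided by the number of failures observed over that time. The short Python sketch below uses made-up numbers purely to illustrate the calculation; none of the figures come from the study.

# Illustrative MTBF calculation (hypothetical numbers).
def mtbf_hours(total_operating_hours, failure_count):
    """Mean time between failures = accumulated operating time / number of failures."""
    return total_operating_hours / failure_count

# Example: 10 UPS modules each running 8,760 hours (one year), with 2 failures
# across the fleet, gives an MTBF of 43,800 hours.
print(mtbf_hours(10 * 8760, 2))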
Amazon EC2 Outage
On June 29, 2012, users of the Internet had firsthand experience with downtime brought about by infrastructure vulnerability. By 11:21 p.m. EST, a large portion of the Internet was offline because of massive power outages brought on by a severe thunderstorm in northern Virginia, where one of Amazon’s biggest EC2 data centers is located. This affected only a single availability zone, but that zone contained widely popular media and social media sites, including Netflix, Instagram, and Pinterest. Other sites were affected as well, but the outage was immediately sensationalized because of the popularity of the aforementioned sites. Amazon reported that it had resolved the majority of the issues by 1:15 a.m., and all affected sites were brought back up a few hours later. Overall, the outage lasted less than 12 hours, but the amount of attention it garnered was staggering.
Since Amazon cloud services reach every corner of the Internet, when they fail, everyone feels it. But what about the promise of no downtime with cloud computing? Well, that promise holds only if it is explicitly applied to the uploaded application or website by the application or website provider. Amazon makes it easy for users to run their AWS workloads across availability zones and provides various redundancy measures. This simply means that the affected websites did not apply an important feature of cloud computing, and it shows that users are still not utilizing the cloud to its full potential.
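As a rough illustration of what running a workload across availability zones can look like, the following sketch uses the boto3 library to start one copy of an instance in each of two zones. The AMI ID, instance type, and zone names are placeholders, and this is a minimal sketch of the idea rather than a complete multi-AZ architecture.

# Hypothetical sketch: spread identical EC2 instances across two availability
# zones so that a single-zone outage does not take the whole service down.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

for zone in ["us-east-1a", "us-east-1b"]:       # two independent zones
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",        # placeholder AMI ID
        InstanceType="t3.micro",                # placeholder instance type
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},   # pin this copy to one zone
    )

In practice, a load balancer or an Auto Scaling group spanning the zones would sit in front of these copies; the loop simply makes the zone placement explicit.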
Outages Caused by Environmental Reasons
Environmental issues are a leading cause of IT equipment failures. In fact, 15 percent of all the root causes of downtime found by the Emerson study can be attributed to an environmental variable such as thermal issues or water incursion. Detection and recovery from these failures also incurred significant costs, at an average of more than $489,000 per incident. And when these environmental issues caused actual equipment failure, the result was the highest overall cost, at more than $750,000, because expensive equipment has to be replaced in addition to the cost of manpower and the further downtime associated with the procedure.
The problem with environmental issues is that they can cause a chain reaction of IT equipment failures, which would require extensive efforts for detection and recovery of the issue that caused the outage, not to mention replacement of the equipment. The fact that cooling equipment does not even need to fail to cause an IT equipment failure is worrisome. It shows a deeper problem within the cooling infrastructure itself. These isolated failures, which are typically caused by hotspots within the racks themselves, are often the result of inadequacies of the cooling infrastructure rather than a cooling equipment failure.
The failure was eventually going to happen; it was only a matter of time. And if this goes unaddressed, it will happen again in the same location, with the replacement equipment.
This just further illustrates the importance of an optimized and well-designed cooling infrastructure.
Even though environment- and cooling-related incidents are more isolated than power-related failures, an optimized cooling infrastructure with remote monitoring is still required to keep IT equipment performing at its optimum and to make use of it throughout its full, rated life span, thereby minimizing environment-related failures.
Fortunately, all of this can be avoided by not skimping on the cooling infrastructure during design and construction, but if it is too late for that, upgrades or installation of a better system can help, although that may be a bit costly.
Here are some best practices and investments that can be applied to a data center with less-than-adequate cooling solutions:
■ Bring precision cooling closer to the load through row-based precision cooling solutions to eliminate rack hotspots.
■ Prevent water incursion by using refrigerant-based solutions rather than water-based ones.
■ Install comprehensive remote monitoring facilities that have temperature triggers and thresholds to warn operators of possible cooling issues (see the sketch after this list).
■ Do regular preventive maintenance, and do not wait for something to fail before repair or replacement. This could mean the difference between catastrophic failure and no failure at all.
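Of these, the monitoring practice is the easiest to picture in code. The minimal sketch below assumes a monitoring system that can report per-rack inlet temperatures as a simple mapping; the thresholds and rack names are illustrative only.

# Hypothetical temperature-threshold check for per-rack inlet temperatures.
WARN_C = 27.0      # warning threshold in degrees Celsius (illustrative)
CRITICAL_C = 32.0  # critical threshold in degrees Celsius (illustrative)

def check_rack_temps(readings):
    """readings: dict mapping rack ID to inlet temperature in Celsius."""
    alerts = []
    for rack, temp in readings.items():
        if temp >= CRITICAL_C:
            alerts.append((rack, temp, "CRITICAL"))
        elif temp >= WARN_C:
            alerts.append((rack, temp, "WARNING"))
    return alerts

# A developing hotspot in rack-03 is flagged before equipment actually fails.
print(check_rack_temps({"rack-01": 24.5, "rack-02": 28.1, "rack-03": 33.0}))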
Though these are rather basic and common-sense solutions, you will be surprised how often they are put aside to cater to an executive’s idea of saving money in the short term.
Foresight is always more important than patching symptoms of current problems.