CPU wait time is the time the CPU spends waiting for a certain process to finish instead of processing something else. This usually happens because the CPU is waiting on an essential part of the process it is currently working on, such as a synchronization or I/O operation. This is why I/O operations must be kept at a manageable level.
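As a concrete illustration, here is a minimal sketch of measuring how much CPU time is being lost to I/O wait. It is Linux-only and assumes the standard /proc/stat field layout; a full monitoring tool would do far more than this.

```python
# A minimal sketch (Linux-only) of sampling CPU I/O wait time from /proc/stat.
# Assumed field layout: user nice system idle iowait irq softirq steal ...
import time

def read_cpu_times():
    """Return the aggregate CPU time counters from the first line of /proc/stat."""
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]
    return [int(x) for x in fields]

def iowait_percent(interval=1.0):
    """Sample twice and report the share of CPU time spent waiting on I/O."""
    first = read_cpu_times()
    time.sleep(interval)
    second = read_cpu_times()
    deltas = [b - a for a, b in zip(first, second)]
    total = sum(deltas)
    iowait = deltas[4]  # fifth counter is iowait
    return 100.0 * iowait / total if total else 0.0

if __name__ == "__main__":
    print(f"CPU I/O wait over the last second: {iowait_percent():.1f}%")
```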
Impact of Configuration Changes
Configuration management is an important aspect of any system, especially one where downtime is not tolerated. Configuration drift and unauthorized configuration changes are said to account for more than 70 percent of IT service outages. Improper configuration also contributes to performance issues and degradation and, as a result, impacts IT productivity and budget.
Impacts of Unauthorized Changes
Change management is essential to foster a consistent and robust system, and the first step is to ensure that all configuration changes are known, planned, and approved. What happens when this essential aspect is not addressed? Take, for example, a firmware development organization that develops firmware for automobile computers. The computers are connected to various sensors that provide them with information so that they can adjust various performance characteristics of the car in real time. The organization uses the C language and a compiler specifically configured for creating its firmware. Now it just so happens that the company is pushing for new features that require new code that has not been used before. Unfortunately, the new implementation does not compile with the current compiler configuration. Believing that the code is already perfect and that the problem lies with the compiler configuration not being compatible with the new code, the build engineer tweaks the compiler to allow the build to finish, without notifying management. The code compiles, and the firmware is delivered and functions without a hitch. There are no problems until a new update is needed. Combined with the older code, the newer code in the update, compiled with the tweaked compiler, creates situations that cause the computer to malfunction.
By this time the build engineer has long since left the company, so it takes the developers a very long time to pinpoint the cause, simply because the change the engineer made to the compiler was never documented.
Configuration tracking is an approach in which special software is employed to track the various settings and configuration changes that happen across virtual and physical environments. Understanding what sort of impact a certain change will bring is sometimes quite difficult, but it starts with authorization. If a change was not authorized, chances are it is an error or a breach in progress. But even when changes have been assessed and approved, understanding their potential impact is still a constant challenge. It is in this area of uncertainty that IT services sometimes lose control, and this breeds an environment that is vulnerable to security breaches, service disruptions, and compliance failures, as well as outages.
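As a simplified illustration of the idea, the hypothetical sketch below hashes a handful of configuration files and compares them against an approved baseline to flag drift. The tracked paths and the baseline file name are assumptions for the example; real configuration tracking tools add versioning, approval workflows, and rollback on top of checks like this.

```python
# A minimal, hypothetical sketch of configuration tracking: hash a set of
# configuration files and compare them against a stored baseline to flag drift.
import hashlib
import json
from pathlib import Path

TRACKED_FILES = ["/etc/ssh/sshd_config", "/etc/fstab"]  # illustrative targets
BASELINE = Path("config_baseline.json")                 # illustrative location

def fingerprint(path):
    """Return the SHA-256 digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def save_baseline():
    """Record the current, approved state of every tracked file."""
    BASELINE.write_text(json.dumps({p: fingerprint(p) for p in TRACKED_FILES}))

def detect_drift():
    """Report any tracked file whose hash no longer matches the baseline."""
    baseline = json.loads(BASELINE.read_text())
    return [p for p in TRACKED_FILES if fingerprint(p) != baseline.get(p)]

if __name__ == "__main__":
    drifted = detect_drift()
    print("Unapproved/undocumented changes:", drifted or "none detected")
```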
An organization must have a firm grasp of the configuration of its system and must balance dynamic business needs and initiatives without losing control. The formula for doing this is found in the section “Configuration Management” in Chapter 4, “Cloud Management.”
Common Issues
In a system with hundreds of different components that range from physical hardware to the various firmware and software applications essential to the system functioning properly, one of them is bound to eventually fail. Software is easy to fix through configuration or reinstallation, but hardware must always be replaced outright when it fails. Here are some common hardware failure issues.
Disk Failure Disk failure is one of the most common types of failure you will see in a data center environment, especially one with vast storage area networks (SANs) used to keep up with customer database demand. It is common because the HDD is composed of mechanical parts that are very small and moving at extremely high speeds. Because of their size, the mechanical components cannot be made absolutely durable, so the mechanics are a common failure point for any hard drive. Electronic components such as the controller can fail as well.
A disk that is about to fail will often show obvious symptoms: bad sectors despite proper file system management and maintenance, slower I/O speeds than expected, or shutting down and then restarting. A drive could fail and then, after a reboot, run again for a while; this is a sure sign that the drive will fail completely in the near future. Constant backups and RAID configurations can mitigate the resulting impact, but they can never stop the inevitability of a disk failure.
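One way to watch for these early symptoms on a Linux host is to poll a drive's SMART attributes. The hedged sketch below assumes the smartmontools package is installed, that the script runs with sufficient privileges, and that /dev/sda is the drive of interest.

```python
# A hedged sketch of watching for early disk-failure symptoms by reading SMART
# attributes with smartctl (assumes smartmontools is installed and the script
# runs with sufficient privileges; /dev/sda is an example device name).
import subprocess

# Attributes whose nonzero raw values commonly precede mechanical failure.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")

def smart_warning_values(device="/dev/sda"):
    """Return the raw value (as reported) for each watched SMART attribute."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=False).stdout
    values = {}
    for line in out.splitlines():
        parts = line.split()
        if len(parts) > 2 and parts[1] in WATCHED:
            values[parts[1]] = parts[-1]  # raw value is the last column
    return values

if __name__ == "__main__":
    for attr, raw in smart_warning_values().items():
        flag = "ok" if raw == "0" else "WARNING - plan to replace the drive"
        print(f"{attr}: {raw} ({flag})")
```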
HBA Failure Host bus adapters are pretty much like network interface controllers; they are separate modules attached to the expansion slots of a computer system. The HBA provides I/O processing as well as physical connectivity between a server and a storage device or cluster of devices, offloading processing from the CPU so that it can concentrate on other matters.
This is a crucial point of failure because if a host bus adapter fails during important I/O operations, the data could be lost. Unfortunately, there is no exact way of telling when an HBA will fail because most electronic devices simply fail without warning, but some exhibit signs such as intermittent disconnection from attached storage devices or I/O operations being dropped or taking longer than normal.
Memory Failure Main memory is one of the core components of a computer system.
Everyone in IT knows that RAM performance is key to system performance and that memory failure is simply not an option. Disks can be configured for backup and redundancy, but there are no such options for memory. A memory failure can cause an entire system to crash because the failed memory module may contain important data that is being used by the system or its components. A computer will not even start when there is a defective memory module attached to it. So it is imperative to always check a system's memory, and signs of failure must be detected ahead of time to prevent costly and untimely downtime.
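On Linux servers with ECC memory, one way to catch trouble ahead of time is to watch the corrected/uncorrected error counters exposed by the EDAC driver. This minimal sketch assumes an EDAC-capable kernel and ECC DIMMs; on other systems the paths simply will not exist.

```python
# A minimal Linux-only sketch for spotting memory trouble early: read the
# corrected/uncorrected error counters exposed by the EDAC driver (assumes
# ECC memory and an EDAC-capable kernel).
from pathlib import Path

def memory_error_counts():
    """Return (corrected, uncorrected) error totals across all memory controllers."""
    corrected = uncorrected = 0
    for mc in Path("/sys/devices/system/edac/mc").glob("mc*"):
        corrected += int((mc / "ce_count").read_text())
        uncorrected += int((mc / "ue_count").read_text())
    return corrected, uncorrected

if __name__ == "__main__":
    ce, ue = memory_error_counts()
    print(f"Corrected errors: {ce}, uncorrected errors: {ue}")
    if ce or ue:
        print("Schedule a maintenance window and test/replace the affected DIMMs.")
```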
NIC Failure The network interface card (NIC) is a computer system's gateway to the network and beyond. It is the main communication interface and important for a distributed system that is supposed to be accessible from anywhere in the world. However, it is also a fault-tolerant component. Losing a NIC would mean losing connectivity, but that does not involve system failure. There would be network downtime and the server might not be accessible, but it is an easily containable and preventable failure through the use of NIC teaming/bonding or link aggregation. It is certainly not as fatal as memory or CPU failure.
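As an illustration of that fault tolerance, the Linux-only sketch below checks the health of a bonded NIC pair by parsing /proc/net/bonding; the bond name bond0 is an assumption for the example.

```python
# A hedged sketch of checking NIC redundancy on a Linux bonding/teaming setup:
# parse /proc/net/bonding/<bond> and warn if any slave link is down.
from pathlib import Path

def bond_link_status(bond="bond0"):
    """Return a list of (interface, MII status) pairs for the bond's slave NICs."""
    text = Path(f"/proc/net/bonding/{bond}").read_text()
    status, current_iface = [], None
    for line in text.splitlines():
        if line.startswith("Slave Interface:"):
            current_iface = line.split(":", 1)[1].strip()
        elif line.startswith("MII Status:") and current_iface:
            status.append((current_iface, line.split(":", 1)[1].strip()))
            current_iface = None
    return status

if __name__ == "__main__":
    for iface, mii in bond_link_status():
        print(f"{iface}: {mii}")
        if mii != "up":
            print(f"  -> {iface} is down; the bond is running without redundancy")
```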
CPU Failure The central processing unit (CPU) is the brain of the computer, hence the word central. CPU failure would mean utter and total failure. A CPU failure is one of the worst kinds of failure, in terms of cost and lost productivity, that can occur in your system.
It means a total shutdown of the system, and most operations will be nonrecoverable. The CPU is also one of the most expensive parts of the system and one of the hardest to replace in terms of installation. Unlike a NIC, HBA, or disk, which can all simply be plugged into the board or various sockets, the CPU must be completely removed and then replaced.
Summary
This chapter is all about the performance of the infrastructure rather than the virtualized environment of cloud computing. We focused more on the concepts of how most of the hardware parts can perform and fail. The most prominent of these parts is the disk drive, which is incidentally also the largest bottleneck of the system. The speed and performance of the disk drive has hardly improved since the 1990s, but the capacity and affordability of the technology have improved by leaps and bounds. So it is this relatively weak performance that we examined. For a disk, the key performance indicators are its access time and its data transfer rate. The access time is the time it takes for the mechanical parts to position the read/write head on top of the track and sector that contain the data it is looking for. Taken into account are the spindle speed, which rotates the disk and is measured in revolutions per minute, and the lateral movement speed of the actuator arm, which brings the head over the correct track.
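To make the relationship concrete, here is a small illustrative calculation (the 9 ms seek time and 7,200 RPM spindle speed are example figures, not values from the chapter): access time is roughly the average seek time plus the average rotational latency, and rotational latency averages half a spindle revolution.

```python
# Illustrative arithmetic only (example figures, not a specific drive's specs).

def avg_rotational_latency_ms(rpm):
    """Half a revolution, converted from minutes to milliseconds."""
    return (60_000 / rpm) / 2

def avg_access_time_ms(avg_seek_ms, rpm):
    """Seek the correct track, then wait for the sector to rotate under the head."""
    return avg_seek_ms + avg_rotational_latency_ms(rpm)

print(f"{avg_rotational_latency_ms(7200):.2f} ms rotational latency")  # ~4.17 ms
print(f"{avg_access_time_ms(9, 7200):.2f} ms average access time")     # ~13.17 ms
```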
The file system also plays a major role in the performance of a storage system. It is important that the file system be configured specifically, tailor-made if you will, to your purpose, which is cloud and distributed computing. Common file systems used in this area are FFS and NFS. Most operating system users, especially Windows users, take the file system for granted because it is largely invisible to them, so it takes a back seat when enthusiast modders and gamers tweak their computer's performance. But for a service entity that deals almost exclusively in data, the file system is probably one of the most important factors in the service infrastructure. This is where most of the performance in the system can be attained or lost because cloud applications are mostly data related and data dependent.
From the file system, you can control I/O operations and metadata operations, which helps to speed up various processes through proper tuning.
We also discussed various methods that help increase disk and memory performance, such as I/O tuning and disk swapping or paging.
Chapter Essentials
Performance Concepts Although a multitenant cloud system is a diverse system with many capabilities, all of them are tied to data, which is in turn stored on disks. When talking about performance concepts, we often mean the criteria by which we judge a specific system or its constituents. In the case of cloud computing, it is usually the database and the storage system.
File system performance, I/O operations rate of processing, read and write operation speeds, and metadata performance are a few of these performance concepts that an IT professional, not just a cloud expert, should be familiar with.
Disk Performance The disk is kind of like the bread and butter of the enterprise. Since disk performance has not really improved by leaps and bounds in the last decade, we are left with finding the right drives from the right manufacturers and taking advantage of every microsecond of performance we can. We are looking for great performance in access time and data transfer rate, which are determined by the internal mechanical parts of the drive.
Disk Tuning This is the optimization of a disk's performance in conjunction with the other hardware and the software platform that is used to manage it. Obviously you cannot make a disk perform better or faster than it is capable of, unlike processors, which can be overclocked. You can only make disks perform at their best through proper configuration of settings and by tuning the file system itself to work in the best way possible. You must take into account the actual performance of the disks being used, which really comes down to limiting I/O requests to the disks and batching file access and transfers, because the disks are pretty much the biggest bottleneck in any system.
Swap Disk Space The swap disk space is essentially a specially assigned portion of the disk that is being used as a place to temporarily store data from the main memory, known
as pages, so that it can free up memory space for more urgent applications. This process is widely known as paging, and that space is called the page file. The size of the page file is ideally set to 1 percent to 5 percent of the total disk capacity, or it can simply mirror the size of the memory itself, so if the system has 8 GB of memory, then the swap disk space should also be 8 GB.
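The two sizing rules translate into simple arithmetic; the 1 TB disk and 8 GB of RAM below are illustrative figures.

```python
# Simple arithmetic for the two page-file sizing rules described above.

def swap_from_disk(disk_gb, percent):
    """Size the page file as a percentage (1-5) of total disk capacity."""
    return disk_gb * percent / 100

def swap_from_ram(ram_gb):
    """Simply mirror the amount of installed memory."""
    return ram_gb

print(swap_from_disk(1000, 1))  # 10.0 GB at 1 percent of a 1 TB disk
print(swap_from_disk(1000, 5))  # 50.0 GB at 5 percent
print(swap_from_ram(8))         # 8 GB to match 8 GB of RAM
```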
I/O Tuning Moving away from the disks, because their performance is more or less consistent and hardware driven, we set our attention on I/O operations, which by and large dictate the work that the disks do and, to an extent, the CPU wait times that we have to endure. I/O tuning involves working with the correct number of disks and the right amount of memory as well as a lot of scheduling and prioritization. This includes setting buffer times between I/O calls to the disk. For this process, you must first test the actual performance of the disks in terms of read and write speeds and then weigh that against the performance expected of the system; that will determine the number of disks and the settings to put in place.
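As a rough starting point for that baseline, the sketch below times a sequential write and read of a fixed-size test file. It is only an approximation: purpose-built tools such as fio are far more accurate, and OS caching will flatter the read figure.

```python
# A rough, hedged sketch of measuring sequential write/read throughput by
# timing a fixed-size test file written to the current directory.
import os
import time

def measure_throughput(path="io_test.bin", size_mb=256, block_kb=1024):
    block = os.urandom(block_kb * 1024)
    blocks = size_mb * 1024 // block_kb

    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())             # force data to disk, not just the cache
    write_mbps = size_mb / (time.perf_counter() - start)

    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(block_kb * 1024):
            pass
    read_mbps = size_mb / (time.perf_counter() - start)

    os.remove(path)
    return write_mbps, read_mbps

if __name__ == "__main__":
    w, r = measure_throughput()
    print(f"Sequential write: {w:.0f} MB/s, sequential read: {r:.0f} MB/s")
```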
Performance Management and Monitoring Tools In any cloud service model, there are monitoring tools that keep administrators informed about every aspect of the system, whether it is for IaaS, with monitoring of CPU, memory, and disk usage levels on the infrastructure, or for SaaS, with statistics on user and traffic levels as well as application and virtual environment health. Some solutions cover the entire cloud, from the hardware infrastructure down to the individual guest OSs running on the system.
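At the infrastructure level, the raw numbers such tools collect look something like the sketch below, which uses the third-party psutil package (an assumption; it is not part of the standard library). Real monitoring suites add alerting, dashboards, and per-guest metrics on top of data like this.

```python
# A minimal monitoring-loop sketch using the third-party psutil package
# (install with `pip install psutil`).
import time
import psutil

def sample():
    """Collect the basic host-level metrics an IaaS monitor would track."""
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage("/").percent,
    }

if __name__ == "__main__":
    for _ in range(3):          # three samples, roughly two seconds apart
        print(sample())
        time.sleep(1)
```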
Hypervisor Configuration Best Practices The hypervisor is in many ways like an OS, but it is not as complex and simply serves as an abstraction layer between the underlying hardware infrastructure and the virtual environment. Since this layer serves to control both sides of the system, it is essential that all the settings here be just right. For example, the virtual machines that are running do not know they are virtual and assume that they are alone and that all computing resources they have are theirs alone, so they tend to go wild accessing them. It is the job of the hypervisor to, say, limit the number of I/O requests to the actual disk to prevent overusage by one particular guest OS.
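As one hedged example of such a setting, on a KVM/libvirt hypervisor an administrator can cap the IOPS of a single guest disk with virsh blkdeviotune; the domain name, device name, and IOPS figure below are illustrative assumptions, and other hypervisors expose equivalent controls under different names.

```python
# A hedged sketch of the kind of per-guest I/O cap described above, assuming a
# KVM/libvirt hypervisor. "guest01", "vda", and 500 IOPS are illustrative.
import subprocess

def cap_guest_iops(domain="guest01", device="vda", iops=500):
    """Apply a total-IOPS ceiling to one virtual disk of one running guest."""
    subprocess.run(
        ["virsh", "blkdeviotune", domain, device,
         "--total-iops-sec", str(iops), "--live"],
        check=True,
    )

if __name__ == "__main__":
    cap_guest_iops()
```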
Impact of Configuration Changes to the Virtual Environment As discussed in the previous chapter, configuration management is important in keeping a system healthy and free of incompatibilities and security threats. But configuration changes have to be made eventually, and understanding the consequences those changes will bring is the challenging part. You also have to keep monitoring any changes being made to the system, in real time if possible, because unauthorized changes can lead to big problems such as vulnerabilities and even downtime. And in the event of downtime caused by an undocumented and unforeseen change, the root cause is often hard to track.
Common Issues As a conglomeration of various complicated parts, the computer system or infrastructure is a well-oiled machine that works well when all the parts are working and comes to a dead stop when one fails, despite the concept of modularity. In this case, modularity is geared toward physical assembly and not functionality because most of these parts are codependent; without one, the whole does not work correctly or at all. That said, common failure points are the disks (because they feature a lot of moving parts), the NIC, HBA, memory, and of course, the CPU.