DISASTER RECOVERY RISK - Erik Banks and Richard Dunn

People need to work in a familiar environment. Disaster recovery can be defined as the process that needs to be followed when a company loses access to this working environment – head office, one or several floors of a large building, or a smaller branch office. The term “working environment” also needs to be defined broadly. For instance, the building may be accessible but none of the computers might be working as a result of loss of electricity, fire, sprinkler systems getting activated, etc. Disaster recovery is different from dealing with a temporary interruption of business, which we address in Section 5.4.

Conversion to the euro, followed by preparations for “Y2K”, heightened awareness of the need for robust planning for alternative working premises and infrastructure. Thankfully, though there were small “hiccups”, no large-scale activation of these plans was required. By contrast, the events of September 11, 2001 in New York required considerable disaster recovery action. The luckier organizations simply lost access to their premises temporarily, the less fortunate ones lost access to their premises for a prolonged period of time, others lost permanent access and some, tragically, most of their personnel. All of these events provide very fertile ground for examining what can go wrong and what can be done to overcome such problems.

On the morning of September 11 personnel had been evacuated from the Merrill Lynch head- quarters campus (adjacent to the World Trade Center Towers). These buildings housed executive management and investment banking personnel, the key debt and equity trading floors, as well as most of the supporting technology, finance, risk, research and middle-office groups.

The core operations functions were located in Jersey City (across the river from downtown Manhattan). Merrill Lynch’s data centers were split between a location further up on the west side of Manhattan and Staten Island. These two data centers served to back each other up in the

event of one going down. Two new buildings were at the point of coming on line – one in Jersey City and another close to Princeton. Merrill Lynch had also recently acquired a specialist in over-the-counter stock trading which had a trading ﬂoor located one train stop north of Jersey City.

The advent of Y2K had forced the organization to galvanize itself around a real event and much work had gone into producing lists of contacts, home and mobile telephone numbers, communication chains, inventory of technology programs, and so on. However, severe cost- cutting in 2001 had led to the disaster recovery planning budgets in the institutional businesses being halved. Partly as a result of this, on September 10 Merrill Lynch’s institutional business did not have an actionable disaster recovery plan to cope with what was about to happen. In its defense not many organizations had plans that were designed to accommodate the eventuality of two planes destroying the Twin Towers – thereby causing the whole of lower Manhattan to be inaccessible for several weeks and much of its infrastructure to be severely impaired.

On September 12 the operations group (responsible for processing, clearing and settling trades), being based in New Jersey, was essentially intact and ready for work. The equity division ended up squatting in the premises of the newly acquired over-the-counter specialist.

The debt division moved into the new building in Jersey City with technology, ﬁnance and corporate risk management personnel, whilst investment bankers ended up mainly working from home or the Princeton area. Even though people could be physically accommodated somehow, the primary challenge was to make them operational. This required rebuilding of large portions of the communications and technology infrastructure within and between the various Merrill premises and, more importantly, between these premises and the outside world.

The ﬁrm was now at the mercy of a handful of telecommunications and computer/telephonic hardware providers – who were not without problems of their own – and the people who could make this hardware operational.

In addition to restoring the ability for people to operate and communicate there was a need for traders, operational, ﬁnance and risk personnel to know their positions. Some businesses ran truly global infrastructures and were able to immediately switch to databases and personnel based in London or Asia. Most businesses, however, were running very fragmented technology.

Each business in the institutional division possessed its own software programs in support of its activities, and there were very few standards imposed across the entire division. Firm-wide transparency to positions and client dealings across the division were very manually intensive to construct.

Critical trading floor technology hardware and software in New York was located under each trading floor. With the appropriate communications technology it would have been possible to run this technology remotely from the data centers – however, the head-office building subsequently temporarily reached temperatures of over 140^◦F and all computers had to be shut down. In such an event one would usually turn to back-up tapes to reconstruct the positions.

Tapes backing up daily or weekly positions were being created. However, they were being kept under the trading ﬂoor, not removed and stored away from the premises as is common practice. Groups of personnel, led by the vice chairman of the company, were sent in under special authorization from the “Ground Zero” rescue effort to recover thousands of tapes. Once retained it was discovered that for some businesses the discipline of creating back-up tapes was weak and the information was over three months old!

Merrill Lynch thus ended up dependent on people working 24 hours a day, sieving through thousands of tapes. It took a full two weeks before all positions were retrieved and the ﬁnancial records could be re-established. In the meantime markets were very volatile and the ﬁrm was going into quarter-end reporting of earnings. Having learned the lessons of 1998, management

had fortunately instituted a new funding policy that lowered the firm’s dependency on the commercial paper market (and enabled it to fund itself for one year without needing to access the debt markets). Therefore, even if news of these difficulties had spread, Merrill Lynch’s liquidity profile was very secure.

Lehman Brothers, Bank of New York and countless others went through their own versions of this trauma. It would be equally interesting to study the speciﬁcs of each experience. To gain some ﬁrst-hand lessons on disaster recovery, we can consider what went wrong and what went right with the Merrill Lynch trauma.

What went wrong?

r

The underlying premise for disaster recovery did not match the incident.

r

Recovery plans were not actionable.

r

The communications infrastructure plans were not appropriate for the event.

r

Dependency on a few key suppliers and service providers was not well understood.

r

Back-up data tapes were not being stored off the premises.

r

Technology was very fragmented, with few ﬁrm-wide standards.

What went right?

r

Businesses were spread out – in particular, the operations functions and main data centers were not located with the debt and equity businesses, and therefore not affected.

r

The equity business possessed a “similar operating environment” in a separate location.

r

Merrill Lynch had surplus real estate in a condition to be quickly usable.

r

Merrill Lynch had “muscle” with the authorities and suppliers to be able to get help fast.

r

Merrill Lynch thankfully did not lose substantial numbers of critical personnel in the incident.

r

The quality of personnel was excellent: people were creative and worked very hard.

r

The company’s liquidity proﬁle was very secure.

Since it is not possible to plan for all disasters, how does one choose which to plan for? And how does one create disaster recovery rules that can easily be applied by some units but not others?

We feel that a ﬁrm should develop a handful of “common sense” loss control steps that provide a core level of protection for any disaster situation. It is very important that these steps be “actionable” rather than theoretical so that they protect and add value as soon as a crisis strikes. It is pointless to create a theoretical disaster recovery framework that is so complicated and involved that no one can follow it when needed! Examples of these loss control steps would include¹:

r

Diversifying businesses geographically.

r

Splitting front- and back-ofﬁce functions into different physical locations.

r

Segregating data center(s) from the main operating locations.

r

Copying sensitive and relevant electronic information to offsite data centers (every day, without fail!).

r

Archiving original documents offsite.

r

Replicating business origination and trading facilities offsite (on a smaller, but still useful, scale).

1Note that many of these can also be applied to smaller scale business interruptions, as described in the next section.

r

_Creating_simpleinstructions (with a deliberate emphasis on simple) on where key personnel should go, and whom they should contact, in the event of a disaster.

r

Working with, being aware of and testing supplier disaster recovery plans. This would include communications, payments, security, safe custody and so forth, and may necessitate diversiﬁcation of suppliers.

Ultimately, it is up to executive management and the board, working closely with its reg- ulators, to decide what type of disaster the company is trying to protect against. It is critical that each ﬁrm then prepares itself for such an eventuality by conducting quarterlyreal-time testsof the control steps mentioned above (e.g. testing daily data back-up and retrieval pro- cedures, migrating business, risk and ﬁnance personnel to offsite trading locations, having them conduct “mock trades”, and so forth). We strongly believe that there is a need to di- vulge the premise for this planning and for management to publicly sign off that it has funded and regularly tested the necessary working plans. Hopefully, since the unfortunate events of September 11, everyone is taking disaster recovery plans more seriously!

Disaster recovery risks can be broad-based, such as those based on September 11-type scenarios, or institution-specific. When disaster strikes many institutions at the same time, the possibility of systemic interruptions is heightened and the chance of much larger financial losses is far greater. When institution-specific, problems can still be serious but should be more contained – and much less destructive from a system-wide perspective. The market, however, may be less forgiving to that single organization in a single-institution event.

Dalam dokumen Erik Banks and Richard Dunn (Halaman 67-70)