Article information

  • Title: When disasters strike distributed systems - disaster recovery planning - includes product directory - Buyers Guide
  • Author: Mary Hanna
  • Journal: Software Magazine
  • Year of publication: 1995
  • Volume: Sept 1995
  • Publisher: Rockport Custom Publishing, LLC

When disasters strike distributed systems - disaster recovery planning - includes product directory - Buyers Guide

Mary Hanna

January 16, 1995 began as a typical Monday for Texas Instruments' worldwide command center in Dallas. However, all that changed at 2:51 p.m. CST when the center's network administrators discovered they had lost their communications link with KTI Semiconductor Inc. in Nishiwaki, Japan. (KTI, a manufacturer of computer memory chips, is jointly owned by TI and Kobe Steel.) Within an hour, TI's contingency and disaster recovery planning team learned why - an earthquake measuring 7.2 on the Richter scale had struck Kobe, a city 300 miles outside of Tokyo.

A second TI computer center, in Miho, Japan, was not affected. That center was able to send word that KTI Semiconductor was in satisfactory condition, requiring only some equipment recalibration to return to full functionality. However, the Kobe-based KTI computer center, whose systems support the semiconductor plant's automated manufacturing process, had suffered significant damage. The region had lost electrical power and the power company estimated it would take seven days to restore it.

The ability to recover operations following a disaster takes careful planning, and perhaps a little luck. It was TI's fortune to have both. After learning of the disaster, the company formed three recovery teams: TI Japan, TI Dallas, and a team manned jointly by TI and IBM Japan personnel.

According to Ben Taylor, TI's manager of contingency and disaster recovery planning, the Miho site had upgraded its computers just a month before the quake. While IS checked out the new computers, they kept the older IBM 3090 mainframe powered up as backup. "It was a de facto hot site. IBM Japan provided us with several strings of Dasd and a 3725 communications controller," said Taylor. "By Wednesday night, all the computer hardware needed to support KTI's manufacturing site was available at the Miho computer center."

TI sent an additional team, experienced in disaster recovery, to Miho. In the U.S., another team began converting a U.S. manufacturing system recovery process; it was ready within 48 hours for the Miho team. By that time, rescue teams had recovered 2,500 tapes from Kobe's operational tape library and flown them to Miho. These tapes brought the database to within five hours of the time of the quake.

"Kobe's Dasd had been shipped to Miho by an overland route [that took over 3 days], and once it arrived, the team was able to retrieve data up to the exact moment the system went down. This saved $8 million [worth] of work in process," said Taylor.

By January 20, the Kobe system, now in Miho, had gone through initial program load, and communication with TI's global network was reestablished. The KTI plant was up and running just 4.5 days after the quake.

Given that systems availability is critical to the health, and even the survival, of a business, it would seem that disaster recovery plans would be mandatory for large corporations. After all, computer centers are vulnerable to a variety of influences, from natural catastrophes like the Kobe earthquake to those caused by humans.

Nonetheless, according to ICR Survey Research Group, Media, Pa., only 62% of firms responding to a recent survey had a disaster recovery plan. The survey was commissioned by Comdisco Inc., Rosemont, Ill., which offers the ComPAS plan automation system, and Palindrome Corp., Naperville, Ill., which offers Prepare!, a disaster recovery planning tool for LAN environments.

As companies move their computing systems out of the safe confines of the glass house, they face marked vulnerability. Said Bruce Battjer, president of SunGard Planning Solutions Inc., the software division of SunGard Data Systems Inc., Wayne, Pa., "There is no question that the advent of distributed systems has increased the scope of disaster recovery planning efforts. Companies want to protect the entire enterprise, not just the central mainframe."

According to Battjer, disaster planning needs vary because companies are vulnerable to different kinds of disasters. "A terrorist attack is unpredictable. Since it can happen anywhere, most firms should consider the possibility of such attack and plan accordingly," he said. On the other hand, "disasters like hurricanes generally occur in predictable places, along seacoasts, for example. Companies threatened by catastrophes that can cause widespread damage may want to recover their systems a safe distance away from the site of the disaster."

SunGard offers planning software called Comprehensive Business Recovery (CBR). "CBR is an expert systems tool that provides model recovery procedures and plans, which can be customized as needed," said Battjer. "Among the key pieces of information kept in the plan are lists of employees, vendors and customers, data concerning backup tapes, and the minimum [acceptable] recovery configuration of systems."

Some companies facing disaster recovery, he said, place an unnecessary burden on employees by focusing on all systems inventory. "We recommend that disaster recovery planners focus only on what is critical for the recovery location," said Battjer.

TI's Taylor agrees: "Not all business activities are critical. And not all systems require the same level of planning." According to Taylor, TI has a steering committee that assesses the importance of various systems in a recovery effort. These fall into three categories: a "critical" system must be restored within 48 hours; a "secondary" system, within 14 days; and an "inconvenience" system need not be restored.
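As a rough illustration of how the three categories Taylor describes could drive a recovery schedule, here is a minimal Python sketch. Only the tier names and the 48-hour and 14-day windows come from the article; the function names and the sample systems are hypothetical, and nothing here reflects how TI's planning tools actually work.

```python
from datetime import datetime, timedelta

# Tier names and windows are from the article; everything else is illustrative.
RECOVERY_WINDOWS = {
    "critical": timedelta(hours=48),
    "secondary": timedelta(days=14),
    "inconvenience": None,  # need not be restored
}


def restore_deadline(tier, outage_start):
    """Return the latest acceptable restore time, or None if no restore is required."""
    window = RECOVERY_WINDOWS[tier]
    return None if window is None else outage_start + window


def recovery_order(systems, outage_start):
    """Return (system, deadline) pairs that must be restored, earliest deadline first.

    `systems` maps a hypothetical system name to its tier.
    """
    due = []
    for name, tier in systems.items():
        deadline = restore_deadline(tier, outage_start)
        if deadline is not None:
            due.append((name, deadline))
    return sorted(due, key=lambda pair: pair[1])
```

Calling `recovery_order` with an outage time and a name-to-tier mapping yields the systems to restore in deadline order, with "inconvenience" systems left out entirely.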

Taylor's group uses SunGard's Total Recovery Planning tool (TRPS) for computer center disaster recovery planning, and the Living Disaster Recovery Planning System (LDRPS) from Strohl Systems, King of Prussia, Pa., for business systems recovery.

Recovery Begins at Home

Disaster recovery plans differ with, and often depend on, the size of the company. Smaller firms may eschew packaged disaster planning software in favor of in-house solutions that rely heavily on backup. One firm going the in-house route is Marshall & Sterling Corp., an insurance agency in Poughkeepsie, N.Y. Bill Ommerborn, information systems manager, developed his backup and recovery plans based on articles and seminars.

The agency's 285 users, running DOS and Windows, are scattered throughout six sites, connected by seven Novell networks. During the day, the Network Custodian PRO System from Network Custodian Inc., Pawling, N.Y., backs up data using mirroring and duplexing techniques. With mirroring, parallel databases are synchronized with realtime updates. Typically, these databases are physically separated and can provide data recovery in a disaster. With duplexing, data is written to two hard drives simultaneously.
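The idea behind duplexing can be sketched in a few lines of Python. This is only a toy model: two ordinary files stand in for the two hard drives, the writes happen one after the other rather than truly simultaneously, and the function names are hypothetical. It does not represent how the Network Custodian product or Novell's drivers work.

```python
import hashlib
from pathlib import Path


def duplexed_write(data: bytes, primary: Path, secondary: Path) -> None:
    """Append the same bytes to two files that stand in for two hard drives."""
    for target in (primary, secondary):
        with open(target, "ab") as handle:
            handle.write(data)
            handle.flush()


def _digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def copies_match(primary: Path, secondary: Path) -> bool:
    """Check that both copies hold identical data."""
    return _digest(primary) == _digest(secondary)
```

If one copy is lost or damaged, the other still holds every record written so far, and `copies_match` gives a simple way to detect that the two have diverged.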

"Our recovery planning consists mainly of our backups. We back up our files nightly, rotating the backup tapes weekly," said Ommerborn. "Month- and year-end backups are stored offsite. Our equipment can be replaced within 24 hours by our local hardware vendor. This is a satisfactory recovery plan for us."
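A rotation like the one Ommerborn describes might be modeled as follows. The sketch assumes one reusable tape per weekday (so each tape is overwritten a week later) and treats the last calendar day of a month or year as the month-end or year-end backup; those details, the tape labels, and the function name are assumptions for illustration, not the agency's documented procedure.

```python
import calendar
from datetime import date
from typing import Tuple

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def tape_for(day: date) -> Tuple[str, bool]:
    """Return (tape label, stored_offsite) for the backup taken on `day`."""
    last_day = calendar.monthrange(day.year, day.month)[1]
    if day.month == 12 and day.day == 31:
        # Year-end backup: kept offsite.
        return "year-end-%d" % day.year, True
    if day.day == last_day:
        # Month-end backup: kept offsite.
        return "month-end-%d-%02d" % (day.year, day.month), True
    # Ordinary nightly backup: reuse that weekday's tape, kept on site.
    return "nightly-" + WEEKDAYS[day.weekday()], False
```

For example, `tape_for(date(1995, 9, 30))` selects a month-end tape flagged for offsite storage, while an ordinary weeknight maps to that weekday's reusable tape.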

Taking a similar approach is IDK Inc., a small software developer in Thousand Oaks, Calif. President Kevin McCoy said IDK uses Sytos Premium from Sytron Corp., Westborough, Mass., to back up the testing software they design for biopharmaceutical and financial planning firms. McCoy's main workstation contains over a gigabyte of information, and his server contains 300Mb. Driven by his proximity to the epicenter of the Northridge, Calif., earthquake, McCoy copies everything to tape every night even though the effort takes more than two hours. "We would be out of business if we were down more than two days," he said.

Larger companies, as well, may opt for growing their own disaster planning solutions. The Stentor Resource Center Inc., Ottawa, for one, is a proponent of the independent approach, said Charles Wesley-James, manager of Service Management System (SMS). Stentor Resource Center is the R&D arm for Canada's 10 largest phone companies. Wesley-James is the primary disaster recovery planner for Enhanced 1-800 SMS, which handles the routing of all 1-800 numbers in Canada. SMS also lets the Resource Center, using Sun workstations, modify 1-800 numbers for sites throughout Canada.

"At any time, SMS is able to change where a 1-800 number is going in five minutes or less. Even if a site goes down, another site can handle the modifications," Wesley-James said. The software that handles routing functions was developed in-house. The database is built on SQL from Tandem Computers Inc., Cupertino, Calif.

"Outages in 1-800 service are not tolerated," added Wesley-James. "To ensure this performance level, backup of 1-800 number routing is instantaneous."

The Resource Center uses Tandem's Remote Duplicate Database Facility (RDF) to provide realtime backup of the 1-800 number database at a remote site five miles from the main building. It also maintains multiple lines in various underground areas, multiple power supplies, and backup air conditioning equipment. "Our backup and recovery decisions are clear-cut because they are government-mandated," said Wesley-James.

MCI Telecommunications Corp., Washington, D.C., has similar performance requirements for its service. MCI has four large data centers - three dedicated to lines of business, and one dedicated to testing, development and service assurance. According to Russ Archibald, the Colorado Springs, Colo., director of service planning and management, this fourth data center provides backup for mission-critical applications written for IBM and Hitachi mainframes. Key applications written for distributed systems, including 1-800 and 1-900 call translations, have built-in redundancy.

MCI's disaster plan includes a process recovery team, which identifies mission-critical systems. These systems must be recovered within eight hours after a disaster.

Because the data in call processing and customer information systems is critical, "we use the Symmetrix Remote Data Facility (SRDF) [from EMC Corp., Hopkinton, Mass.] to mirror data between the data centers," said Archibald. "We switch the mirroring from site to site and can flip to the new site within 20 minutes.

"MCI has its recovery plans on paper and we test these plans at least once a year," he said. The plans include detailed system documentation, vendor lists and an escalation list, which identifies which managers need to be informed of outages based on seriousness.

Further, said Archibald, "Any change to an application or system must go through a change management procedure, to ensure that recovery plans are still operable."

However, Karen Strohl, senior vice president of Strohl Systems, questions the advisability of using paper for disaster planning purposes. "Using a hard copy document as the recovery script would be very difficult for a company. Plans are not one-time projects. They must be easily maintainable and, most important, they must be readily available during the disaster," she said.

Strohl's LDRPS is a PC server-based application that runs under Microsoft Windows using a Microsoft Access database. LDRPS lets users coordinate information regarding key people, functions they have to perform, backup sites and necessary resources.

"The LDRPS package is loaded onto a server so that different departments can keep their plans up to date," said Strohl. "On a regular basis, the database can be downloaded to a notebook computer that can be kept off-site." Strohl Systems provides its customers with a TI 4000E series Notebook for this purpose, said Strohl.

Steve Otto, systems manager at Octel Corp., a voice-mail equipment manufacturer in Milpitas, Calif., is formalizing his disaster planning with LDRPS. The effort will ensure up-to-date backups, off-site storage for backup tapes, and duplicate equipment.

Octel recently lost an entire disk, containing all its data, when a disk drive malfunctioned. "We were able to get another physical disk drive from HP to replace the one on our HP 3000 that went bad. After that, we were able to reload the whole disk from tapes that had been created by Unison's [Software Inc., Santa Clara, Calif.] high-speed Unix server backup product, RoadRunner."

Another option for companies lacking the necessary backup tapes to reload Dasd is to use a data recovery service. One such provider is Ontrack Data Recovery, Minneapolis. "We assist other companies in their disaster recovery plans by recovering data from their Dasd in our laboratories," said Stuart Hanley, Ontrack's data recovery engineering manager.

Practice Makes Perfect

While disaster planning is critical to a successful recovery effort, it's a waste of time unless the plan works when the real disaster strikes. And the only way to make sure a plan can work is to execute it. Strohl at Strohl Systems strongly recommends that her customers give their plans a trial run. "Nestle USA [Glendale, Calif.] performs a full-scale test of its recovery plan every year at the insistence of its senior management. The simulated disaster is announced and everyone goes into recovery mode, following the plan exactly. Key people check into selected hotels and begin to execute the tasks described in the plan."

The Merchants Bank of New York City also believes in testing the efficacy of its plans. Tom Stackhouse, MIS director, recently ran a two-day test of Merchants' recovery procedures using Wang's Mobile Recovery Center, a trailer housing eight workstations and operator consoles. For recovery operations, the truck can utilize hotel rooms, which have the required phone lines and conveniences such as air conditioning.

"We simulated a fire in one of the bank's branches on a Sunday night and informed Wang at that time. The trailer was set up at 6 a.m. Monday and by noon, we were up and running," said Stackhouse. "We loaded our software and our prior day's data and our users started working, pulling in deposit reports data. In effect, the users performed acceptance testing.

"A post mortem on the test is planned and problems will be fixed," he added. "So far, the only difficulty was in matching our printer fonts with the recovery equipment's."

Stackhouse feels that Merchants' recovery test was successful because of the support it received from senior management. "The CEO was the head tester and he also led the review team. The head of branch operations and the operations manager were also involved in the planning and execution of the recovery test," he said. Such involvement will become increasingly important as distributed systems proliferate throughout enterprises. Disasters are a fact of life, and preparation will benefit from a commitment from all levels of a company.

Representative Disaster Recovery and Backup Tools

COPYRIGHT 1995 Wiesner Publications, Inc.
COPYRIGHT 2004 Gale Group
