Thursday, September 17, 2009

Disaster Recovery: An Insight


Disaster Recovery is a subset of an organisation's business continuity plan, in which business users set their Recovery Point Objective (RPO) and Recovery Time Objective (RTO). To achieve the desired RPO and RTO, an organisation needs six important components: compute, networks, storage, software, people and a Disaster Recovery (DR) site. Let us look at what RPO and RTO are.
  • RPO : The Recovery Point Objective is the last known point-in-time copy to which data can be rolled back. Ex: if the last asynchronous replication of a database completed at 11:00 PM and a disaster occurred at 11:30 PM, the 30 minutes of data written in between is the RPO for the customer. Customers typically replicate continuously to the DR site to reduce the RPO; another option is to restore from the redo log and the most recent archive log at the primary site as well. With synchronous replication the RPO is zero seconds, because the primary host acknowledges an I/O only after it has been committed at the target site.
  • RTO : The Recovery Time Objective is the time taken to resume or restart services at the DR site, measured from the time the disaster occurred. Ex: a data-centre-wide fire breaks out at 8 AM on a business day; you fail over to the Disaster Recovery site and bring up the servers, applications and storage, and operations resume at 10 AM the same day. This translates into an RTO of 2 hours.
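
The two worked examples above reduce to simple timestamp arithmetic. Here is a minimal sketch in Python; the function names and the calendar date are illustrative assumptions, and only the clock times come from the examples.

```python
from datetime import datetime, timedelta

def data_loss_window(last_replication: datetime, disaster: datetime) -> timedelta:
    """Data written after the last replicated copy is lost in the disaster."""
    return disaster - last_replication

def recovery_time(disaster: datetime, service_restored: datetime) -> timedelta:
    """Elapsed time from the disaster until services run again at the DR site."""
    return service_restored - disaster

# The calendar date is arbitrary (an assumption); only the clock times matter.
rpo_example = data_loss_window(datetime(2009, 9, 17, 23, 0), datetime(2009, 9, 17, 23, 30))
rto_example = recovery_time(datetime(2009, 9, 17, 8, 0), datetime(2009, 9, 17, 10, 0))

print(rpo_example)  # 0:30:00
print(rto_example)  # 2:00:00
```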

Let us get familiar with the scenarios in which a disaster may be declared:
  • Compute System crash
  • Application crash scenario
  • Hard Disk crash scenario
  • Network failure scenario
  • Data centre failure scenario
  • Natural disaster (earthquake, fire, tsunami, etc.)
  • Power outage scenario
Please note: It is not always advisable to move operations to the DR site. Say you have a 2-hour RTO in your SLA with the customer. An OS crash, for example, can be countered by rebuilding the OS and bringing the server up in 1 hour, so the customer's business is up and running within the RTO requirement anyway. Fail over to the DR site only when you see that the data centre infrastructure will not come back up in time to meet the RTO requirement of the business.
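
The rule in the note above can be written as a one-line comparison. This is only a sketch: the function name, the `elapsed_hours` parameter and the 24-hour figure are assumptions for illustration, and a real decision would also weigh the time the failover itself takes.

```python
def should_fail_over(estimated_local_recovery_hours: float,
                     rto_hours: float,
                     elapsed_hours: float = 0.0) -> bool:
    """Return True when recovering at the primary site would miss the RTO."""
    return elapsed_hours + estimated_local_recovery_hours > rto_hours

# OS crash: the rebuild takes about 1 hour against a 2-hour RTO, so stay at the primary site.
print(should_fail_over(1.0, 2.0))   # False
# Data centre not expected back for 24 hours (hypothetical figure): fail over to the DR site.
print(should_fail_over(24.0, 2.0))  # True
```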

Where should the DR site be?
  • The DR site should not be in the same geography or the same city as the primary site, unless you are doing synchronous replication over dark fibre. Even then, such a site is a near site for the business and not a true DR site.
  • The DC and the DR site should be fed by different power grids and, if possible, by separate power plants as well.
  • The DR site should be close enough for the recovery team members to reach in the event of a disaster.
  • The DR site should preferably be in a Tier 1 city, because the majority of principal vendors, IT infrastructure companies and service providers are based in Tier 1 cities and can deliver a 2-hour onsite response, a spare turnaround of 2 to 4 hours, and manpower.
  • The DR site should not be in a Tier 2 city, because there the time taken for an engineer to travel onsite, identify the failed parts and replace them is typically long, and the RPO and RTO requirements may not be met.
  • The DR site should typically not be in the same city, let alone the same building, unless you want a synchronous replication solution over dark fibre to a nearline site. The reason is that a single disaster, e.g. loss of power from the city grid or an earthquake across the region, would leave both sites unable to run IT operations.
How to approach a DR site solution?
  • Identify the applications and databases that need to be replicated to the DR site.
  • Classify each application and database in terms of its RPO and RTO requirements.
  • Each application and database may have a different level of importance or criticality to the customer. For ex: for a bank offering online banking, even 10 minutes of downtime of the core banking application might cause a loss of millions of rupees, because account holders cannot transact. Whereas if Exchange is down for 1 hour at the same customer, the business loss to the bank will not be of the same degree.
  • Based on the above study, you can choose storage-based or host-based replication.
Which replication method should one choose?
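
The classification step above can be sketched as a small inventory sorted by criticality. Everything here is illustrative: the `Application` type, the 30-minute threshold and all RPO/RTO figures are assumptions, not values from any real environment.

```python
from dataclasses import dataclass

@dataclass
class Application:
    name: str
    rpo_minutes: int  # maximum tolerable data loss
    rto_minutes: int  # maximum tolerable downtime

def business_impact(app: Application, high_impact_rto_limit: int = 30) -> str:
    """Label an application 'high' impact if its RTO is at or under the assumed limit."""
    return "high" if app.rto_minutes <= high_impact_rto_limit else "low"

# All RPO/RTO figures below are hypothetical placeholders.
inventory = [
    Application("Exchange mail", rpo_minutes=60, rto_minutes=240),
    Application("Core banking", rpo_minutes=0, rto_minutes=10),
]

# Most critical first: shortest RTO, then shortest RPO.
ranked = sorted(inventory, key=lambda a: (a.rto_minutes, a.rpo_minutes))
for app in ranked:
    print(f"{business_impact(app)} impact: {app.name} "
          f"(RPO {app.rpo_minutes} min, RTO {app.rto_minutes} min)")
```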

Deciding which replication method is better for a particular business is pure economics. It also depends on how long the business can survive without IT operations, and how much business loss is incurred when a disaster occurs. For ex: for a bank with a core banking solution (Tier 1 application) in place, a downtime of 1 business day caused by a disaster will result in a loss ranging from lakhs to millions of dollars, depending on the nature of the products and services the bank offers. But if Exchange (Tier 2 application), with 3000+ mailbox users, is down for 1 business day at the same bank, the effect on business revenue and operations might be minimal.
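
The economics can be made concrete by comparing the loss from an outage with the cost of the replication solution. The sketch below is deliberately simplified (a single outage, no probabilities), and every figure in it is a made-up placeholder in unspecified currency units.

```python
def downtime_loss(loss_per_hour: float, hours_down: float) -> float:
    """Business loss caused by an outage of the given length."""
    return loss_per_hour * hours_down

def replication_is_justified(loss_per_hour: float,
                             outage_hours: float,
                             solution_cost: float) -> bool:
    """True when one outage would cost more than the replication solution."""
    return downtime_loss(loss_per_hour, outage_hours) > solution_cost

BUSINESS_DAY_HOURS = 8    # assumption
SOLUTION_COST = 250_000   # hypothetical cost of a replication solution

# Hypothetical hourly losses for a Tier 1 and a Tier 2 application.
print(replication_is_justified(100_000, BUSINESS_DAY_HOURS, SOLUTION_COST))  # True
print(replication_is_justified(500, BUSINESS_DAY_HOURS, SOLUTION_COST))      # False
```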

So in summary, one needs to understand four important points:
  • The RPO and RTO requirements of each application in the customer environment.
  • Classify applications into low and high business impact.
  • Once classified, choose a cost-effective solution for each application.
  • On that basis, decide between replication methods such as host-based replication, storage-based replication, or even FTP for a few small applications.
Replication Types :
  1. Host-based replication : This method uses the host server's resources to replicate data to the DR site. The replication software is installed on the host server from which the data is to be replicated.
  2. Storage-based replication : This method does not use any host server resources for replicating to the DR site. The replication software runs on the storage system, so time stamping, log sizing, tracking of delta changes, etc. are performed by the storage itself.
Let us look at the Host based replication architecture.


In this architecture, for the sake of example, let us consider 2 database servers running Windows Server 2003 with Oracle 11g.

Let us look at the architecture workflow.
  1. Servers are connected to the LAN switch through IP ports (either 1 Gbps or 10 Gbps).
  2. Servers are configured as Active/Passive or Active/Active, as required.
  3. Connection to storage is established using FC HBAs (FC SAN) or Ethernet ports (iSCSI SAN).
  4. One or more LAN switch ports (for redundancy) are connected to the primary site WAN router.
  5. The connection between the primary site WAN router and the DR site WAN router is then established.
  6. With host-based replication, data is replicated either synchronously or asynchronously.
  7. The DR site WAN router then communicates with the DR server through the LAN switch.
  8. The DR server then writes to its storage.
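
The difference between the two modes in step 6 can be shown with a toy model. This is not how any particular replication product works; the `ReplicatedVolume` class and its methods are hypothetical and only illustrate why synchronous replication loses no writes while asynchronous replication can lose the un-shipped backlog.

```python
from collections import deque

class ReplicatedVolume:
    """Toy model of a primary volume replicating writes to a DR copy."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary = []       # writes committed at the primary site
        self.dr_copy = []       # writes committed at the DR site
        self.pending = deque()  # writes not yet shipped (asynchronous mode only)

    def write(self, block: str) -> None:
        """Commit a write; in synchronous mode the DR copy is updated before returning."""
        self.primary.append(block)
        if self.synchronous:
            self.dr_copy.append(block)
        else:
            self.pending.append(block)

    def ship_pending(self) -> None:
        """Asynchronous mode: send the queued writes over the WAN link to the DR site."""
        while self.pending:
            self.dr_copy.append(self.pending.popleft())

    def writes_lost_on_disaster(self) -> int:
        """Writes present at the primary site but missing at the DR site."""
        return len(self.primary) - len(self.dr_copy)

sync_vol = ReplicatedVolume(synchronous=True)
async_vol = ReplicatedVolume(synchronous=False)
for block in ["w1", "w2", "w3"]:
    sync_vol.write(block)
    async_vol.write(block)

print(sync_vol.writes_lost_on_disaster())   # 0
print(async_vol.writes_lost_on_disaster())  # 3
async_vol.ship_pending()
print(async_vol.writes_lost_on_disaster())  # 0
```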
Let us consider a 1 + 1 Active/Passive HA configuration of the entire infrastructure and list the IT components required for host-based replication of an SAP three-tier architecture.
  • Physical servers with RISC or x86 CPUs (3-tier architecture, i.e. Web, App and DB servers, with 2 nodes per tier for the Active/Passive configuration)
  • OS licences, Standard or Enterprise edition, as per the architecture and OS platform chosen
  • Network switches: 2 nos. for network high availability, i.e. each node uses two Ethernet ports, with 1 connection to switch #1 (active switch) and 1 connection to switch #2 (passive switch)
  • Network cables: 6 nos. of Cat 6 cables for 3 physical servers (2 per server)
  • Each server needs 2 FC HBAs (single or dual port) and 2 FC LC-to-LC cables.
  • SAN switches: 2 nos. of FC SAN switches for SAN high availability, i.e. each node uses two FC HBA ports, with 1 connection to switch #1 (active switch) and 1 connection to switch #2 (passive switch)
  • One physical storage system with a minimum of four front-end 4 Gbps FC ports (most storage vendors offer 4 FC ports as standard these days)
  • WAN connection: two WAN ports per WAN router, for a highly available connection from each network switch
  • Leased line connection from one site to the other for replication
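
The per-server quantities in the list above (2 Ethernet connections, 2 FC HBAs and 2 FC cables per server, plus 2 LAN and 2 SAN switches) lend themselves to a small calculator. The list counts cables for 3 physical servers, while 2 nodes for each of 3 tiers would come to 6 servers, so this sketch takes the server count as a parameter rather than assuming either reading. The function name and dictionary keys are illustrative.

```python
def host_side_components(servers: int) -> dict:
    """Component counts for one site, using the per-server quantities listed above."""
    return {
        "servers": servers,
        "cat6_cables": servers * 2,   # one to each LAN switch
        "fc_hbas": servers * 2,
        "fc_lc_cables": servers * 2,  # one to each SAN switch
        "lan_switches": 2,
        "san_switches": 2,
    }

print(host_side_components(3)["cat6_cables"])  # 6
print(host_side_components(6)["cat6_cables"])  # 12
```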

Types of DR site :
  • 2-way DR solution (DC to DR replication, either synchronous or asynchronous)
  • 3-way DR solution (DC to nearline site replication in synchronous mode, and DC to DR replication in asynchronous mode)
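
The two layouts can be written down as lists of replication links, which is a convenient form for documentation or for checks in scripts. The tuple layout and the `targets` helper are assumptions made for this sketch.

```python
# Each link is (source site, target site, replication mode).
TWO_WAY = [("DC", "DR", "synchronous or asynchronous")]
THREE_WAY = [
    ("DC", "Nearline", "synchronous"),
    ("DC", "DR", "asynchronous"),
]

def targets(topology, source="DC"):
    """Map each site that receives data from `source` to its replication mode."""
    return {target: mode for src, target, mode in topology if src == source}

print(targets(TWO_WAY))
print(targets(THREE_WAY))
```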