Business Continuity is a Top Concern
It’s no wonder that BC/DR planning is getting more attention. We remember the outages and financial losses caused by disasters ranging from floods, tornadoes, and hurricanes to the tsunami in Japan. You have probably seen the statistics warning that 75% of companies without business continuity plans fail within three years of a disaster, and that 43% of those with no emergency plan never reopen. Government regulations have also dramatically increased data replication and compliance requirements. These events have raised awareness of the need to maintain productivity within a company, sustain value chain relationships, and deliver continued services to customers and partners, all of which can be difficult when applications and user connections are being moved to a new data center location.
What BC/DR is Really All About
Business Continuity (BC) is the ability to maintain operations and service in the face of a disruptive event. BC requires the availability of computing, application service, physical network access and network services, as well as user/client access to this infrastructure. The technology and infrastructure to maintain continuity can include virtualization, clustering, failover, failback, server hardware, network and network services, remote data center facilities, replication services, and redundant shared storage. With BC, the failover of a service is measured in seconds or less. Backup technologies cannot provide this level of continuity of service.
Disaster Recovery (DR) is a broad concept that can include the recovery of people and facilities. Disaster recovery typically requires extensive manual methods to bring the IT environment up to an operational state. For example, services need to be reestablished, servers rebuilt, applications reinstalled, and data restored. Infrastructure used for DR can include virtualization, server hardware, network and networking services, remote data center facilities, replication, backup to disk, backup to tape, and tape vaulting. DR is typically measured in hours or days. However, some well-designed DR plans incorporate highly automated failover to designated disaster recovery hot sites. As a result, the recovery can be achieved far more quickly.
It’s All about Reducing RTO and RPO
With their BC/DR plans, organizations are looking to improve their RTO and RPO. Recovery Time Objective (RTO) is the period of time within which systems, applications, or functions must be recovered after an outage. This can be measured in minutes, hours, or days, depending on the criticality of the resource. Do all resources have the same RTO? The answer is no. If the primary data center goes down, access to business-critical databases tends to be far more important than access to users’ home directories.
Recovery Point Objective (RPO) represents the allowed age of the data when restoration occurs. RPOs are assigned according to data criticality and come down to how much recently created or modified data an organization can tolerate losing. Shorter RPOs can be achieved by using continuous or near-continuous data protection technologies. A disaster recovery solution is designed to meet the RTO and RPO requirements, which, as noted above, vary with the criticality of the application and its data.
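To make the two objectives concrete, here is a minimal Python sketch of how per-resource RTO and RPO targets might be recorded and checked against a recovery test. The class name, the function name, and every number in it are hypothetical values chosen for illustration; they are not taken from any standard or product.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjective:
    """Hypothetical record of the objectives assigned to one resource."""
    name: str
    rto: timedelta  # maximum tolerated time to restore service
    rpo: timedelta  # maximum tolerated age of the restored data


def meets_objective(obj: RecoveryObjective,
                    downtime: timedelta,
                    data_age: timedelta) -> bool:
    """Return True if a recovery stayed within both the RTO and the RPO."""
    return downtime <= obj.rto and data_age <= obj.rpo


# Illustrative targets only: a critical database gets far tighter
# objectives than users' home directories.
database = RecoveryObjective("business-critical database",
                             rto=timedelta(minutes=5),
                             rpo=timedelta(seconds=30))
home_dirs = RecoveryObjective("user home directories",
                              rto=timedelta(hours=24),
                              rpo=timedelta(hours=12))

# The same recovery (4 hours of downtime, data 1 hour old) is acceptable
# for one resource and not for the other.
for resource in (database, home_dirs):
    ok = meets_objective(resource, timedelta(hours=4), timedelta(hours=1))
    print(f"{resource.name}: {'within' if ok else 'outside'} objectives")
```

The point of the sketch is simply that RTO and RPO are set per resource, so a single recovery outcome can pass for one tier and fail for another.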
The System Protection Continuum
To meet this need to replicate data for business continuity, a number of data backup schemes have developed over the years along a continuum. These schemes offer different levels of recovery, recovery speed, and data loss. At the origin, the RTO and RPO are least stringent, and some data might never be recovered. At the far right, the RTO and RPO become zero. This is the most stringent scenario, with zero data loss and instantaneous recovery. At this point the cost/system value is also very high. Moving up the continuum, we see asynchronous and synchronous replicated data centers, and we arrive at where we are now: the desire for near-real-time replication and the goal of active/active data centers.
Choosing the Right Replication
You have two basic choices for replication, synchronous and asynchronous, and the network affects them differently. With synchronous replication, a write acknowledgment is returned to the application only after the data has been synchronized to the remote disk. This helps to achieve a zero-loss/zero-downtime RPO/RTO. However, because data replication is in the critical path, network latency is critical, and synchronous replication is usually limited to remote sites within a certain distance. Network bandwidth availability can also affect application performance, and extended bandwidth limitations can disrupt synchronous replication and may require intervention.
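The distance limit follows from simple arithmetic, sketched below in Python. The model is a deliberate simplification and rests on stated assumptions: light travels through fiber at roughly 200 km per millisecond, the remote write starts only after the local write completes, and switching, queuing, and protocol overhead are ignored. Function names and the sample figures are hypothetical.

```python
# Approximate propagation speed in optical fiber: about 200 km per millisecond.
FIBER_KM_PER_MS = 200.0


def sync_write_latency_ms(local_write_ms: float,
                          remote_write_ms: float,
                          distance_km: float) -> float:
    """Estimate the time before a synchronous write is acknowledged.

    Assumes the local write, the network round trip, and the remote
    write happen one after another (a simplifying assumption).
    """
    round_trip_ms = 2 * distance_km / FIBER_KM_PER_MS
    return local_write_ms + round_trip_ms + remote_write_ms


def max_sync_distance_km(latency_budget_ms: float,
                         local_write_ms: float,
                         remote_write_ms: float) -> float:
    """Largest site separation that keeps a write within the latency budget."""
    network_budget_ms = latency_budget_ms - local_write_ms - remote_write_ms
    return max(0.0, network_budget_ms * FIBER_KM_PER_MS / 2)


# Illustrative numbers: 0.5 ms per disk write, sites 100 km apart.
print(sync_write_latency_ms(0.5, 0.5, 100))   # 2.0
# With a 5 ms per-write budget, the sites can be at most this far apart.
print(max_sync_distance_km(5.0, 0.5, 0.5))    # 400.0
```

Real networks add more delay than this lower bound, so the practical distance for synchronous replication is usually shorter than the figure such a calculation gives.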
With asynchronous replication, data replication is handled in the background: the application continues as soon as the write is acknowledged by the local data store and doesn’t wait for a write acknowledgment from the remote site. This enables deployment of a DR solution without network latency degrading application performance. DR solutions that span long distances typically use asynchronous replication. Network performance does affect the RPO, because the replicated data at the target lags behind the source. Any network issue, such as reduced bandwidth availability or increased latency, can increase the RPO and hence negatively impact the RTO.
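The effect of bandwidth on the RPO can be seen in a toy simulation. The Python sketch below is an illustrative model, not how any particular replication product works: it assumes writes are replicated strictly in order, time advances in one-second steps, and the link drains a fixed number of megabytes per second. The function name and the sample workload are hypothetical.

```python
from collections import deque


def replication_lag_seconds(writes_mb_per_s, link_mb_per_s):
    """Return, for each second, the age of the oldest unreplicated write.

    writes_mb_per_s: data written locally during each one-second step.
    link_mb_per_s:   data the replication link can ship per second.
    A lag of 0 means the remote copy is fully caught up.
    """
    pending = deque()  # entries are [second_written, mb_remaining]
    lags = []
    for now, written in enumerate(writes_mb_per_s):
        if written > 0:
            pending.append([now, written])
        budget = link_mb_per_s
        # Ship the oldest data first until this second's budget is spent.
        while pending and budget > 0:
            sent = min(budget, pending[0][1])
            pending[0][1] -= sent
            budget -= sent
            if pending[0][1] <= 0:
                pending.popleft()
        # Lag is measured at the end of the step, hence the +1.
        lags.append(now - pending[0][0] + 1 if pending else 0)
    return lags


workload = [10, 10, 10, 10]                   # 10 MB written every second
print(replication_lag_seconds(workload, 10))  # [0, 0, 0, 0]
print(replication_lag_seconds(workload, 5))   # [1, 1, 2, 2]
```

When the link keeps up with the write rate, the remote copy stays current. When it does not, the lag keeps growing, and that lag is the data that would be lost if the primary site failed at that moment.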
Improving Application Availability Levels
The goal of these replication schemes is to reach the desired application availability levels, and the focus of BC/DR planning has been on how to achieve high application availability. Continuous availability, or active-active data centers, is the new industry goal, where applications in the DR site can be used to load balance traffic in addition to serving as a DR site. Much of the focus in getting to this state has been on the wide area network and the need for L2 stretch to better enable virtual machine mobility. However, I think the real question is what is needed to ensure that we meet our goals of maintaining productivity without compromising security, continuing to deliver service to our customers, and meeting government compliance mandates.
The Answer is a Comprehensive View of BC/DR
I’m going to propose that the answer is more than data replication, more than L2 stretch, and more than having active/active data centers. We need to take a more comprehensive look at the building blocks of a robust BC/DR solution: application resources, user access, security, and the network elements. Yes, we need to maintain availability of applications, but we also need to ensure that users can reach those applications. We need to ensure that access to the applications and application flows are secure. We also need to simplify and tune the network to ensure application performance. How we do this was covered in a webinar on October 25; see the recording, Key Considerations in Designing for Data Center Business Continuity.