Business Continuity and Disaster Recovery
Business Continuity and Disaster Recovery is a major initiative being undertaken at WNA.
History
WNA has been developing Disaster Recovery and Business Continuity over many years. WNA's first DR project started in 2005. The objective was to replicate the IFS2002 ERP database to a co-lo site at the Cincinnati Bell Hosting Center (CBTS). Using an application called Replistor, the IFS database was replicated in real time to an EMC CX-3000 SAN. From there, Replication Manager was used to restore the database to a fail-over Production server. For connectivity to CBTS, WNA added a T-1.
Success was limited. First, replication was very slow, despite the fact that the incremental growth of the IFS database was probably no more than 10-20 MB a day. There were also limitations in the replication functionality. The most glaring was that replication had to be stopped before anyone could log into the database to verify the data. Failing over to the replicated server was also of limited effectiveness: there were no printing services, and access to the server would be limited to a skeleton number of users dedicated to entering Customer Orders, applying cash and entering Supplier Invoices. Nor was this a solution if WNA had a network failure. Finally, we had no fail-back strategy.
With the implementation of IFS App 7 the solution became redundant as there were many more IFS Servers and utilities that would have been prohibitively expensive to replicate in this fashion effectively. It was decided to close down the co-lo site and reassess the DR/BC strategy after the IFS upgrade to App 7.
Lessons Learned
- We needed to take a more holistic approach to DR that provided solutions for scenarios such as Hardware/Software and Network failure
- We needed to provide a DR solution that encompassed the whole WNA suite plus the newly implemented third-party integrated applications and SaaS services such as Radley, Transite, Intek, Mattec PMC, Syncada, etc.
- We needed to provide seamless connectivity to users in the event of a failure and have a fail-back strategy
- It would require a significant investment to meet WNA's objectives.
A New Approach
We stripped DR/BC back to its core as we undertook a full review of our DR capabilities. The first phase was to think in terms of Business Continuity, i.e. doing everything in our power to prevent a disaster in the first place.
- First, rebuilding our DR Center
- Second, developing and rewriting all of our operational procedures
- Third, rethinking our backup routines, as well as securing our backups and restores
- Installing a power generator and replacing UPSs to ensure a smooth shutdown of our servers
- Keeping spares on site
- Ensuring that server warranties and server support contracts were up to date.
A Defining Moment
As (bad) luck would have it, only two days after WNA went live on the upgraded IFS App 7, the Data Center was hit by a massive ice storm that cut power and network for three days. Fortunately it hit on a Friday and WNA decided to close the Massachusetts site, so the impact was limited to the other, smaller sites. Power was restored over the weekend but IFS was not brought back up until noon. It had suffered a hard crash, but Oracle is a resilient database and there were no corruptions once it had recovered. However, it was not until Wednesday that email was fully restored for all users. This added another requirement to our DR plan, namely no disruption to email. See Mimecast for an explanation of how that was solved.
The reason why it was a defining moment was that the ice storm and disruption to IT services took place at the same time as a Board Meeting and the board dictated that disruptions like these should never happen to this extent again. It also meant that the board became more open to making the necessary investment to provide a successful solution.
The New DR/BC Plan
Introduction/Background
WNA utilizes an ERP application called IFS. It has been in use at WNA for seven years. All sites use the IFS software and over the years it has replaced many of the ad-hoc programs used at the sites. Consequently, IFS is the key enterprise application that supports the business. It is used to record sales, purchasing, shop floor, MRO and warehouse transactions. It records all the financial transactions and is used for the planning of MRP and MPS.
Over the years WNA has also taken the opportunity to integrate and interface IFS with selected ‘best of breed’ applications. Examples include the Radley EDI application for the transmission of sales orders, invoices, etc. The IFS application uses an Oracle 10g Release 2 database running on a Linux operating system. The IFS data is located on an EMC AX-4 SAN. There are a number of other servers associated with the IFS hardware and application environment, such as Extended Servers, Print Servers, etc. These are a mixture of 32-bit and 64-bit Windows servers running on HP blades.
WNA also has integrations with other applications within and outside the network. These interfaces include one to a Production Monitoring System (Mattec) and another to a web-based Transportation Optimization application (Transite). Other interfaces are also in the works and will be deployed in the next 6 to 18 months. WNA's entire Production environment as described above is replicated in Development and Test environments, although we don’t utilize the SAN in Development or Test.
Users at Chelmsford and the other sites connect to the IFS ERP system in one of two ways. Most users run an Oracle client on their desktop. This connects to an Application server, which serves up the IFS GUI Runtime application. The Application server sends and receives data from the IFS database. Mobile users connect to IFS using a Citrix client on their laptop. This has an Oracle client and IFS Runtime that connects to the database. A visual representation of the IFS infrastructure can be seen in the series of illustrations below.
Explanation
- PROD Database Server (64-bit Linux) – The IFS application logic
- SAN – The IFS transactions
- Extended Server (64-bit Windows) – IFS Print Agent/Help Files
- Print Server (32-bit Windows) – IFS Print Server
- Utility Server (32-bit Windows) – Demand Planner Server and Business Modeler (future replication requirement TBA)
- EDI Server (32-bit Windows) – EDI transactions interfaced with IFS (Phase 2)
- Application Server (32-bit Windows) – One at each site. Users connect to the IFS Runtime through this server
- Citrix Farm (32-bit Windows and Internet Gateway) – Users connect over the internet through Citrix to the Application server
Note: There are TEST and Development Database Servers and Extended servers
Disaster Recovery Objective at WNA
The primary objective of the DR project is to ensure the continued operation of the WNA business after an ‘event’ that prevents users from accessing the IFS system. The cause of the ‘event’ can be classified under one of three categories:
1. Corruption of the IFS data that prevents users from logging in and transacting data
2. Damage to the physical environment that prevents access to the IFS application
3. An interruption of the network connectivity to the physical environment and the application
This document will focus on the first two ‘events’. WNA is building a DR solution to recover from either of these two scenarios and continue to operate with as little interruption to its business as possible. As a result, we expect from a DR solution that:
- A. Selected users from all sites will be able to log on to an alternative DR IFS environment within four to eight hours of a disaster
- B. The DR environment that the users log in to will be no more than four hours behind the Production environment. This means that they will only have to re-enter four hours' worth of transactions. However, WNA believes this can be improved upon with the right tools
- C. The users will be able to print reports from IFS or Crystal reports that are within the IFS Runtime
- D. EDI transactions can still be processed (in Phase 2)
- E. The interface to Mattec (PMS system) is active (See later)
- F. The interface to Transite (TMS system) is active (See later)
- G. The Interface to the proposed WMS system is active in Phase 2
- H. The interface to the proposed AP system is active in Phase 2
- I. The interface to the Demand Planning server is active in Phase 2
- J. The users will access the DR environment through Citrix.
- K. IFS Event services that generate messages and emails will be active.
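Objectives A and B amount to a recovery time target of at most eight hours and a data-loss target of at most four hours. As a minimal sketch, a DR test or real event could be scored against those two targets as shown here. The function name, parameter names and timestamps are hypothetical and are not part of any WNA or vendor tool.

```python
from datetime import datetime, timedelta

# Targets taken from objectives A and B above.
RTO_LIMIT = timedelta(hours=8)   # users must be able to log on within 4-8 hours
RPO_LIMIT = timedelta(hours=4)   # DR data may be at most 4 hours behind Production


def check_dr_objectives(disaster_at, users_online_at, last_replicated_at):
    """Return (rto_met, rpo_met, downtime, data_lag) for one DR event or test."""
    downtime = users_online_at - disaster_at
    data_lag = disaster_at - last_replicated_at
    return downtime <= RTO_LIMIT, data_lag <= RPO_LIMIT, downtime, data_lag


if __name__ == "__main__":
    # Illustrative timestamps only; not taken from a real incident.
    disaster = datetime(2012, 1, 9, 9, 0)
    rto_met, rpo_met, downtime, lag = check_dr_objectives(
        disaster_at=disaster,
        users_online_at=disaster + timedelta(hours=6),
        last_replicated_at=disaster - timedelta(minutes=30),
    )
    print(f"downtime={downtime}, objective A met: {rto_met}")
    print(f"data lag={lag}, objective B met: {rpo_met}")
```

Recording these two figures after every DR test gives a simple way to see whether the four-hour lag in objective B is actually being improved upon.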
WNA will maintain a parallel ERP infrastructure in sync with the Production environment. It will do this by utilizing replication software such as DoubleTake and Goldengate to maintain a real-time replicated environment. Goldengate will be utilized for the Oracle database replication, and DoubleTake for all other systems. WNA will utilize a mixture of virtual and physical environments. However, the Goldengate replication of Oracle database transactions will be between physical servers only (physical to physical).
A schematic diagram of the total DR solution that includes Goldengate and Oracle is shown below
There should be no inconsistency of Oracle data being replicated between two systems. This means that if an interruption takes place during replication the target server will roll back any transactions not fully transacted to avoid corruption of the IFS data.
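The consistency requirement above can be illustrated with a small sketch. It is not how Goldengate is implemented; it only shows the general idea of buffering replicated changes per transaction and applying them to the target once a commit record arrives, so that an interrupted stream leaves no half-applied transaction. The record layout and all names are assumptions made for the example.

```python
def apply_committed(change_stream, target):
    """Apply replicated changes to `target`, discarding uncommitted transactions.

    `change_stream` yields tuples: (txn_id, "set", key, value) or (txn_id, "commit").
    `target` is a dict standing in for the DR database.
    Returns the set of transaction ids that were discarded.
    """
    pending = {}  # txn_id -> list of buffered (key, value) changes
    for record in change_stream:
        txn_id, op = record[0], record[1]
        if op == "set":
            pending.setdefault(txn_id, []).append((record[2], record[3]))
        elif op == "commit":
            # Only now do the buffered changes become visible on the target.
            for key, value in pending.pop(txn_id, []):
                target[key] = value
    # The stream ended (or was interrupted): whatever is still pending is dropped.
    return set(pending)


if __name__ == "__main__":
    stream = [
        (1, "set", "order-100", "released"),
        (2, "set", "invoice-7", "posted"),
        (1, "commit"),
        # Replication is interrupted here; transaction 2 never commits.
    ]
    dr_db = {}
    discarded = apply_committed(stream, dr_db)
    print(dr_db)      # {'order-100': 'released'}
    print(discarded)  # {2}
```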
In the event of our current production database being unavailable, the IT Director, Data Center Manager and ERP Administrator would make the decision to use the Disaster Recovery solution. (Initially this DR solution will be located in Chelmsford, MA, the same location as the existing database infrastructure. In future it could be located at a different facility or off premises completely. This is outside the scope of this phase.)
The IT Department would make all the necessary changes to allow users to log in to the IFS application and have access to the IFS services such as the Print Servers and Event servers as outlined above. This could include changes to networking traffic, cabling, UPS’s, backup systems and media, Citrix, etc.
The IT department would allow only limited access to the DR servers in the event of switching over to the DR solution. After establishing that the DR infrastructure was available and the data intact, users would be told up to what point transactions are included in the DR server. Initially, missing transactions would be entered and verified; then users would connect through the Citrix server to IFS. At a minimum WNA would allow selected users to access and enter Customer Orders and then Deliver and Invoice those transactions. The AR department would have access to post cash against open invoices. These departments would be able to print documentation as normal from IFS (canned reports or Crystal reports using the IFS Report Navigator). IAL and views of the IFS data would continue to run on a daily basis. If possible (depending on performance, available Citrix licenses, etc.) we would allow additional users to enter Purchasing and Shop Order transactions.
Interfaces to IFS
- Syncada. AP transactions would continue in the Powertrack system. Reconciling the two systems would be attempted only after the PROD database server comes back on line
- Transite. IFS would continue to send and receive shipping, delivery and costing information. This would be a high priority
- Mattec. IFS would continue to send SO information to the Mattec server. This would be a lower priority
- Adaptive Planning. No sales data would be sent to Adaptive Planning until the Production Database is restored
- Radley EDI. Where possible EDI transactions would take place manually or through a web portal with agreement with the customer. In Phase 2, EDI transactions will also be replicated to DR server.
- Intek. To be determined
Replicating/Backing Up the DR server
The DR server should act as the source server and replicate data back to the original Database server or new server as required. The DR server will need to have the necessary back up agents to do a Cold Back up as required. It is also possible that WNA would restore data from a tape backup (with Archive logs) from the DR server as a more practical solution. Failing back to the production server requires the temporary renaming of that server since the DR virtual server assumes the hostname. DoubleTake then replicates server changes back to the production server. Once all changes are present, the production server is restarted and resumes duties with its original hostname.
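The fail-back sequence described above can be written down as an ordered runbook. The sketch below only produces a dry-run checklist; it does not call DoubleTake or any other product. The temporary hostname suffix, the example hostname and the wording of each step are assumptions made for illustration.

```python
def failback_plan(prod_hostname, temp_suffix="-failback"):
    """Return the ordered fail-back steps for one server as plain strings."""
    temp_name = prod_hostname + temp_suffix
    return [
        f"Rename production server to temporary hostname '{temp_name}' "
        f"(the DR virtual server currently holds '{prod_hostname}')",
        f"Replicate changes from DR server '{prod_hostname}' back to '{temp_name}'",
        f"Confirm all changes are present on '{temp_name}'",
        f"Restart '{temp_name}' so it resumes duties as '{prod_hostname}'",
    ]


if __name__ == "__main__":
    # 'ifs-print01' is an invented hostname used only for illustration.
    for number, step in enumerate(failback_plan("ifs-print01"), start=1):
        print(f"{number}. {step}")
```

Keeping the steps in one ordered list makes it harder to skip the temporary rename, which is the step that avoids two servers claiming the same hostname.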
Additional Requirements
In order to maximize its investment in Goldengate, WNA would like to replicate from the database server to both the DR solution and the TEST server (a ‘one to many’ scenario). By keeping the Test database up to date we will be able to respond more quickly to users’ problems. We will be able to reproduce their issues using the actual data that caused the problem, and to load and test patches using the same data involved in the original issue. In such a scenario the replication between the source and the target Test server would be broken for the duration of the testing. The data in the Test server would change as testing takes place and new patches are loaded. Once testing is complete, it should be possible to restart the replication from the point at which the original replication between the servers was broken.
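The ‘one to many’ requirement above depends on each target keeping its own resume point. The following is a minimal sketch of that idea using a numbered change log and a per-target checkpoint. It does not represent Goldengate's actual mechanism, all names are hypothetical, and it does not address how changes made on the Test server during testing would be reconciled before replication resumes.

```python
class ReplicationTarget:
    """One replication destination with its own resume point."""

    def __init__(self, name):
        self.name = name
        self.checkpoint = 0   # sequence number of the last change applied
        self.paused = False
        self.applied = []     # changes applied so far, in order


def replicate(change_log, targets):
    """Send every not-yet-applied change to each target that is not paused.

    `change_log` is a list of (sequence_number, change) pairs in ascending order.
    """
    for target in targets:
        if target.paused:
            continue
        for seq, change in change_log:
            if seq > target.checkpoint:
                target.applied.append(change)
                target.checkpoint = seq


if __name__ == "__main__":
    dr, test = ReplicationTarget("DR"), ReplicationTarget("TEST")
    log = [(1, "order A"), (2, "order B")]
    replicate(log, [dr, test])

    test.paused = True                       # break replication to TEST for patch testing
    log += [(3, "invoice C"), (4, "order D")]
    replicate(log, [dr, test])
    print(dr.checkpoint, test.checkpoint)    # 4 2

    test.paused = False                      # testing finished: resume from checkpoint 2
    replicate(log, [dr, test])
    print(dr.checkpoint, test.checkpoint)    # 4 4
```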
Hardware /Software Infrastructure to support the DR Replication
Item
- 2 Servers $22,800
- VMware software $3,500
- VMware installation and support (Jump Start) $9,000
- Double-Take software $18,000
- Double-Take installation and support $7,000
- Symantec Backup Exec for VMware $1,100
- Labor (Estimate for CDW) $32,200
- Sub-total $93,600
- Other Vendors pricing
- Golden Gate Software $93,100
- Golden Gate installation $9,000
- Microsoft software (1 Std Server) $700
- Red Hat Software $1,300
- Sub-total $104,100
- Total $197,700
The table above reflects the total cost of the DR Replication solution
- 2 Servers – One Oracle/Linux database server and one virtual server, which will have all the other services loaded on it
- VMware – The software that is loaded on the server to make it operate as a virtual server
- VMware installation and support – The cost to install the software plus support of VMware in year one (estimate)
- Double-Take software – Software that replicates all IFS services other than the Oracle database
- Double-Take installation – Support and installation costs in year one
- Symantec backup – Software to back up VMware servers
- Labor – Consultancy project costs associated with delivering the overall solution
- Goldengate – Software that replicates Oracle databases. Includes year one support costs of $17,000
- Goldengate installation - Costs to install and test Goldengate (Estimate)
- Microsoft Software – Operating software loaded on Virtual server
- Red Hat – Operating software for Linux Database server
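The sub-totals and total in the cost table above can be checked with a few lines of arithmetic. The sketch below copies the line items exactly as listed; the grouping names are labels chosen for the example.

```python
# Line items as listed in the cost table above (US dollars).
FIRST_GROUP_ITEMS = {
    "2 Servers": 22_800,
    "VMware software": 3_500,
    "VMware installation and support (Jump Start)": 9_000,
    "Double-Take software": 18_000,
    "Double-Take installation and support": 7_000,
    "Symantec Backup Exec for VMware": 1_100,
    "Labor (estimate for CDW)": 32_200,
}
OTHER_VENDOR_ITEMS = {
    "Golden Gate software": 93_100,
    "Golden Gate installation": 9_000,
    "Microsoft software (1 Std Server)": 700,
    "Red Hat software": 1_300,
}


def subtotal(items):
    """Sum the dollar amounts of one group of line items."""
    return sum(items.values())


if __name__ == "__main__":
    first = subtotal(FIRST_GROUP_ITEMS)
    other = subtotal(OTHER_VENDOR_ITEMS)
    print(f"Sub-total (first group):   ${first:,}")          # $93,600
    print(f"Sub-total (other vendors): ${other:,}")          # $104,100
    print(f"Total:                     ${first + other:,}")  # $197,700
```

The computed figures agree with the sub-totals and the total shown in the table.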
Implementation Steps & Project Implementation Timeline
Below is a list of DR Technical Installation Steps. This is a high level overview of the proposed solutions.
1. The selected vendor will help us set up the initial VMware vSphere 4 ESX hardware and software. (VMware jump start?)
2. Double-Take will help us set up the Physical (Production) to Virtual (DR) replication of the Windows servers. The virtual servers start out as basic Windows 2003 Std servers with nothing installed. The Double-Take software has a small-footprint client on each production server that sends changes (service packs and hotfixes included). The virtual servers will remain offline and a Double-Take management server (also virtual) will collect the changes from the production servers. It takes approximately 30 minutes for the virtual servers to boot up with all the changes applied. These servers will then take over the hostnames and responsibilities of the production servers. This Double-Take feature is called VRA (Virtual Recovery Assistant). When it is time to shift back to the production servers we will use a feature called FSFO (Full Server Fail Over). We rename the production servers and enable FSFO to acquire any changes made on the virtual servers while they were online. Once the production servers are up to date, we reboot them and they reassume their original names and responsibilities.
Note: there must be drive space available on the production server to accept the FSFO information which includes all System State configurations, Program Files directory and Windows Directory.
3. Two IPs are assigned to each server: one for the Primary site (CHLM) and one for the DR site (CVG). The NICs with the local IP will be functional since they are in the proper IP range.
4. One Citrix server will be virtualized using VMA. The number of connections to DR will be limited both by Citrix license limitations and by bandwidth.
5. WNA will stand up a new Red Hat Linux server with Oracle. This will be a physical, not virtual, server. Alternatively, we can leverage Double-Take to do a P2P (Physical to Physical) server build. This option will save us from having to update the DR Linux server whenever changes are applied in production. Oracle replication can be excluded from Double-Take since Golden Gate will handle Oracle traffic.
6. Oracle Golden Gate will help us set up the initial Oracle replication. This is easiest if a copy of the full database is already in place.
7. We will test at Chelmsford. The ability to test is built into VMA. Testing does not affect the production environment. Virtual servers go back into standby mode once the test is complete.
8. Once the testing is complete we anticipate moving the DR environment to Covington, KY unless an alternate location is selected.
The project will take 90 days from approval of Cap-ex. The first 60 days will cover ordering and installing the products and services and training on them. The remaining 30 days will be used to test, troubleshoot and tweak the solution to the point where it is viable.
Risks and ‘Plan B’ options
- No risk to current ability to operate IFS or Back up IFS will result from implementation of solution
- Virtualization of servers. This is a ‘solid state’ solution, but there is no current internal knowledge of virtualization. Good training and documentation are required. Annual support is covered in the quote
- Goldengate and IFS. Goldengate is a premium product owned by Oracle and the only product Oracle certifies to replicate Oracle databases for Oracle Standard Edition licensed installations. Goldengate has no reference sites for IFS/Oracle replication but will provide references for other ERP/Oracle replication implementations.
- Oracle replication speed. The Masergy network was chosen precisely because it allows acceleration of application transactions specific to a single server or set of servers.
- Change of IFS version or Oracle version. The Goldengate software is version agnostic and will not require change of function if versions change
Hosting the DR environment outside of WNA
Part 2: Proposed Solution
Introduction
The first part of the article above outlined, for the potential vendors, the background and objectives of a remote, in-line DR solution. In essence, the choices boiled down to three options.
- 1. Move the current DR environment to another WNA Site.
- 2. Move the current DR environment to a hosted off-site location.
- 3. Move the current DR capability to a cloud-based solution.
After analysis it was determined that:
Option 1 would require a level of investment at one of the sites to bring it up to a standard that met the DR criteria. This would have involved enhanced infrastructure, including cabling, circuits, routers, VLANs, etc. It also required freeing up existing resources to manage the environment as necessary. Although it met one major objective of not being located close to Chelmsford, it still suffered from the disadvantage that the sites all route data through Chelmsford; even though alternative network routes could be devised, they would make the general running of the network more complicated. Therefore the cost advantages were not as great as they first seemed, and the option fell short of the technical objectives.
Option 2 is a technically feasible solution but more costly, as hosted data centers will charge a premium to maintain equipment not owned by them, because it is often unfamiliar, more time consuming to maintain and takes up floor space. In our previous experience, having our equipment hosted by a third party (Cincinnati Bell) was not successful, as neither WNA nor Cincinnati Bell had enough knowledge of the total environment (network, software, application) to optimize the service.
With Option 3, WNA deploys its solution on a set of virtually configured servers and infrastructure secure from but alongside those of other companies. The host manages the environment – servers and OS but WNA is responsible for the application layer and above. This model can be loosely termed Infrastructure as a Service (IAAS). In this way the host can spread the cost of maintaining across many customers. Also, because the host is working on their own infrastructure the cost of maintenance is lower, which reduces the cost to the end customer.
Option 3 has become the preferred solution for WNA.
Selection Process
Requests were sent to a number of vendors along with the original DR scope document. Very soon this number was reduced to two potential vendors, with which detailed discussions were undertaken. The two organizations were Logicalis and Onx. A little further on in the process, Onx requested a payment of $12,000 to undertake an in-depth assessment of requirements, which would be refunded if WNA subsequently agreed to sign a long-term contract with them. It was decided to move ahead with working on a detailed plan with Logicalis and only revisit the Onx recommendation if the Logicalis solution proved to be unviable.
Recommendation
At the end of the assessment Logicalis provided a costed solution. The recommendation was based on the IAAS model, where WNA would load its virtual server applications to the cloud and use the Double-Take replication tool to ensure both OS environments were in sync. Logicalis gave us the option of locating the virtual environment in one of their hosted sites in Phoenix or West Chester, OH. The location is not really a factor in the decision unless the cost of running a circuit to the hosting site is prohibitively expensive, which is unlikely. Nevertheless, WNA chose West Chester, OH given its proximity to Cincinnati.
However, there was one deviation from the IAAS model as it relates to the IFS Production database itself. There is still not enough cumulative knowledge in the industry of running an Oracle database in a virtual environment. Tackling Oracle directly over this was no help, as they can only ‘guarantee’ success if we use the Oracle version of Linux (we do), the Oracle-recommended database replication tool (Goldengate; we do) and the Oracle virtualization software environment called Oracle VM (we do not; we use the industry-standard VMware). Consequently, we have to fall back to using a physical server at the hosted site, which adds cost to the project but at least mirrors exactly the current DR environment that we have in Chelmsford. This means that:
- The other IFS servers (everything except the Oracle database server) will be virtually replicated using Double-Take
- The IFS Production database server (Oracle) will be physically replicated using Goldengate
Once it has been determined where and how to host the DR environment, the next step is to set up connectivity between the IFS environment and the hosted location. There are really only two options available: either extend the WNAGlobal network to the hosted data center or extend the hosted data center network to the Chelmsford data center. The latter would almost certainly result in added infrastructure cost at Chelmsford and would pose security complexities associated with dual circuits that we don’t have the experience to manage. Extending the WNAGlobal network provides greater visibility of the hosted data center and the network. The downside to this occurs if the network provider (Masergy MPLS) does not already have a circuit into the hosted site. In our case this was not a concern, as Masergy already has a partner which has access to the proposed hosting site.
While estimating the cost of a new circuit into the hosted site, we took the opportunity to re-cost the whole group of services that we contract with Masergy for. The focus was on reducing the cost of the T1 circuits at the sites, replacing the DSL circuit at Dorval with a T1 circuit and providing an alternative way to transport network traffic destined for the internet. Masergy were able to recommend a new topology that added greater bandwidth and new circuits at an overall cost only slightly higher per month than WNA pays today.
Implementation Process
Upon approval of the implementation methodology and costs there will be a two-month transition towards a fully-live hosted environment. The milestones would include:
- Provisioning a new circuit
- Purchasing a new database server
- Setting up and maintaining the hosted environment
- Extracting the data and moving it to the hosted data center
- Testing
- Go Live
Disposition of existing DR Environment
The existing DR environment will remain live until the hosted solution has been shown to work; the servers will then be repurposed for the virtual data center initiative at the Chelmsford Data Center.
Costings
Comparison of Costs
Masergy Current and Proposed Costs
On a like-for-like basis the Masergy quote shows a reduction in overall cost on a monthly basis, yet it includes an increase in bandwidth from 6 Mbps to 42 Mbps. The Covington cost is reduced by the elimination of the firewall, which is replaced by the virtual firewall included in the DR costs. The Polar Warehouse cost is currently made up of a 2 Mbps virtual port from Masergy and a 512 kbps DSL line from a Canadian carrier. This will be replaced by a Masergy T1 circuit, which until recently had been too costly an undertaking, albeit the best technical solution. The DR circuit cost includes a 10 Mbps Ethernet circuit pegged at 2 Mbps unless needed, and a virtual firewall. Even though the virtual firewall is shown under the DR costs, it will be used by all sites except Chelmsford for all site traffic that goes to the internet. This should balance the load much more evenly and avoid internet bottlenecks at Chelmsford.
Physical Server Cost
As discussed above, this was a hitherto unbudgeted cost, as initially we believed that we could use Oracle in a virtual environment. As that has proved not to be the case, a server comparable to the live Database server needs to be procured. The quoted cost of the server is $8,736.63.
Logicalis Hosted Service
Logicalis service costs can be broken into three elements:
- Project Set Up Costs: $15,950 (fixed maximum cost)
- Monthly operational costs $2,678
- Monthly “Declaration of Disaster” Costs $954 plus $3,118 per month (For every month that WNA were declaring a disaster)
The third element reflects the extra work and maintenance incurred when WNA declare an emergency and switch internal and internet traffic through the hosted site. The contracts for both the renewal of the Masergy network service and Logicalis DR service run for 36 months.
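The three Logicalis cost elements can be combined into a rough figure for the 36-month term. The sketch below rests on stated assumptions: the set-up cost is taken at its fixed maximum, and the "Declaration of Disaster" line is read as a $954 charge per declaration plus $3,118 for each month spent in a declared disaster. That reading is an interpretation, not something the quote confirms. Masergy charges and the physical server are excluded.

```python
SETUP_COST = 15_950          # fixed maximum project set-up cost
MONTHLY_OPERATIONAL = 2_678  # monthly operational cost
CONTRACT_MONTHS = 36         # contract term stated above
DISASTER_DECLARATION = 954   # assumed to be charged once per declared disaster
DISASTER_MONTHLY = 3_118     # charged for every month in a declared disaster


def logicalis_contract_cost(disaster_months=0, declarations=None):
    """Estimate the Logicalis cost over the full contract term, in US dollars."""
    if declarations is None:
        # Assume a single declaration whenever any disaster months are incurred.
        declarations = 1 if disaster_months > 0 else 0
    base = SETUP_COST + MONTHLY_OPERATIONAL * CONTRACT_MONTHS
    disaster = DISASTER_DECLARATION * declarations + DISASTER_MONTHLY * disaster_months
    return base + disaster


if __name__ == "__main__":
    print(f"No disaster declared:  ${logicalis_contract_cost():,}")                   # $112,358
    print(f"One one-month disaster: ${logicalis_contract_cost(disaster_months=1):,}")  # $116,430
```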
Timeline: 2012
