Taos, Inc.

Interview with Bob Quinn and Tim Campos of KLA-Tencor

Recently our COO, Coco Brown, sat down with CIO Bob Quinn and Senior Director of Enterprise Services Tim Campos, both of KLA-Tencor, to discuss Disaster Recovery. With offices all over the world and major facilities in the US, Europe, and Asia, KLA-Tencor’s main Disaster Recovery concern is the intention to standardize its enterprise operations around SAP.

 

Coco: Tell us about IT at KLA-Tencor.

Bob: We have a pretty classic organizational structure divided into two main areas. What we call enterprise services, which Tim Campos runs, is our infrastructure and support services. That includes network services, client services, technical services and engineering services. The other side of the house we call application development and support; that’s our enterprise applications and architecture development and support services. The IT organization is roughly one hundred and fifty people strong. We manage about 60 terabytes of data for the company. We are an SAP shop, having just migrated from Oracle.

Coco: What were the important factors you knew you needed to consider in designing your DR plan?

Bob: Foremost was speed to recover, to allow the company to meet the minimal requirements to run the business. For the most critical business processes that we were thinking about in this phase of our DR planning, we knew the business could not afford to be down for more than a matter of days at most. As such, we felt proximity to the Bay Area was key. Our previous solution included a cold site in New York, but we wanted as close to real-time replication of our key business-critical applications as possible, which meant reasonable proximity for fiber connectivity between the datacenters and the ability to get our people physically located somewhere reasonable in the event of a major disaster.

Second, we wanted to make sure that the applications that were covered in a DR plan were sufficient to run the operations of the business. In our previous plan, all the right applications were covered at a high level, but as we dug into it we discovered we didn’t have all of the right pieces of those applications. This time we wanted to ensure that we didn’t just identify the applications, but rather the critical business processes and all the points of interface and systems that support those. In this way, nothing essential was missed.

Coco: Prior to doing this, did you go through a formal Risk Assessment?

Tim: A risk assessment had been done for KLA-Tencor a year prior to implementing this DR plan. That risk assessment essentially said that the DR plan we had was adequate in what it covered, but that it was missing a number of critical application services. It basically said that the five applications we had covered in our DR plan represented roughly a third of what we needed to have. We were missing two thirds of the supporting applications necessary to run the business. It also indicated that we should include a number of other applications which were not as critical to the business. The risk report indicated to us that 1) we needed to get the business involved in defining the critical business processes and the applications and systems necessary to support those, and 2) we needed to design a DR plan that would allow for a phased approach, so that we could offer more services in the future, as funding made it practical, for the things we set aside right now.

Coco: Why was it a year between the risk assessment and the actual implementation of your DR plan?

Bob: The risk assessment was driven by our safety organization and was part of something that was necessary for insurance, and the ERP implementation was taking place at the same time. There were so many things up in the air that it did not make sense to make a change to the DR solution until the systems architecture was in place. In addition, our former DR contract did not expire until March of this year. We needed to switch partners because the former one was in New York and therefore unable to meet our proximity requirements.

Coco: How did you do the analysis to determine which applications were essential to your critical business processes?

Bob: First we created three scenarios for speed of recovery; the fastest costing the most. Generally when you ask a business manager “what do you want” you hear “this thing can never go down”. But as soon as you associate a cost with that they start focusing on what they really need.

To get to what we really needed and at what levels of recovery, we leveraged the steering committee that had been put forth for the ERP implementation. Together we went through the prioritization process. In the first phase we drew the line fairly high – SAP and the applications around it, and then of course email. From there we decided to get the DR site up and running and then be able to add other systems over time.

As we were going through the decisions we were tallying the costs of each of the individual decisions and eventually it came down to an overall stack ranking and cost/risk analysis. At this point we went back through and asked ourselves if our decisions on tiering made sense; was there consistency or had we created a cornucopia of confusion and chaos where we were not leveraging common approaches for applications dependent on one another.

Coco: And were there points where critical data assets or applications were on the border of being critical but from a financial standpoint, or some other decision factor, you decided to put them on the side of not critical?

Bob: Yes. For example, we had an application that showed up as the number one priority in the business impact analysis that we had done a year prior to this DR implementation, but it wasn’t a transactional system. It is a document management system; an archival system. It turned out that the way the business had prioritized things was really heavily focused on product development, which is the side that is probably the most tolerant of an interruption of services.

When we looked at what we needed to include in the DR list, the transactional systems, the things that support selling, manufacturing, shipping and servicing, are the things that start to materially affect your financial results if you lose them for an hour, a day, a week. Some of those had been left off the prior DR plan.

Tim: One really has to think through all the interdependencies. For example, you can’t have SAP and Clarify together without an interface, so we needed to have the system that was doing the application integration (AI) to make the solution work. We ended up having to make tweaks in both directions, pulling things off and putting things on, once we finalized the list of applications.
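The interdependency check Tim describes can be pictured as a graph walk: start from the applications the business calls critical and pull in everything they depend on. The sketch below is illustrative only. SAP, Clarify and email are named in the interview; the other system names and all of the dependency edges are hypothetical placeholders, not KLA-Tencor's actual architecture.

```python
from collections import deque

# Hypothetical dependency map: each application lists the systems it needs.
# Only SAP, Clarify and email come from the interview; the rest are placeholders.
DEPENDS_ON = {
    "SAP": ["integration-layer"],
    "Clarify": ["integration-layer"],
    "integration-layer": ["directory-service"],
    "email": ["directory-service"],
    "document-archive": [],
    "directory-service": [],
}


def required_systems(critical, depends_on):
    """Return the critical applications plus everything they depend on,
    directly or indirectly (breadth-first traversal)."""
    needed = set(critical)
    queue = deque(critical)
    while queue:
        app = queue.popleft()
        for dep in depends_on.get(app, []):
            if dep not in needed:
                needed.add(dep)
                queue.append(dep)
    return needed


if __name__ == "__main__":
    for name in sorted(required_systems(["SAP", "Clarify"], DEPENDS_ON)):
        print(name)
```

Anything in the result that is missing from the DR list is a gap of the kind the earlier plan had: the application was covered, but not all of the pieces it needs.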

Coco: Did executive management require a quantitative business impact analysis in this process?

Bob: Yes, very much so. It was predominantly a risk assessment of what would be the impact of a failure or destruction of the datacenter locally, how long we could afford to be down and what the cost impact to the business would be relative to what it would cost us through various scenarios of recovery.
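One simple way to frame the comparison Bob describes is to add the yearly cost of keeping a recovery option ready to the expected loss from downtime under that option. The sketch below assumes that model. Every figure in it (costs, recovery times, loss per day, disaster probability) is invented for illustration and none comes from KLA-Tencor.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    annual_cost: float    # yearly cost of keeping this option ready (hypothetical)
    recovery_days: float  # expected time to restore service (hypothetical)


def expected_annual_cost(scenario, loss_per_day, disaster_probability):
    """Standing cost plus the probability-weighted cost of being down."""
    downtime_loss = scenario.recovery_days * loss_per_day
    return scenario.annual_cost + disaster_probability * downtime_loss


# All numbers below are illustrative assumptions.
SCENARIOS = [
    Scenario("cold site", 100_000, 21),
    Scenario("warm site", 400_000, 3),
    Scenario("replicated site", 1_000_000, 0.1),
]


def cheapest(scenarios, loss_per_day, disaster_probability):
    """Pick the scenario with the lowest expected annual cost."""
    return min(
        scenarios,
        key=lambda s: expected_annual_cost(s, loss_per_day, disaster_probability),
    )


if __name__ == "__main__":
    best = cheapest(SCENARIOS, loss_per_day=2_000_000, disaster_probability=0.05)
    print(best.name)
```

Running the same comparison per application, with that application's own loss per day, is what lets a fast and expensive option be justified for transactional systems and not for archival ones.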

Coco: What were the various recovery scenarios you settled on?

Tim: We created three scenarios to allow us to bring the business up in a choreographed manner based on the priority of the systems.

For the least critical applications we could use a cold site: a standby network with the need to acquire and configure servers to get it running. We recover data and applications from tape. In this scenario recovery could be over a period of weeks.

For secondary applications needed by the most critical applications to support business processes, we could use a warm site with system capabilities in place and only the need to configure the application. In this scenario it takes a matter of a few days to recover the application.

Finally, for the most critical applications, we determined real-time or close to real-time volume replication was needed.
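The three scenarios can be read as a lookup: given how long the business can tolerate an application being down, choose the least expensive tier that still recovers in time. The sketch below assumes that reading. The tier names follow the interview, but the recovery times in days are hypothetical stand-ins for "weeks", "a few days" and "close to real-time".

```python
# Tiers ordered cheapest first. Recovery times in days are assumptions that
# stand in for the interview's "weeks", "a few days" and "close to real-time".
TIERS = [
    ("cold site", 30.0),
    ("warm site", 5.0),
    ("replicated site", 0.5),
]


def cheapest_tier(max_downtime_days):
    """Return the first (cheapest) tier whose recovery time fits the tolerance."""
    for name, recovery_days in TIERS:
        if recovery_days <= max_downtime_days:
            return name
    raise ValueError("no tier recovers within the requested downtime")


if __name__ == "__main__":
    # Hypothetical tolerances for three kinds of application.
    for app, tolerance in [("archive", 60), ("supporting app", 7), ("ERP", 1)]:
        print(app, "->", cheapest_tier(tolerance))
```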

In the end, the general architecture we settled on for the disaster recovery site is a live datacenter within a hundred miles of KLA-Tencor. This allows us to meet the real-time replication requirement for critical applications. Also, it is within driving distance but is not susceptible to the same seismic interruption, is not on the same power grid, and is not in a flood zone. We wanted to be able to take care of all but the most extreme disaster scenarios.

Coco: What about the geographic dispersion of the KLA-Tencor offices? How does that fit into your DR strategy?

Tim: One of the things that we have done, complementary to our DR strategy, is to go to regional hubs for our office support in our regional sites. The intention is to provide disaster recovery capability at hubs and between hubs, so that in the case of a disaster we can move our office support systems from the disaster area to a regional hub or, if a hub is damaged, from that hub to another.

Coco: How is DR being handled for telephony and call center?

Tim: The call centers have their own DR capability with AT&T. In addition, they are not in the same facility as our information systems. If they become isolated from the information systems or they go offline themselves, the calls can be routed to another location through AT&T’s telephone network. What they require however is availability of the service delivery platform. They need Clarify in order to function and that’s part of the DR plan.

Coco: What technical hurdles did you have to overcome in putting your DR plan in place?

Tim: First was the design of the network. We had a significant architectural debate on the approaches for this. There were several options proposed.

1) One option was purely DNS based meaning that in order to activate the DR site we just needed to make a DNS entry change. That was ruled out because of some concern about the effectiveness of DNS propagation.

2) A second option was to have the same network address space in the DR site as our production datacenter. This had the convenience of avoiding all the DNS problems but it was extremely risky in testing. If you were to test this and you did something wrong you could accidentally activate the DR systems as your production systems and cause a significant outage for the company.

We went back and forth on these two options for quite some time, trying to figure out what would be the right approach for KLA-Tencor given the facts that:

  • This datacenter is always on
  • It needs to be accessed for administrative purposes
  • We don’t want to have an involved procedure to activate it

Eventually we went back to a hybrid of the DNS model. In retrospect that was the right choice given some of the challenges that we had in testing the DR site.
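The propagation concern with a purely DNS-based switch comes from caching: resolvers keep handing out an address until its time-to-live (TTL) expires, so a change at the authoritative source is not seen immediately. The toy model below illustrates only that effect. It is not a description of the hybrid design KLA-Tencor chose, which the interview does not detail. The hostname, addresses and TTL are hypothetical.

```python
class CachingResolver:
    """Toy resolver that caches answers for a fixed TTL (in seconds)."""

    def __init__(self, zone, ttl_seconds):
        self.zone = zone    # dict of name -> address, stands in for authoritative DNS
        self.ttl = ttl_seconds
        self.cache = {}     # name -> (address, expiry time)

    def resolve(self, name, now):
        entry = self.cache.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]  # cached answer is still valid
        address = self.zone[name]  # cache miss or expired: ask the source
        self.cache[name] = (address, now + self.ttl)
        return address


# Hypothetical name and documentation-range addresses.
zone = {"erp.example.com": "192.0.2.10"}  # production datacenter
resolver = CachingResolver(zone, ttl_seconds=3600)

before = resolver.resolve("erp.example.com", now=0)
zone["erp.example.com"] = "198.51.100.10"  # DR site activated
during = resolver.resolve("erp.example.com", now=600)
after = resolver.resolve("erp.example.com", now=3600)

print(before, during, after)
```

In this model a client behind the cache keeps reaching the production address for up to one TTL after the switch, which is why a short TTL (or some mechanism beyond DNS) matters when recovery time is measured in minutes.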

The second most significant challenge was getting the replication set up. This was something that had not been that difficult to implement in the lab but implementing over a wide area network turned out to be much more difficult than we thought. We really came down to the wire to get the replication working and functional by our target go live date for the DR site.

Coco: How often do you plan to test and reevaluate your strategy?

Tim: We had been testing every six months and that’s what we have chosen to stick with for the time being. However, once we get the test procedures formalized sufficiently, we think the impact of the test can be low enough that we can increase it to a quarterly test.

Coco: How does your DR strategy fit into an overall business continuity plan?

Bob: The company had to develop a business continuity plan in order to implement SAP. We had to shut down our systems for a week in order to make that work. It was a great opportunity for business managers to be forced to think about how they would do things in the event that the systems are offline for an extended period of time. So in the implementation of the ERP environment we think that we have developed the procedures necessary for an overall business continuity plan within the ERP environment.

Coco: In that sense is IT paving the way for the business as a whole to start thinking about business continuity in other areas of the business?

Bob: We have a department of safety management, which reports into our legal counsel. They look at business continuity on a global level for the company. In addition, individual organizations have business continuity plans and each of those has looked to us to support those plans. In many cases they focused on how they would move to alternate facilities for manufacturing, call centers and so on. Each organization is required to consider its business continuity, and in doing that each of them looks to IT to support it with commensurate plans for recovering their systems. The one difference in the DR plan we’ve been talking about with you is that we’re addressing applications that are company-wide; of an enterprise nature.

Coco: If you were going through the DR planning process again, is there anything that you would do differently?

Bob: We made the right choices in terms of where we drew the line on this phase. The project actually went relatively well. It was difficult to do at the same time as the ERP implementation, but at the same time a lot of things aligned to make this work very well: the previous contract terminating and the change in the systems architecture created the opportunity to really get the right DR plan in place.

Coco: Do you have any advice for others going through this process?

Tim: It is definitely important to have a business impact analysis done prior to implementing a DR program so that you can prioritize what to focus on. Disaster recovery is, in a sense, insurance and like all insurance you don’t want to spend so much money on the insurance that you bankrupt yourself. The way to avoid that is to make sure you’re prioritizing your investments appropriately and you need to have an understanding from the business of what that actually means from a service perspective.

Bob: Make sure your focus is the business, not technology. What we saw in the previous DR plan was that it was designed from the technical standpoint of what IT could do for the money they had. It did not adequately consider whether the result ended up in a viable business solution to manage the right level of risk.

That is the trap most IT organizations run into. They know they need to do something and they start out with a budget and then figure out what they can do with it from a technical perspective. The approach we took was to have a conversation with the business folks and let them make the decision. In the end, budget is a big factor but it isn’t the leading one. It drives discussion around trade-offs and where to draw the line, but you do what you have to in order to protect your core business processes.

Tim: A final thing I would advise, that I think we did very well, was to align the DR program strategically with other things that we had to do for IT. Implementing this solution not only improved the capability of our DR system, it:

  • gave us a foothold in a datacenter that will ultimately enable us to reduce our investment in our datacenter here, which is a cost,
  • established a relationship with a managed service provider which enabled us to leverage an outsource relationship to supplement our IT staff,
  • helped us mitigate risk with Sarbanes Oxley controls,
  • helped us reduce insurance costs, and
  • helped us evolve our network architecture by having a presence outside the immediate Bay Area. This lowers our cost of operations globally for network access by enabling us to get cheaper access to tier one network providers directly from the datacenters as opposed to having to run additional lines to our facilities here.

So, there were many peripheral benefits that we were able to derive from implementing the DR solution.


Robert Quinn is Vice President and Chief Information Officer at KLA-Tencor, where he oversees IT Operations, Business Applications, Support Services and Engineering IT Services. Mr. Quinn’s background includes 25 years of IT, Technical Operations and Engineering experience. Prior to KLA-Tencor, Mr. Quinn managed Business Operations at Applications Services Provider Portera Systems, which provided hosted applications services to Professional Services organizations. Prior to Portera, Mr. Quinn was CIO and VP of Site Operations at eBay, where he oversaw transition and architecture in managing eBay’s hyper growth immediately following its IPO. Mr. Quinn also managed IT organizations across Sun Microsystems over an 11-year period of the computer company’s growth from an Engineering Workstation player through the dot-com boom. Mr. Quinn holds degrees in Economics and Finance from Baldwin-Wallace College and is active on numerous technology groups and advisory boards.

Timothy Campos is the Senior Director of Enterprise Services at KLA-Tencor where he is responsible for all IT infrastructure and support services globally. Mr. Campos's background includes 13 years of industry experience in both Information Technology and Software Engineering roles with an emphasis in global application hosting and multimedia data management and distribution systems. Prior to KLA-Tencor he held management positions at Portera Systems and Silicon Graphics, as well as software engineering roles at both Silicon Graphics and Sybase. Mr. Campos holds a degree in Electrical Engineering and Computer Science from the University of California at Berkeley.

© 2004, Taos Mountain, Inc.