March Issue of the Taos Newsletter: Utility Computing
CIO Interview with Joe Sura
Vice President of I.T., nVIDIA Corporation
Taos: Tell me a little about your company and your organization.
Joe: nVIDIA is a fast-growing, fabless semiconductor company with revenues in the range of $2 billion and about 1,800 people. We have about 75 people in IT. About half of them are IT operations folks and the other half support Information Systems such as SAP, web development including intranets, extranets, business applications - those kinds of things. We have multiple computing environments for Engineering and corporate systems. We have a support team for the engineering users and one for the engineering compute farm. We also have another support team in IT Ops that is responsible for the systems administration of corporate applications and the NT servers in the company - over 600 NT servers are currently on site.
Taos: How do you make use of computing farms and distributed systems to help with chip design?
Joe: Well, basically we have this massive array - this organic array - of compute servers. For Engineering, we have 4,500 Linux CPU's, about 200 64-bit Linux CPU's, and about 600 Sun CPU's. We run a wide variety of EDA/CAD tools from Synopsys, Cadence, and other vendors. Engineers submit jobs to LSF (the queue manager) and those jobs are dispatched to any available node. We have several different queues established for the various types of jobs and timeframes required for each run. For example, some queues are established for jobs that take 4 hours or less, other queues are for jobs that take 12 hours or less, and so on. We have priorities established by project... reflecting what's most important for that particular month or week. Once we program this into LSF and have these rules established, then LSF will do submissions to whatever machine is open according to the rules we establish. On any given day, we are doing over 350,000 simulations on the Linux farm alone - that's not including the Sun farm. We're heading up to half a million simulations a day, so it's quite demanding. We're basically pushing the limits of the queue manager, servers, and disk subsystems - because of our incredibly high volumes. It's the nature of the work and our business. The more we can simulate, the better the product will be at tape-out, and the faster we launch quality products into the market.
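The queue setup Joe describes, runtime-limited queues plus per-project priorities, can be illustrated with a small sketch. This is not LSF configuration and does not use any LSF API; it is plain Python under stated assumptions. Only the 4-hour and 12-hour cutoffs come from the interview. The queue names (`short`, `medium`, `long`), the lower-rank-runs-first priority table, and the job, project, and node names are all hypothetical.

```python
import heapq

# Runtime limits in hours. The 4 h and 12 h cutoffs come from the interview;
# the queue names and the catch-all "long" queue are hypothetical.
QUEUE_LIMITS_HOURS = [("short", 4), ("medium", 12)]


def pick_queue(estimated_hours):
    """Return the first queue whose runtime limit covers the job."""
    for name, limit in QUEUE_LIMITS_HOURS:
        if estimated_hours <= limit:
            return name
    return "long"


def dispatch(jobs, project_priority, free_nodes):
    """Place jobs on free nodes, most important project first.

    jobs: list of (job_id, project) in submission order.
    project_priority: dict of project -> rank, where a lower rank runs first.
    Returns (assignments, waiting): a dict of job_id -> node, and the job ids
    still queued, in the order they would run next.
    """
    heap = []
    for seq, (job_id, project) in enumerate(jobs):
        # seq keeps submission order among jobs of equal priority.
        heapq.heappush(heap, (project_priority[project], seq, job_id))

    nodes = list(free_nodes)
    assignments = {}
    while heap and nodes:
        _, _, job_id = heapq.heappop(heap)
        assignments[job_id] = nodes.pop(0)

    waiting = [job_id for _, _, job_id in sorted(heap)]
    return assignments, waiting


if __name__ == "__main__":
    # Hypothetical jobs and projects, for illustration only.
    jobs = [("sim-1", "chip_b"), ("sim-2", "chip_a"), ("sim-3", "chip_a")]
    priority = {"chip_a": 0, "chip_b": 1}  # chip_a tapes out first
    print(pick_queue(3), pick_queue(10), pick_queue(30))
    print(dispatch(jobs, priority, ["node-1", "node-2"]))
```

Changing the weekly priorities then amounts to editing the priority table, while the dispatch logic stays the same.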
Taos: Why did you decide to use this type of compute farm?
Joe: We needed massive job capacity and scaling to support engineering. It took us a while to master this computing model, but now, we are applying a "cookie cutter" approach and scaling the farm upward each quarter. It all goes back to the pace of engineering. In today's environment, there is always a critical project in the pipeline and projects are closer together than ever before. Tape-outs are just weeks apart, so you always have some critical activity in progress. There's no clear space anymore. So, if you can imagine, these things are now much more compacted in time. If I take a slice in time and look at what our utilization is, it's very high on almost any day of the year. The challenge is to make sure the right projects get the right priority and support our engineering schedules. Even with the massive investment in computing resources, we still have thousands of jobs waiting in the queue at any point in time for resources.
Taos: How do you go about scheduling jobs and prioritizing which computing resources get used?
Joe: We work on a weekly basis with Engineering to define their project priorities. We constantly adjust the LSF queues and rule set to match those requirements. Projects that must tape-out quickly always get priority treatment. There are different types of machines applied to the different stages of design. Early stage work requires relatively short job durations, but there are tens of thousands of them. At the early stages, we are working at the "cell" level. As you traverse the design cycle, you begin working at the "full chip" level, which requires larger machines and much more memory. Full chip place and route, timing verifications, etc. take weeks to run. Speed is of the essence to us... we are always seeking ways to compress design cycle time.
Taos: What kind of systems do you run in your compute farm?
Joe: As I mentioned, we have about 4,500 32-bit Linux CPU's right now with different generations of Intel processors in them. In the 32-bit farm, we generally use P4's and Xeons. Each quarter, we install faster systems, faster bus architectures, and faster memory.
We're also growing our 64-bit Linux farm aggressively. Now that Synopsys and Cadence tools are ported to 64-bit Linux, we are migrating to new platforms quickly. Our mid-sized CAD jobs - that we were running exclusively on Solaris before - are now ported over to Itanium II's and Opterons. And we're getting very good results. Jobs run very, very fast and the platforms are very cost competitive.
We have walls of "1U" configurations and a lot of upright, blade-like systems. We are looking into true blade architectures, but so far it's mostly 1U's.
Taos: What percentage of your systems is down at any given time?
Joe: Anywhere from 1-4 percent. When you have this many machines, compacted this tightly together and you have this much heat - you're going to have some failures and downtime. That should be expected. Our challenge in these high-density "Grid" computing farms is to be notified quickly of failures, take the failed machines out of commission, get them repaired and back into action as fast as possible. It's a 24x7 problem. We keep some spares and parts on hand and do some repairs ourselves. We just rip them out of the racks and slam a new one in and keep going. Just push a button and it comes up and reappears in the queue. We always try for a hundred percent uptime of the grid and we get very close to it. Reliability depends on not only the equipment but also the cooling and power conditioning. We have a lot of compute power per cubic foot. We have 6 IT System Administrators supporting the entire Engineering farm.
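To put the 1-4 percent figure in concrete terms, here is a rough back-of-the-envelope calculation. The fleet size of 5,000 systems is an assumption borrowed from the system count Joe quotes elsewhere in this interview, not a number he gives for this question, and the helper name `systems_down` is hypothetical.

```python
def systems_down(fleet_size, low_pct=1, high_pct=4):
    """Return the (low, high) count of systems down at the given percentages."""
    return fleet_size * low_pct // 100, fleet_size * high_pct // 100


FLEET_SIZE = 5000  # assumption: approximate system count, not stated for this answer
ADMINS = 6         # from the interview

low, high = systems_down(FLEET_SIZE)
systems_per_admin = FLEET_SIZE // ADMINS

print(f"{low} to {high} systems down at any given time")
print(f"about {systems_per_admin} systems per administrator")
```

Under that assumption, each administrator covers several hundred machines, which is consistent with the swap-and-go repair approach described above.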
Taos: What are the issues with physically hosting that many servers?
Joe: Cooling and power mostly. In our two current datacenters, we ran out of air conditioning and power before we ran out of space. That's typical and that's one drawback in going very dense. We are now building a new, state-of-the-art Engineering data center in Santa Clara. Eventually, this facility will draw about 7 million watts of power... ~4 million watts to power it and ~3 million watts to cool it. We will have all the associated UPS systems, diesel generators, cooling towers, chillers, etc. required to support this new facility. We will start out at ~10,000 sq ft and will expand to ~15,000 sq ft over time in our new datacenter.
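The power figures above can be checked with a little arithmetic. The sketch below uses only the approximate numbers from the interview. It assumes, purely for illustration, that the full ~4 million watt compute load would be spread over the eventual ~15,000 sq ft footprint; the interview does not say how load and floor space will grow together. The function names are hypothetical.

```python
COMPUTE_WATTS = 4_000_000   # approximate, from the interview
COOLING_WATTS = 3_000_000   # approximate, from the interview
FINAL_FLOOR_SQ_FT = 15_000  # approximate, from the interview


def facility_power(compute_watts, cooling_watts):
    """Return (total watts, ratio of total power to compute power)."""
    total = compute_watts + cooling_watts
    return total, total / compute_watts


def watts_per_sq_ft(watts, floor_sq_ft):
    """Return the average power density over the given floor area."""
    return watts / floor_sq_ft


total, overhead_ratio = facility_power(COMPUTE_WATTS, COOLING_WATTS)
density = watts_per_sq_ft(COMPUTE_WATTS, FINAL_FLOOR_SQ_FT)

print(f"total draw: {total:,} W")
print(f"total-to-compute ratio: {overhead_ratio:.2f}")
print(f"compute density (assumed full build-out): {density:.0f} W per sq ft")
```

In other words, every watt delivered to the servers needs roughly another three quarters of a watt for cooling, which is why cooling and power run out before floor space does.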
Taos: What are some of the most difficult issues of running this compute farm?
Joe: Applying patches and distributing them across thousands of systems. This is a constant effort. We do this a rack at a time - taking CPU's out of the queues, flushing them, applying patches, and bringing them back up. We run on a commercial version of Linux (Red Hat 7.2). There are some tools that have performance issues with certain kernel configurations. My team does kernel patches - applies them across thousands of systems without a lot of disruption. It's one of the biggest challenges here.
The other is just learning how to master this ultra-dense configuration. Some of the tools don't scale well, and some are much more refined now than they were before. In the past, there were equipment issues we ran into. For example: when do you reach the maximum capacity of a filer head? You discover limitations when you slam 50,000 jobs onto a piece of equipment at once. We found out a lot of these things ourselves and are working with vendors to increase capacity on equipment all the time. Generally, the equipment we use is much better now than it was two years ago. We're able to keep pace with engineering without equipment becoming an issue.
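The rack-at-a-time procedure Joe outlines (take nodes out of the queues, flush them, patch, bring them back) is a rolling update. The sketch below shows the shape of that loop in plain Python. It is not nVIDIA's tooling and calls no real LSF or Red Hat commands: `rolling_patch`, the `apply_patch` callback, and the rack and node names are hypothetical stand-ins.

```python
def rolling_patch(racks, apply_patch):
    """Patch a farm one rack at a time.

    racks: dict of rack name -> list of node names.
    apply_patch: callable(node) -> bool, True if the node patched cleanly.
    Returns (in_service, needs_repair) as sorted lists of node names.
    """
    in_service = {node for nodes in racks.values() for node in nodes}
    needs_repair = []

    for nodes in racks.values():
        # 1. Close the rack's nodes so the scheduler sends them no new jobs.
        in_service.difference_update(nodes)

        # 2. A real farm would wait here for running jobs to drain
        #    (the "flush" step); this sketch has no running jobs to wait on.

        # 3. Patch each node; only cleanly patched nodes rejoin the queues.
        for node in nodes:
            if apply_patch(node):
                in_service.add(node)
            else:
                needs_repair.append(node)

    return sorted(in_service), sorted(needs_repair)


if __name__ == "__main__":
    # Hypothetical two-rack farm where one node fails to patch.
    farm = {"rack-1": ["n1", "n2"], "rack-2": ["n3"]}
    print(rolling_patch(farm, lambda node: node != "n2"))
```

Because only one rack is closed at any moment, the rest of the farm keeps accepting jobs while patching proceeds.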
We have close to 400 terabytes of storage in the facility. That was another challenge that we experienced - getting storage that was fast enough to keep pace with thousands of CPU's doing submissions all at once. A lot of our equipment is Network Appliance Filers - the latest generation - 980's, 960's - with very high-density storage arrays on each one. We also use EMC, IBM and Hitachi equipment.
Taos: How were these demands satisfied before you built these compute farms?
Joe: Hand submissions of jobs, far fewer jobs per day, and a much smaller compute farm. We have been at this for almost 6 years now. We now have a scalable model and have always tried to stay with, or slightly ahead of, engineering demands. We can now scale our environment up to half a million jobs, or even a million jobs a day. The basic model is there. We will continue to grow our environment. Equally important, we are always looking into our Engineering methodology itself to see if there are ways to achieve the same results but run fewer jobs. This is another area of huge potential for us. Working smarter... and doing less.
Taos: How did you manage these computing resources before you created your compute farm?
Joe: In the past, it was manual submission. In other words, you hand-submitted jobs onto a particular computer and waited for results. Now, you just submit a job to a queue manager and it does its thing and returns the results to the originator. In the past, it was mostly Sun equipment, very little Linux. Now it's completely flipped over - mostly Linux and some Suns. The Suns are dedicated to the very largest jobs, which have some of the longest run times at the later part of the design cycle - where you need large memory configurations. 64-bit Itaniums and Opterons will offer us additional options going forward. We've actually been at this for at least 6 years. We were early adopters of Linux... and early users of GRD and LSF as early queue managers.
Taos: Do you have any special hardware or software vendor relationships?
Joe: I'm not sure it's special. We have preferred vendors in certain areas, but we're not married to any one of them. If we see a new technology out there that has clear advantages, we adopt it. But it has to be proven in a high volume environment. A lot of times, we'll review stuff that sounds good on paper and then we'll immediately blow it up when we put it in production in this environment. So you have to be careful that the equipment is scalable and that it is reliable and that you get good customer support.
Taos: With so much redundancy, does that allow you to have lower grade maintenance contracts on hardware?
Joe: In some cases, yes. In some cases we have spares on site. In some cases, we do our own maintenance. In others we have very rapid one-hour or four-hour turn-around cycles for certain critical equipment, like our Cisco routers. Some things we just do ourselves. For instance, I have no service contracts on any of my 32-bit Linux machines. We just do all the maintenance ourselves. Currently, there are 5,000 systems with no maintenance contract.
Taos: What new technologies do you see on the horizon to help with your compute farms?
Joe: Opterons and Itanium II's. This is something we're moving into pretty quickly. The servers are getting much faster - the bus speeds, faster cache, more memory. Basically, a lot of our requirements are for jobs that keep growing in size - needing more and more memory on board on the systems. It's not only the speed of the CPU's but how much memory we can fit on a system.
Taos: Are software vendors working on features to help maintain such a large computing environment?
Joe: Yes, they are. I think it's a learning exercise for everybody involved - for the computer manufacturers, for the LSF, GRD companies - doing queue management on a massive scale. We're probably the biggest implementation of LSF in the world, but it's not enough. We're going to be doubling and tripling this environment in the years to come, so the tools need to keep up.
Joe Sura has been Vice President of I.T. for nVIDIA since July 1998. Prior to joining nVIDIA, Mr. Sura was Director of Information Systems for Cidco Inc, a telecommunications equipment supplier. Mr. Sura has held a variety of senior IS positions at companies including Sun Microsystems, Silicon Graphics, HaL Computers and Hewlett Packard. Mr. Sura holds a B.S.C.S. from Florida State University.