By James Sparenberg – Unix/DevOps Consultant
My son plays baseball. He’s a catcher. As a catcher, he is charged with seeing the entire field of play and calling actions out to his teammates based on his predictions of what is about to happen. He’s told me before, “Dad, I have a choice: I either catch the game they throw me, or I make them throw the game I want to catch.” If he waits until the runner on first has already broken for second before deciding what to do, he’ll never stop anyone from stealing. But if he learns the patterns a runner shows when they intend to steal, calls a pitchout, and has the shortstop move over to second at the right time, he looks like the hero and gets the runner out.
This same thing applies to you and me in the world of a modern IT department. Let’s take a database as an example. I’ve lost count of the times I’ve seen organizations desperately trying to cool off a burning database that is no longer supported by its maker (FOSS or closed source). Often that DB also sits on hardware that can only be repaired with parts ordered from eBay. The IT team spends every waking moment sweating over that DB. Any change needed to meet growing customer expectations, or required by InfoSec, is met with extreme resistance. Other systems end up stalled as well. All because, in the past, someone didn’t “fix” the DB before it started breaking.
Move forward now: you’ve finished the 6 to 12 months of hell trying to upgrade it. You have a shiny new OS with the latest version of MyNoSQLacle 2016 up and running like a top. Whew. Let’s move on to other things. After all, the DB is no longer broken, so if it ain’t broke, don’t fix it, right? Wrong. Now is the perfect time to begin the process of fixing it, so that you don’t end up with another festering pile of Band-Aids in the future. Going back to the catcher analogy, with the old database we behaved as if the pitcher threw the ball and the catcher said, “Oh yeah, here, let me get my gear on, I’ll be right there.” If the catcher reacted like that, the guy on first would steal second, third, and home, and the umpire would be screaming bloody murder after getting hit by a fastball nobody was there to catch. In the world of IT, we are losing sales because our DB is down. Our competitor is stealing our sales because we are busy rebuilding a DB server. And the CEO is screaming bloody murder because he or she just got blindsided by the board and didn’t have an answer the board would accept.
As a proper DevOps-style IT team, we need to be in charge of all the possibilities. We need to know, based on the knowledge we have gained over time, exactly how we will react in situations we haven’t yet experienced. Some of the phrases we need to lose are:
- Everybody knows [random belief]. — No, they don’t. Assuming shared knowledge can be counterproductive.
- Don’t worry, X never goes bad, breaks, fails. — Yes, it does, and it will always do it at the worst time. It may not happen as often as Y but it does happen.
- It’s in the Wiki. — Is it? Can people get to the wiki if a disaster happens? Has it been updated this century? What if it’s the wiki that fails?
- Don’t touch that, Ellen is the only one who knows how to keep it running. — Ellen meets a bus; the site goes down.
- Protocol states we can’t do that. — If everyone is constantly butting their head against this kind of wall, you need to invent a new way to work that adheres to the protocol or create a new protocol that allows you to work.
- We’ve never done it that way before — Just because you haven’t doesn’t mean you shouldn’t, nor does it mean you should. You just need a better justification for whether to change or stay.
- Well, best practices say we should… — Stop practicing. It’s time to dive into business needs and develop a new and “best” set of patterns that meets the needs of your business.
If DevOps and your organization are to succeed, the methods that got you into the mess you were in can’t remain. Fix it now.
So how do we go about it? How do we move from reactive firefighting to proactive and preventative? Where do we begin? It’s really just a matter of taking a different viewpoint and learning a new question. Let’s start with: where do we want to be? Or, what do we want? The dismissive answer is, “I want the DB up 24/7, and I never want to be in the shape we were in again.” Fair enough. It may be thrown out as a way to dismiss the question as stupid, but it’s true; it’s what we want. We just need to define how to get there.
Let’s start with the idea of up 24/7. What does that really mean? Does it mean the box is turned on? That it can be pinged? From what viewpoint do we define it as being up? The internet? The LAN? Both? What really defines “up” is what the business needs, and what the business needs, in this case, is that a customer coming to the site from the US (this company doesn’t export) can view all pages and data, place an order, and pay for that order, all without any issue or encumbrance caused by our DB. In short, the needs of the customer, and as a result the needs of the business, drive the definition of uptime. It doesn’t matter what the ping time is: if the order can’t be written to the DB, the site is down, no matter what the dashboard says.
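To make that definition measurable, here is a minimal sketch (in Python, using the requests library) of a probe that walks the customer journey instead of pinging the box. The base URL, endpoints, and payloads are hypothetical placeholders for this example; point it at a staging storefront or a dedicated test account, not at live customer data.

```python
#!/usr/bin/env python3
"""Business-level "is it up?" probe: a minimal sketch with hypothetical endpoints."""
import sys
import requests

BASE = "https://shop.example.com"   # hypothetical storefront URL
TIMEOUT = 5                         # seconds; anything slower looks "down" to a customer


def check(name, fn):
    """Run one step of the customer journey and report pass/fail."""
    try:
        fn()
        print(f"OK   {name}")
        return True
    except Exception as exc:
        print(f"FAIL {name}: {exc}")
        return False


def browse():
    # Customer can actually view product pages, not just get a ping reply.
    r = requests.get(f"{BASE}/products", timeout=TIMEOUT)
    r.raise_for_status()


def order():
    # Customer can place a test order that the DB must record.
    r = requests.post(f"{BASE}/orders", json={"sku": "TEST-SKU", "qty": 1}, timeout=TIMEOUT)
    r.raise_for_status()


def pay():
    # Payment path works end to end (against a sandbox/test gateway).
    r = requests.post(f"{BASE}/payments", json={"order": "TEST", "amount": 0}, timeout=TIMEOUT)
    r.raise_for_status()


if __name__ == "__main__":
    results = [check("browse catalog", browse),
               check("place order", order),
               check("pay for order", pay)]
    # "Up" means every step a customer needs actually works.
    sys.exit(0 if all(results) else 1)
```

Wire the exit code into whatever alerting you already run; a failing step means the site is down in the only sense the business cares about.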
The second half of your want is that you never want to go through the hell you just went through again. Your company lost 10% of its quarter, and the head of IT went with it. You lost most of your life and countless hours of sleep. Now that we know what we want, let’s set up a pattern to get it. It starts with an attitude: just assume it can be done. Ladies and gentlemen, this is software. If we can define it, we can do it. Think “how can I?” not “can I?” That mindset opens you up to what is possible rather than what is impossible. And having defined what we want, we find that what we need is to meet the needs of the business.
So at this point, take a detailed look at what your business really needs out of the DB (or any other device or product in your company). The consumer needs it to be available when they want to purchase. Your DBA needs it monitored and never in danger of running out of resources. Your IT team may want to be able to upgrade the OS it runs on, and the DB software itself, without taking down the site. The hardware team may want to be able to upgrade and replace hardware without taking down the site. Agile refers to these as user stories. You need to get your various stakeholders in a room and ask one question: what do you want? Management and sales probably have a stake in this too, beyond “the site has to be up and functional 24×7”: they want reporting and metrics. Together these are the fixes you need to apply now, in order to ensure that later it won’t break, and that if it does, you can fix it in line with expectations.
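One way to keep those user stories from evaporating after the meeting is to record each one next to the check that proves it is being met. A minimal sketch follows; the stories, thresholds, and verification methods are illustrative assumptions, not a prescription.

```python
# Stakeholder stories paired with the check that proves each one is met.
# Every entry here is an illustrative assumption; replace with your own.
REQUIREMENTS = [
    {"stakeholder": "Customer", "story": "I can browse, order, and pay at any time",
     "verified_by": "synthetic end-to-end probe every 5 minutes"},
    {"stakeholder": "DBA", "story": "I get warned before disk, connections, or memory run out",
     "verified_by": "alert when any resource passes 80% of capacity"},
    {"stakeholder": "IT/Ops", "story": "OS and DB upgrades happen without taking the site down",
     "verified_by": "rolling upgrade rehearsed quarterly in staging"},
    {"stakeholder": "Hardware", "story": "A node can be replaced without an outage",
     "verified_by": "failover drill: pull one node, watch the probe stay green"},
    {"stakeholder": "Management/Sales", "story": "Uptime and order metrics land in a weekly report",
     "verified_by": "automated report generated from monitoring data"},
]

for req in REQUIREMENTS:
    print(f"{req['stakeholder']:18} {req['story']}")
    print(f"{'':18} -> verified by: {req['verified_by']}")
```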
The next step is the fun one. The best way to fix a product that isn’t broken? Break it. Break it in every way you can, and then fix it. Document the fixes, and break it again. Netflix has a great tool for this called Chaos Monkey. It runs constantly against their infrastructure, taking things down randomly, at random intervals, and without notice. IT’s job: find the break, fix it, document it if needed. Rinse, repeat. If you have one DB server and it coughs up a motherboard, do you hope the backup is good? Most do. The time to fix the backup is when the site isn’t broken; did you? When was the last time you checked your backups for quality? Heck, does every member of your team either know how to restore or have a script so detailed that they can do a restore cold? How long does a restore take? If half of your HA cluster fails, will the other half support a spike in load? How much data can you afford to lose?
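As one concrete answer to “when did you last check the backups?”, here is a sketch of a scheduled restore drill, assuming a PostgreSQL dump and the standard dropdb/createdb/pg_restore/psql tools. The backup path, scratch database name, and sanity query are placeholders; the point is that the restore is rehearsed and timed on a schedule, not attempted for the first time during an outage.

```python
#!/usr/bin/env python3
"""Scheduled backup-restore drill: a minimal sketch for a PostgreSQL dump."""
import subprocess
import sys
import time

BACKUP_FILE = "/backups/latest.dump"          # hypothetical path to the newest dump
SCRATCH_DB = "restore_drill"                  # throwaway database used only for the test
SANITY_SQL = "SELECT count(*) FROM orders;"   # hypothetical table we expect data in


def run(cmd, **kwargs):
    """Run a command, echo it, and fail loudly if it fails."""
    print("+", " ".join(cmd))
    return subprocess.run(cmd, check=True, **kwargs)


if __name__ == "__main__":
    start = time.time()
    # Recreate a scratch database and restore the latest dump into it.
    run(["dropdb", "--if-exists", SCRATCH_DB])
    run(["createdb", SCRATCH_DB])
    run(["pg_restore", "--dbname", SCRATCH_DB, BACKUP_FILE])
    # A restore that "succeeds" but contains no rows is still a failed backup.
    result = run(["psql", "-d", SCRATCH_DB, "-t", "-A", "-c", SANITY_SQL],
                 capture_output=True, text=True)
    rows = int(result.stdout.strip())
    if rows == 0:
        sys.exit("Backup restored but contained no orders; investigate now.")
    print(f"Restore OK: {rows} orders restored in {time.time() - start:.0f}s")
```

The elapsed time it prints is also your honest answer to “how long does a restore take?”, so keep the logs.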
This doesn’t just apply to DBs, by the way. Code repositories, web pages, system configurations, network maps, and so on are all data points that, if missing or unavailable, spell disaster for your company. The last time you want to be caught trying to draw out a map of your network is when a switch is down and you have no idea what connects to it, directly or indirectly. Fix it now, while it all works. Draw out that map. Get that map into an environment that is safe, and at regularly scheduled intervals get it out and verify it. I don’t know how many times I’ve heard, “Does anyone know what a box/switch named XXXX does?” Years ago a major university had a small closet that got walled off. In it was a single server that was still running. (The drywall guys were there to wall it off, not to make sure it was ready.) A few years later that box failed and started causing all kinds of issues. If they had fixed their network, drawn proper maps, and then verified those maps, they might not have spent panicked hours searching for the cause of the campus outage, and then tearing down a wall to find out why some Ethernet cables ran into it. A network map and systems inventory would have found the issue, probably before the wall was ever built.
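A small script run on a schedule can do that verification for you. The sketch below assumes the “map” is a simple host-to-role inventory (a stand-in for your CMDB export or spreadsheet) and just confirms that everything you claim to own still resolves and answers on its expected port; anything that doesn’t is drift worth investigating before an outage finds it for you.

```python
#!/usr/bin/env python3
"""Inventory drift check: a minimal sketch against a hypothetical host map."""
import socket

# Hypothetical inventory: hostname -> (role, management port we expect to answer)
INVENTORY = {
    "db01.example.com": ("primary database", 5432),
    "db02.example.com": ("replica database", 5432),
    "sw-core-1.example.com": ("core switch", 22),
    "mystery-box.example.com": ("??? nobody remembers", 22),
}


def reachable(host, port, timeout=3):
    """True if the host resolves and accepts a TCP connection on the given port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for host, (role, port) in INVENTORY.items():
        status = "OK   " if reachable(host, port) else "DRIFT"
        print(f"{status} {host:28} {role} (port {port})")
```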
Some git built what is now the world’s most popular version control software, named it after himself (Git), and made it freely available. There is no longer any excuse for IT departments not to version control their configurations and scripts. Multiple products are available, commercial and FOSS, for managing configurations and installations so that you don’t have to reinvent the wheel during a crash, allowing you not only to repair but to replace with a functionally identical version. If your expert goes under a bus, tools like this, designed to automate and fix before the break, will leave time for a funeral for the person who met the bus, rather than for your team in front of the CEO. The only wrong decisions you can make here are to do nothing, or to spend endless hours trying to pick the perfect tool. Choose wisely; just don’t get addicted to the process of choosing.
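If your configs already live in a Git repository (whether you ran git init in the config directory yourself or use a tool like etckeeper), a scheduled sketch like the one below can snapshot any drift before it gets lost. The directory path and commit message are assumptions for the example.

```python
#!/usr/bin/env python3
"""Config-drift snapshot: a minimal sketch over an existing Git repository."""
import subprocess

CONFIG_DIR = "/etc"   # hypothetical: whatever directory holds your configs, already under Git


def git(*args):
    """Run a git command inside CONFIG_DIR and return its output."""
    return subprocess.run(["git", "-C", CONFIG_DIR, *args],
                          check=True, capture_output=True, text=True).stdout


if __name__ == "__main__":
    changes = git("status", "--porcelain")
    if changes.strip():
        print("Uncommitted configuration drift found:\n" + changes)
        # Assumes git user.name/user.email are configured for this repository.
        git("add", "-A")
        git("commit", "-m", "Automated snapshot of configuration drift")
        print("Snapshot committed; review `git log` to see what changed and when.")
    else:
        print("Configuration matches the last committed state.")
```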
In your environment, what is it that isn’t broken but needs fixing? Network maps, documentation, procedure updates, heck, even something as simple as a phone list of team members can be the difference between a catastrophe and a major problem that got handled. What is it that you should fix now, when things aren’t broken, instead of later when they are? What is a “standard” install for your environment? Define, refine, and test. Then test again. When you’re done, you might want to test again, because your chance of success when it’s real depends on the quality of your fixing before the break. Make sure that your environment always pitches the game you want to catch. If you do, you’ll have all the time you need to pursue a life outside of your cube.