By Mike Julian, Senior Technical Consultant at Taos
I have led or assisted in many monitoring projects over the years, too many to count. I’ve managed more than my fair share as a full-time system administrator. After a while, I found myself giving the same advice to anyone who asks, so it seems only fitting to finally write it down.
1. Monitor for the user and business first.
When you stand up a new monitoring platform, what’s the first set of checks you add? It’s usually system-level checks: CPU, memory, load average, etc. They’re easy and of course you need them, right?
Here’s a scenario: your company has a web app, through which all its business is done. Let’s assume it’s a standard three-tier architecture, consisting of load balancers, application servers, and database servers. What are the critical things in this setup? Rephrased, what do your CEO and users care about?
Your CEO wants to know:
- Is the application working?
- How many new users are we getting?
- Are we making money?
- How much money are we making?
Your users care about:
- Is the application fast enough to be usable?
- Can I make a purchase?
Your business might have a few more key things, perhaps fewer, but you get the idea. While monitoring for a disk filling up can be useful, start your monitoring with the more important things: the business end.
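To make the user and CEO questions above concrete, here is a minimal sketch that turns a batch of request records into three user-facing numbers: availability, 95th-percentile latency, and completed purchases. The `Request` record, the `/checkout` path, and the choice to treat only 5xx responses as failures are assumptions made for illustration, not something prescribed by any particular tool.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    """One served request (hypothetical record shape)."""
    path: str
    status: int
    latency_ms: float


def user_facing_summary(requests: List[Request],
                        checkout_path: str = "/checkout") -> dict:
    """Summarize what users and the business care about."""
    if not requests:
        return {"availability": None, "p95_latency_ms": None, "purchases": 0}

    # "Is the application working?" Assumption: only 5xx counts as a failure.
    served = sum(1 for r in requests if r.status < 500)
    availability: Optional[float] = served / len(requests)

    # "Is it fast enough to be usable?" Nearest-rank 95th percentile.
    latencies = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(0.95 * len(latencies)))
    p95 = latencies[rank - 1]

    # "Can I make a purchase?" Count successful checkout requests.
    purchases = sum(
        1 for r in requests
        if r.path == checkout_path and 200 <= r.status < 300
    )
    return {
        "availability": availability,
        "p95_latency_ms": p95,
        "purchases": purchases,
    }
```

Numbers like these make a reasonable first set of checks; CPU and disk can follow once they are in place.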
2. Feel free to graph all the things, but be careful what you alert on.
People have a tendency to alert on every single thing they monitor. Where did that idea even come from? It seems silly to me. For example, why alert on system load being over a certain amount? It doesn’t tell you anything useful by itself, and it may be a red herring anyway.
There are some metrics that are great for troubleshooting purposes, but alone, don’t tell you anything useful. Feel free to graph anything that moves, but be careful what you choose to alert on.
As for what to alert on, that depends on your business. Alert on the business and user metrics first, and go deeper later.
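One way to keep "graph it" and "alert on it" separate is to make alerting an explicit opt-in per metric. The sketch below assumes a hypothetical in-memory `Metric` type; the metric names and the 5% error-rate threshold are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Metric:
    """A metric that is always stored, and only optionally alerted on."""
    name: str
    # Alerting is opt-in: None means "graph only".
    alert_when: Optional[Callable[[float], bool]] = None
    samples: List[float] = field(default_factory=list)


def record(metric: Metric, value: float) -> bool:
    """Store the sample for graphing; return True only if it should page."""
    metric.samples.append(value)
    return metric.alert_when is not None and metric.alert_when(value)


# Graphed for troubleshooting, never pages anyone.
load_average = Metric("load_average")

# A user-facing metric that is worth waking someone up for.
checkout_error_rate = Metric(
    "checkout_error_rate",
    alert_when=lambda rate: rate > 0.05,  # hypothetical threshold
)
```

With this shape, adding a new graph costs nothing in pager noise, and every alert has to be a deliberate decision.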
3. Include runbooks and metadata for all alerts.
Ever had the pager wake you up at 3am with an alert, and you have no idea what it’s trying to tell you? Every major monitoring tool out there has the ability to include additional data in an alert or check output, and this is a great place to put additional information about the alert.
Linking to a runbook is a great use of this functionality. When an alarm goes off, I want to know why it’s doing that, and how to make it stop. A runbook is a good way to answer those questions, and as a bonus, helps spread the knowledge of the infrastructure around to all of the engineers.
What’s a runbook, you ask? A runbook is a document that explains how a service works and what to do when it goes bad. Having specific sections in a runbook for each alarm related to a service is an easy way to make the on-call experience much better.
Another really great use is to include additional information about the check. For example, if you have an alert for a disk filling up, wouldn’t it be neat if your alert also included a graph? At 3am, my first question is going to be, “Is this a sudden change, or one that’s been building up over time?”
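As a rough sketch of what such an alert could carry, the function below builds a plain-text alert body with a runbook link, a graph link, and a one-word answer to the "sudden or gradual?" question. The function names, the URLs, and the rule that a single step covering at least half of the total rise counts as "sudden" are all assumptions chosen for illustration.

```python
from typing import List


def describe_trend(values: List[float]) -> str:
    """Classify recent samples (oldest first) as a sudden or gradual rise."""
    if len(values) < 2:
        return "unknown"
    total_rise = values[-1] - values[0]
    if total_rise <= 0:
        return "not increasing"
    biggest_step = max(b - a for a, b in zip(values, values[1:]))
    # Assumption: one step covering half the total rise means "sudden".
    return "sudden" if biggest_step >= 0.5 * total_rise else "gradual"


def format_alert(check: str, summary: str, runbook_url: str,
                 graph_url: str, recent_values: List[float]) -> str:
    """Build an alert body that explains itself at 3am."""
    return "\n".join([
        f"ALERT: {check}",
        f"Summary: {summary}",
        f"Trend: {describe_trend(recent_values)}",
        f"Runbook: {runbook_url}",
        f"Graph: {graph_url}",
    ])


# Example with placeholder URLs (not real endpoints).
message = format_alert(
    check="disk_usage /var",
    summary="Disk usage is at 90%",
    runbook_url="https://wiki.example.com/runbooks/disk-usage",
    graph_url="https://graphs.example.com/disk-usage?host=db01",
    recent_values=[50.0, 51.0, 52.0, 90.0],
)
```

Most monitoring tools have some notes or custom-field mechanism that a body like this could be placed in; the exact field depends on the tool.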
4. Monitoring is never done.
Many monitoring systems come with default alarm parameters for checks. For example, Nagios’ default CPU check alarms at 80%. Is that a good thing? A bad thing? I have no idea, and I’d venture that you don’t either.
Instead of setting a threshold and leaving it untouched, tweak it over time. Start integrating statistical analysis into your checks, perhaps. Write new and better checks that more accurately capture the state of your infrastructure. Keep tweaking those.
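As one simple example of statistical analysis in a check, the sketch below replaces a fixed threshold with "more than three standard deviations away from recent history". The function name, the three-sigma default, and the minimum sample count are assumptions, and real metrics with daily or weekly cycles would likely need something more careful.

```python
import statistics
from typing import Sequence


def is_anomalous(history: Sequence[float], value: float,
                 sigmas: float = 3.0, min_samples: int = 10) -> bool:
    """Flag a value that deviates strongly from recent history."""
    # With too little history, stay quiet rather than guess.
    if len(history) < min_samples:
        return False
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)  # sample standard deviation
    if stdev == 0:
        # Perfectly flat history: any change at all is unusual.
        return value != mean
    return abs(value - mean) > sigmas * stdev


cpu_history = [40, 42, 41, 39, 40, 43, 41, 40, 42, 41]
print(is_anomalous(cpu_history, 42))  # within normal variation
print(is_anomalous(cpu_history, 80))  # far outside recent behavior
```

Because the baseline comes from the data itself, the check moves as the infrastructure changes, though the window size and sigma value still deserve periodic review.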
I’ve noticed that often people want monitoring so that they can say “it’s monitored”, but that’s not really helpful, is it? How often does your infrastructure stay exactly the same for months at a time? The answer is “not very often”, so why should your monitoring be any different?
5. Monitoring is the job of everyone.
Over the years, I’ve noticed a common anti-pattern: “the monitoring administrator”.
This is the person who not only manages the monitoring systems, but is also on the receiving end of all tickets asking to add things to monitoring. Commonly, the person that ends up with this task is simply the person who likes monitoring the most out of the team.
In my experience, a single person being responsible for monitoring always results in immature monitoring. This anti-pattern represents a cultural belief that monitoring is a separate concern from the service being monitored, and is thus relegated to an afterthought.
Monitoring is a critical component of any service, like performance and manageability. Bolting on monitoring after the fact is a difficult thing to do, and even more difficult to do well. If, instead, you think about how you’re going to monitor the service while you’re building it, and set up those hooks then, you will always end up with more mature monitoring. The person building the service is the person who knows the specific failure scenarios and metrics of the service, so it only makes sense for them to also build monitoring into it.
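A small sketch of what "setting up those hooks" while building a service might look like: a decorator that counts calls, errors, and time spent for any function it wraps. The `METRICS` dictionary, the `instrumented` decorator, and the `charge_card` function are hypothetical; a real service would more likely send these numbers to whatever metrics library the team already uses.

```python
import functools
import time
from collections import defaultdict

# Hypothetical in-process metric store, keyed by metric name.
METRICS = defaultdict(float)


def instrumented(name: str):
    """Wrap a function so that it reports calls, errors, and duration."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            METRICS[f"{name}.calls"] += 1
            try:
                return func(*args, **kwargs)
            except Exception:
                METRICS[f"{name}.errors"] += 1
                raise  # never hide the failure from the caller
            finally:
                METRICS[f"{name}.seconds_total"] += time.perf_counter() - start
        return wrapper
    return decorator


@instrumented("payments.charge")
def charge_card(amount: float) -> str:
    """Illustrative service function with a known failure scenario."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    return "charged"
```

The author of `charge_card` is the one who knows a non-positive amount is a failure case worth counting, which is why the hook belongs next to the code.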
6. The “one monitoring tool” is a myth.
Hoping for a single tool to tell you how your applications, network, and servers are doing, in enough detail to be useful, but not so much that you’re overloaded, is something everyone wants but no one gets. There are a multitude of enterprise software companies claiming to solve this problem, but I’ve never seen it be true.
What I have seen work is embracing the UNIX philosophy: one really great tool for one single purpose, then couple them all together. Taking this approach, you will end up with a more mature monitoring environment and better flexibility.
A common objection I’ve seen to building something in-house is cost. However, in the cases where I’ve seen both approaches, the cost has usually been about the same (vendor licensing + integration hours vs. staff hours), but the business has always come out ahead with the DIY approach.
In conclusion, there is a lot more to mature and effective monitoring than simply picking a tool and calling it done. Effective monitoring requires work from everyone, but such work has immense benefits in business insight, performance analysis, and generally having enough confidence in your systems to sleep well at night. I hope this article has been educational and informative, and that you strive for more effective monitoring in your own company.
Mike Julian is a Senior Taos Consultant, specializing in open-source monitoring architecture and integration. He formerly worked for a large hosting provider, where he led a global monitoring platform deployment, and prior to that, at a US National Lab, where he led organization-wide monitoring efforts.