Author’s Note: This is a nine-part monitoring series. I recommend reading the posts in the order they are published for cohesiveness.
I know what you’re thinking: “let’s get on with it, show me the good stuff!” Well, this is the good stuff. Unless you know why you’re monitoring and take the time to create boundaries, you will end up with a bunch of useless data and system overhead. Put another way: measure twice, cut once.
Let’s explore a set of principles to use when monitoring an application.
Start with the end in mind
I think that everything we do in life fits nicely into this principle, so let’s start here. Begin your monitoring design and landscape creation by considering what you will be using the monitoring for. Collecting data for the sake of collecting data is best left to three-lettered US government agencies; why gather data you’ll never use?
Begin by looking at the users of the monitoring system. Are the users operators? Business folks? Are they technical? Will they be looking at a moment in time, or is a historical summarization more appropriate? The questions you may ask yourself can go on for some time, but take this time to get to know your audience.
Consider use cases and usage patterns such as capacity planning, application debugging, and simple data exploration when designing your monitoring solution. And while you’re at it, make sure the monitoring output is generally available to everyone and easy to consume; you may be surprised what others use your monitoring solution for, and monitoring data should never be considered private or secret!
Note: There is a case for secret monitoring, but that is out of the scope of general, or common, monitoring use cases.
Choose scalable monitoring
It may be tempting to pick a tool and run with it because you know the tool or it was recommended to you, but you need to make sure that your monitoring scales just as quickly as your application. A monitoring system that ceases to function for any reason is useless; worse, it could provide false insights or have glaring holes. In the absolute worst case, monitoring could negatively impact the application you are intending to observe.
Start by considering the scale decisions made during your application’s design and apply those same decisions to your monitoring solution. In almost every case it is advisable that the engineers working on the application itself are also involved in the requirements, selection, and implementation of the monitoring solution.
In most cases a passive monitoring solution is preferred over an active one. Active monitoring (polling) systems tend to reach their scale limits quickly as the number of systems, applications, and metrics being monitored expands, and expansion tends to happen exponentially. Passive monitoring solutions let systems and applications distribute the health checks and data collection themselves, reporting an aggregated or summarized state to a centralized solution.
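As a rough illustration of the passive model, each node can run its own checks, collapse them into a summarized state, and push only that summary upstream. This is a minimal sketch; the collector URL, check names, and thresholds are all hypothetical, not from any specific tool.

```python
import json
import urllib.request

# Hypothetical central collector endpoint -- an assumption for illustration.
COLLECTOR_URL = "http://collector.example.com/report"

def run_local_checks() -> dict:
    """Each node runs its own health checks (the passive/push model)."""
    checks = {
        "disk_ok": True,      # placeholder results for illustration
        "queue_depth": 12,
        "error_rate": 0.002,
    }
    # Aggregate to a single summarized state before reporting upstream,
    # so the central system receives one record per node, not N raw checks.
    checks["healthy"] = checks["disk_ok"] and checks["error_rate"] < 0.01
    return checks

def report(summary: dict) -> None:
    """Push the aggregated summary to the central collector."""
    body = json.dumps(summary).encode()
    req = urllib.request.Request(
        COLLECTOR_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)
```

The scaling win is that the central system only ingests pre-aggregated summaries instead of polling every node for every metric.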
Right tool for the job
This is another principle that fits with most things in life, and you certainly don’t want to use a screwdriver to pound in a nail. Make sure you consider what you will monitor, from the individual data points to the applications themselves. Technology selection is always a thorny process, but a monitoring tool that closely mirrors how your application is designed will pay dividends down the road.
Look for tools that do exactly what you want them to do. Resist the urge to find the one tool that does everything! (Hint: it doesn’t exist…but that’s another blog post.) You may end up with many tools, and it may start to feel like sprawl. It may, in the end, become a sprawling set of monitoring tools! But does a carpenter carry a tool belt with only one tool?
For early systems, expect to start small and grow, adding tools or making changes along the way. It is rare to see a system “born” perfect from the start.
Integrate with your application (internally)
Along with tool selection, working with the application engineering team will allow you to provide the right hooks into the monitoring solution. The best monitoring I’ve seen allows an application to report its health and key metrics to a centralized solution; rarely does introspection without context end in a successful solution.
Providing well-documented APIs and an SDK will allow engineers to quickly and easily add to the monitoring. Code samples and development involvement go a step further. Whatever your mode of operation, deep integration with the application is a key element, and making it easy will lead to success.
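To make the idea concrete, here is a minimal sketch of the kind of in-app registry an SDK might expose: engineers register a named check with one call, and the application can report all of them as a single snapshot. The class, check names, and field names are hypothetical, invented for this example.

```python
import json
import threading
import time

class Health:
    """Minimal sketch of an in-app health registry an SDK might expose."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._checks = {}

    def register(self, name, fn) -> None:
        """Engineers add a named check with one call -- the 'easy hook'."""
        with self._lock:
            self._checks[name] = fn

    def snapshot(self) -> dict:
        """Run every registered check and report name -> pass/fail."""
        with self._lock:
            items = list(self._checks.items())
        results = {name: bool(fn()) for name, fn in items}
        results["_healthy"] = all(results.values())  # overall rollup
        results["_ts"] = time.time()                 # when it was measured
        return results

health = Health()
health.register("db", lambda: True)     # placeholder checks for illustration
health.register("cache", lambda: True)
print(json.dumps(health.snapshot()))
```

Because the application itself supplies the checks, every reported state arrives with context instead of being inferred from the outside.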
Integrate with other systems (externally)
Nothing is worse than a monitoring solution that provides great graphs and alerts but little context. Integrating with systems such as other monitoring tools, release processes, or a CMDB provides great insight into what the monitoring is trying to tell you.
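One common form of this integration is annotating an alert with any release that landed just before it fired. The sketch below assumes a hypothetical release feed (it could come from a CI system or a CMDB); the field names and two-hour window are illustrative choices, not recommendations.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical recent-release feed, e.g. pulled from CI or a CMDB.
RECENT_RELEASES = [
    {"service": "checkout", "version": "1.4.2",
     "deployed_at": datetime(2017, 3, 1, 14, 5, tzinfo=timezone.utc)},
]

def annotate_alert(alert: dict, window: timedelta = timedelta(hours=2)) -> dict:
    """Attach any release that landed shortly before the alert fired."""
    suspects = [
        r for r in RECENT_RELEASES
        if r["service"] == alert["service"]
        and timedelta(0) <= alert["fired_at"] - r["deployed_at"] <= window
    ]
    return {**alert, "recent_releases": suspects}

alert = {"service": "checkout", "message": "error rate spike",
         "fired_at": datetime(2017, 3, 1, 14, 30, tzinfo=timezone.utc)}
print(annotate_alert(alert)["recent_releases"])
```

An on-call engineer seeing “error rate spike, 25 minutes after version 1.4.2 shipped” has far more to work with than the graph alone.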
Even if there are no other systems to integrate with initially, make certain during technology selection that these types of extensions are available.
Know your history
Knowing exactly how your system is performing and its current state of health is critical. Whether it’s a quick glance or an automated alert, current state matters! But what is current state without a baseline to measure it against? For this reason the history of a system’s health is imperative for determining current state.
This may be obvious for metrics-based monitoring such as queries per second or the number of instances, but it also applies to failures and outages. Spotting trends is much more difficult with point-in-time monitoring than with trend monitoring, but both are equally important.
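The baseline idea can be sketched with a rolling window of recent samples: judge each new value against the mean and spread of its own history rather than a fixed threshold. The window size, warm-up count, and three-sigma rule below are illustrative assumptions, not tuned values.

```python
from collections import deque
from statistics import mean, stdev

class Baseline:
    """Rolling history of one metric; flags samples far from the norm."""

    def __init__(self, window: int = 60) -> None:
        self.history = deque(maxlen=window)  # bounded historical baseline

    def observe(self, value: float) -> bool:
        """Record a sample; return True if it looks anomalous vs. history."""
        anomalous = False
        if len(self.history) >= 10:  # need some history before judging
            mu, sigma = mean(self.history), stdev(self.history)
            # Flag anything more than three standard deviations from the mean.
            anomalous = sigma > 0 and abs(value - mu) > 3 * sigma
        self.history.append(value)
        return anomalous

qps = Baseline()
for sample in [100, 102, 99, 101, 98, 103, 100, 97, 101, 100]:
    qps.observe(sample)      # steady traffic builds the baseline
print(qps.observe(250))      # far above the baseline
```

The same “compare against your own history” approach works for failure counts and outage frequency, not just throughput metrics.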
Known knowns and known unknowns
Choosing what to monitor in the early stages of deployment is fairly difficult because a new system behaves in unexpected ways. For this reason it’s important to select the items that are at risk of failing AND are critical to the overall user experience. Beyond the expected failures, identify the items that are critical but whose failure modes are uncertain. Don’t focus purely on the things you know will work.
Monitoring sprawl has a close cousin: monitoring overload. Whether the overload impacts the system or is simply too much to consume, it is cumbersome and disconcerting. Resist the urge to “monitor everything,” because the noise-to-signal ratio will be high and spotting issues will be difficult. Not only do systems and their users constantly change, but there is always something to learn from every system. It is always best to iterate, learn, grow, and repeat.
Monitor for more than one thing
It may be tempting to monitor for only one type of fault in the system, complete failure being the most common fault monitored in early deployments. Also consider items such as slowness, security concerns, data or transaction anomalies, broken workflows, or configuration drift, to name just a few.
Be sure to consider how your application will behave and how it will fail. Can it handle a gazillion requests per second? What if the configuration file is corrupt? Is 42ms latency acceptable? It’s almost never one thing that fails a system and it’s almost never one thing that fails an individual component.
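Those questions translate naturally into several independent fault checks evaluated together. In this sketch the thresholds, required config keys, and the orders-vs-payments comparison are all hypothetical examples of fault dimensions, not prescriptions.

```python
# Hypothetical latency budget -- an illustrative value, not a recommendation.
MAX_LATENCY_MS = 42

def check_latency(latency_ms: float) -> list:
    """Slowness is a fault even when requests succeed."""
    return [] if latency_ms <= MAX_LATENCY_MS else [f"slow: {latency_ms}ms"]

def check_config(config: dict) -> list:
    """A corrupt or drifting config is a fault even when the app is 'up'."""
    required = {"db_host", "pool_size"}
    missing = required - config.keys()
    return [f"config missing: {k}" for k in sorted(missing)]

def check_anomaly(orders: int, payments: int) -> list:
    """A transaction mismatch: the system is up but the data is wrong."""
    if orders == payments:
        return []
    return [f"anomaly: {orders} orders vs {payments} payments"]

def evaluate(latency_ms, config, orders, payments) -> list:
    """Collect faults across several dimensions, not just up/down."""
    return (check_latency(latency_ms)
            + check_config(config)
            + check_anomaly(orders, payments))

faults = evaluate(55.0, {"db_host": "db1"}, orders=10, payments=9)
print(faults)
```

A single up/down probe would report this system as healthy; evaluating several dimensions surfaces three distinct problems at once.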
Author’s Note: I’m sure that as I write this monitoring series I will reflect back on this specific topic. As I happen upon core principles I feel are important I will alter this blog post with those principles.