By Tim Fraser
Author’s Note: Up until this point in the series we have covered a lot of theory and a thoughtful characterization of monitoring. If you are starting to feel like we need to get on with it, you are not alone. If you’ve decided to skip to this point in the series, ignoring the preceding background, I encourage you to stop, go back, and read what was previously laid out, as this foundation is helpful in understanding why the following decisions are made.
We’ve come to the point in this monitoring series where we can begin to reap the rewards. Thus far I’ve discussed a methodical approach to monitoring, defining a set of principles to guide the instantiation of data gathering. I’ve discussed the different dimensions of data gathering and how to collect the necessary data. Next, let’s discuss how to consume the collected data.
The intention of any monitoring solution is not “pretty graphs” or “fascinating numbers”; it is to produce something tangible and, in many cases, something actionable. Perhaps you’d like to spot early indicators of failure. Maybe you want to understand better how your application is being used by your users. Maybe both. Regardless, you will almost certainly want output which shows you how your systems are behaving.
Human beings are exceptionally good at spotting patterns. This intuition (perhaps a survival strategy) allows us to take a seemingly random experience and apply our past knowledge so that we may predict an outcome through profiling. In addition to profiling, we can also see a periodicity or trend and use it to predict that the next outcome will follow the same trend. Some would say this is what holds us humans back from new experiences, but I digress.
The outputs from monitoring are the reward, if you will. They are why we put all this effort into collecting the data in the first place. Let’s see how we can make use of the monitoring data collected.
Discovery
Rarely is monitoring a system possible with an out-of-the-box solution. I’m always skeptical of solutions that say they have built-in templates for every system and every product. While a packaged product is designed to behave in a predictable way (the way it was designed and built to run), the usage patterns will be different from implementation to implementation.
For this reason, every monitoring solution needs to have a mechanism for performing the discovery of the monitoring data. Sometimes referred to as data mining (I may be showing my age a bit), or sometimes referred to as ad-hoc querying, a discovery mechanism allows the consumer of the data to explore.
The field of data science has a lot to add here about the methodology of data exploration. I won’t go into the specifics of data science, but there is always a need to explore data in an iterative and multi-dimensional manner. Sometimes you are required to look for that needle in the haystack. Regardless, you will likely have a need to figure out a pattern at some point.
Discovery can take the form of extractions that you play with in another tool such as a spreadsheet, an RDBMS that you can query, or a more modern approach using data science techniques such as MapReduce jobs that produce various outputs. No matter the solution, providing a centralized (or seemingly centralized) solution allows for full data exploration.
The approach is one that cannot be predicted and requires access to all data in a dynamic manner. Look for solutions, or more likely a set of solutions, that allow you to explore data outside of pre-conceived mechanisms for data presentation. You’re going to need to explore at some point.
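As a rough illustration of the ad-hoc querying described above, here is a minimal sketch that loads a handful of samples into an in-memory SQLite database and asks one unplanned question of them. The table layout, host names, metric names and the `worst_by_host` helper are all assumptions made up for this example; they do not come from any particular monitoring product.

```python
import sqlite3

# Hypothetical samples: (host, metric, unix_ts, value). Invented for illustration.
SAMPLES = [
    ("web-1", "response_ms", 1000, 120.0),
    ("web-1", "response_ms", 1060, 135.0),
    ("web-2", "response_ms", 1000, 110.0),
    ("web-2", "response_ms", 1060, 940.0),
    ("web-1", "cpu_pct", 1000, 35.0),
    ("web-2", "cpu_pct", 1060, 91.0),
]


def load(samples):
    """Load the samples into an in-memory database that can be queried freely."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE samples (host TEXT, metric TEXT, ts INTEGER, value REAL)"
    )
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?, ?)", samples)
    return conn


def worst_by_host(conn, metric):
    """One ad-hoc question: what is the worst value seen per host for a metric?"""
    rows = conn.execute(
        "SELECT host, MAX(value) FROM samples "
        "WHERE metric = ? GROUP BY host ORDER BY MAX(value) DESC",
        (metric,),
    )
    return rows.fetchall()


conn = load(SAMPLES)
print(worst_by_host(conn, "response_ms"))
```

The query itself is not the point. What matters is that the raw samples stay available, so the next question (by time range, by metric, by anything else) can be asked without changing how the data was collected.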
Reporting
Reporting, while defined by most dictionaries as a mechanism for displaying data (creative license applied), is more usefully thought of as measuring the same set of data, in the same way, at regular intervals. Wow, that’s a mouthful!
Consider a scenario where a large group of consumers needs to agree on the measurement of a data set in order to predict future trends or to ensure the success of past trends. In these cases, reporting is valuable because it allows this group to agree on a common “language” for further discussion and analysis.
The primary goal of reporting is consistency. Ad-hoc querying and data discovery are a fantastic way to learn about the health of any system, but reporting is a way to perform this same measurement over time. Reports and their data need to be comparable to one another.
One of the most fundamental outcomes of reporting (and monitoring outputs in general) is the wide availability of the data beyond an operational team. You will never be able to anticipate what others will learn about an application when diving into monitoring data, and there is nearly no reason not to expose the data unless it is considered a security concern. Understanding the usage patterns of an application is rarely a security concern for owners and consumers of the applications, and if not a security concern there is no reason not to share the monitoring data.
Look for a solution that allows you to discover data, finding trends and patterns, and then package up those findings into a consumable format that others can rely upon. This can be in the form of saved searches or a generated and emailed report that can be delivered on some periodicity.
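To make the consistency point concrete, here is a minimal sketch of a report whose definition never changes between runs, so that one period can be compared directly with another. The `periodic_report` function, its fields (`count`, `mean`, `max`) and the sample values are assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean

DAY = 86_400  # seconds in one reporting period


def periodic_report(samples, period=DAY):
    """Summarize (unix_ts, value) samples per period with one fixed definition.

    Every period gets the same fields, computed the same way, which is what
    makes two reports comparable.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each sample to the start of the period it falls in.
        buckets[ts - ts % period].append(value)
    return {
        start: {"count": len(vals), "mean": mean(vals), "max": max(vals)}
        for start, vals in sorted(buckets.items())
    }


# Hypothetical response-time samples spanning two days.
samples = [(10, 100.0), (20, 200.0), (DAY + 5, 150.0), (DAY + 50, 450.0)]
report = periodic_report(samples)
for start, row in report.items():
    print(start, row)
```

A saved search or an emailed report is essentially this: a question that was agreed upon once and is then re-asked on a schedule.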
Dashboards
Dashboards…ahhh, the marketing holy grail! I’ll let you in on a little secret no one wants to admit: the commonly used “single pane of glass” simply does not exist. It is impossible to get in-depth knowledge of a system through a single pane. It’s too one-dimensional to be useful or accurate.
While packaged reports that can be consumed via email or on a dashboard are quite helpful when made available to a broad audience, the concept of combining all data inputs from monitoring into a single location is an effort that rarely works, and when it does the data is distilled to a level where the key details are missing. The “single pane of glass” becomes dirty and fogged up almost immediately when someone asks the simple question “why?”.
For this reason, I think that a multi-paned approach is much more valid. While the dashboard may be expressed in a single screen of data much like the dashboard of a car, it is much more likely that each pane will be derived from a different data set.
Using the car example: a dashboard has a speedometer, a gas gauge, a temperature gauge and a red light that is either on or off. This is a fantastic multi-paned view, giving the consumer of this data a quick glance into the health of the system with an agreed-upon set of parameters, but it is not a “single pane of glass” as some may incorrectly insist. This is multiple panes brought together in order to present a high-level set of standardized metrics, each pane from a different source and on a different scale.
Don’t get me wrong, dashboards are great! Beyond the pretty marketing webpage of monitoring solutions, they provide a mechanism for quickly determining the health and usage of a system. Even for monitoring solutions that ultimately do not achieve a good dashboard, simply attempting to build one is an exercise in understanding the system and expressing its current state in a way that many people can agree on.
The only word of caution I will offer on dashboards is that they should not be part of the initial set of requirements. Dashboards are only valuable if they can be agreed upon, built and then left with few modifications. Too often I see someone express their desire for a dashboard without first understanding the data and the system, only to have to reimplement the dashboard multiple times until they gain that understanding.
Look for tools that allow you to display standardized metrics. While discovery and reporting are still required, once the discovery is complete (for this go-around) and everyone understands the reports, move this to a broadly consumable location that everyone involved in the creation of the system can access. Simply put: look for a dashboarding solution as the last step, not the first.
Integrations
Using monitoring data beyond the confines of the monitoring solutions is a great way to expand the impact of any monitoring efforts. Ideally, you’d like to have the results of monitoring come to you, rather than you to it. Staring at a dashboard is mesmerizing, but rarely an efficient use of one’s time.
The most common integration is alerting. Based upon some rules, an alert is sent to an operator so that they may don their cape and swoop in to save the day. While the superhero admin is not a common occurrence anymore, alerting is.
The chain of integration from data collection to alerting must be as short as possible. Avoid adding any unnecessary (or merely perceived as necessary) integration points. In cases where an alert (or an initiating rule) would be beneficial to multiple consumers, I recommend continuing with the direct alerting integration while additionally publishing the event on a message bus (e.g. pub/sub) for broader consumption. A mechanism like this can start to create a rather amazing amount of insight for a broad set of applications (consumers).
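The pattern described above (alert directly, then publish for everyone else) can be sketched in a few lines. Everything here is hypothetical: the `Bus` class is a toy in-process stand-in for a real message bus, and `Event`, `evaluate`, the `"alerts"` topic and the threshold value are names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str
    metric: str
    value: float


class Bus:
    """Toy in-process stand-in for a message bus (an assumption, not a real product)."""

    def __init__(self):
        self._subscribers = {}

    def subscribe(self, topic, handler):
        self._subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic, event):
        for handler in self._subscribers.get(topic, []):
            handler(event)


def evaluate(event, threshold, notify_operator, bus):
    """Apply a simple threshold rule; alert directly, then fan out on the bus."""
    if event.value <= threshold:
        return False
    notify_operator(event)        # direct path to the operator stays short
    bus.publish("alerts", event)  # additional consumers hang off the bus
    return True


pages = []  # what the operator receives
seen = []   # what some other consumer receives
bus = Bus()
bus.subscribe("alerts", seen.append)

fired = evaluate(Event("web-2", "response_ms", 940.0), 500.0, pages.append, bus)
quiet = evaluate(Event("web-1", "response_ms", 120.0), 500.0, pages.append, bus)
print(fired, quiet, len(pages), len(seen))
```

The operator notification does not pass through the bus, so adding or removing bus consumers never lengthens the path between the rule firing and a human finding out.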
Some of the most advanced rule evaluations are typically implemented in a CEP (complex event processing) engine. There are a number of open- and closed-source solutions to pick from, but most operate in the same way by evaluating a set of rules and retaining some sort of data heuristic within a predefined window. It is far beyond the scope of this blog post to detail the benefits or inner workings of a CEP solution. See the references section for a great podcast on the rule engine topic (it touches on CEP).
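To give a feel for what “evaluating a rule within a predefined window” means, here is a deliberately tiny sketch. It is not how any particular CEP engine works; the `WindowRule` class, its parameters and the sample readings are assumptions for illustration only. The rule fires once at least `min_hits` readings have exceeded `threshold` within the last `window_s` seconds.

```python
from collections import deque


class WindowRule:
    """Fire when enough threshold breaches fall inside a sliding time window."""

    def __init__(self, threshold, min_hits, window_s):
        self.threshold = threshold
        self.min_hits = min_hits
        self.window_s = window_s
        self._hits = deque()  # timestamps of breaches still inside the window

    def observe(self, ts, value):
        """Record one reading (timestamps must be non-decreasing); return True if the rule fires."""
        if value > self.threshold:
            self._hits.append(ts)
        # Drop breaches that have aged out of the window.
        while self._hits and self._hits[0] <= ts - self.window_s:
            self._hits.popleft()
        return len(self._hits) >= self.min_hits


rule = WindowRule(threshold=500.0, min_hits=3, window_s=60)
readings = [(0, 600.0), (10, 700.0), (20, 100.0), (30, 800.0), (100, 900.0)]
results = [rule.observe(ts, value) for ts, value in readings]
print(results)
```

A single spike does not fire the rule, and neither does a breach that arrives long after the earlier ones have aged out. That retained state is the “data heuristic within a predefined window” mentioned above, reduced to its simplest form.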
When thinking about the integrations with your monitoring data, consider the automation opportunities that can exist given a clean and detailed input mechanism. For the most advanced integrations that start to achieve the “self-healing” we were promised by IBM back in the late 1990s, we can look to the latest advancements in AIOps.
Emerging: AIOps
When we talk about event correlation and complex event processing (CEP) we often consider one of three options: a) limiting the number of events due to capacity issues, b) employing an army to review a large number of events, or c) implementing some automated processing solution to interpret the events at scale. This is not a new concept, and automated processing is something most event correlation solutions have provided for years in one form or another.
But the signal-to-noise ratio remains quite ominous, with the noise only being reduced if rules are put into place to rationalize these events. If a new application or event type is deployed, new rules are required. Given the agility of development teams and their ability to both enhance existing products and produce new ones at an amazing pace, the event profile changes nearly constantly and is impossible to keep up with.
There are some events that may appear rather static on the surface. Events such as CPU utilization or the response time of an application are at the core of every system, but these are trailing indicators: they do not allow for any predictive, or proactive, resolution before the impending doom arrives.
Enter AIOps, or Algorithmic IT Operations. AIOps solutions promise to reduce the noise while simultaneously increasing the signal, making the notion of highly inefficient operators a thing of the past. I say “promise” because this is an emerging space and is reminiscent of past solutions that never delivered on their marketing slideware.
Still, given the rapid advancements in ML (machine learning) along with the increased set of skills being acquired quite quickly in the technology industry and beyond, it is likely that an AIOps solution will finally deliver on that promise. If only I might also receive my flying car that I was promised 20 years ago!!
Keep an ear to the ground and start looking for the keyword AIOps. The promise is huge and the platform that this will be delivered on is riper than ever. Once I learn more about this space I’ll write a separate blog on my findings.
Conclusion
I hope that you see the benefits of defining how and when you will use the data you’ve so carefully collected. If you truly start with the end in mind and work backward from the result, defining your required outputs first may best inform how you should proceed from the starting line.
References
Superior pattern processing is the essence of the evolved human brain
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4141622/
A rather detailed (but neat) research paper on the human ability to recognize patterns far better than most species.
SE-Radio Episode 299: Edson Tirelli on Rules Engines
http://www.se-radio.net/2017/08/se-radio-episode-299-edson-tirelli-on-rules-engines/
An excellent discussion of rules engines, including the subcategory of CEP.
Towards architecture-based self-healing systems
http://dl.acm.org/citation.cfm?id=582133
An exhaustive example of the level that integrations with monitoring events can achieve.
An Introduction to AIOps: The Benefits of Algorithmic IT Operations (requires registration)
https://www.moogsoft.com/resources/aiops/white-paper/aiops-intro/
A bit of marketing material, but still relevant in teaching a bit about AIOps.
Market Guide for AIOps Platforms (paywall)
https://www.gartner.com/doc/3772124/market-guide-aiops-platforms
Gartner’s definition of AIOps. Given the emerging nature of this space, this may be the best information you can receive outside of marketing material.