By Brandon Knitter
Author’s Note: Up until this point in the series we have covered a lot of theory and a thoughtful characterization of monitoring. If you are starting to feel like we need to get on with it, you are not alone. If you’ve decided to skip to this point in the series, ignoring the preceding background, I encourage you to stop and go back and read what was previously laid out as this foundation is helpful in understanding why the following decisions are made.
The two most critical components of a monitoring solution are the inputs and the outputs. In this post, we will cover the inputs and how we store them for whatever period is required.
In general, collecting the inputs and metrics of monitoring should be focused on speed and accuracy, while other aspects of monitoring focus on consumption and representations that make the data meaningful. For this reason, making the generation of monitoring data easy for the implementer is critical for adoption. The generation mechanisms must be so lightweight that they impede neither the creator during development nor the user at runtime.
When considering what data to collect, we attempt to classify the “data type” for each metric. Stealing from some of the well-known SNMP data classifications, we can build a set of these types. Collected data generally falls into one of the following types:
- Boolean — This will have a true or false state and may represent something working or not working, something up or down, something is running or not, or something that may be open or closed. Examples include web page response, API response, the process running, or maybe even processing turned on or off for specific transaction sets.
- Scaled Variable — A value that has a lower and upper bound and is used to indicate where a measurement falls within that range. Examples include the total amount of disk space utilized (as compared to the total amount of disk space available) or a system or service that has a state of up, down, or somewhere in between (degraded).
- Counter — A value that increments over time and is used for velocity measurements when observed over varying time periods. Examples include the number of calls to an API, the number of web page hits, or the number of times a database request is performed.
- Event — Demonstrates a specific state that has been accomplished. Examples include a transaction which has completed billing prior to fulfillment, the release of a new software version, or the execution of a specific function or method such as passing through a processing engine.
Any of these data types may have metadata associated with them to support, explain, or further expand the data. This may be included as part of an n-tuple of the primary monitoring data, or as a secondary data set associated with the original monitoring result.
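As a rough illustration of the types above, here is a minimal Python sketch of a monitoring sample carried as a tuple-like record with optional metadata. The names `MetricType`, `Sample`, and `counter_rate` are my own assumptions for this example and do not come from SNMP or any particular library.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Union


class MetricType(Enum):
    BOOLEAN = "boolean"          # up/down, open/closed, running or not
    SCALED = "scaled_variable"   # value between a lower and upper bound
    COUNTER = "counter"          # value that only increments over time
    EVENT = "event"              # a specific state has been reached


@dataclass
class Sample:
    """One monitoring result plus optional supporting metadata."""
    name: str
    type: MetricType
    value: Union[bool, int, float, str]
    timestamp: float                          # seconds since the epoch
    metadata: Dict[str, str] = field(default_factory=dict)


def counter_rate(earlier: Sample, later: Sample) -> float:
    """Velocity of a counter: change in value per second between two samples."""
    elapsed = later.timestamp - earlier.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return (later.value - earlier.value) / elapsed
```

Observing the same counter at two points in time and dividing by the elapsed seconds gives the velocity measurement described in the Counter entry.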
In addition, some of these types may expose another type. For instance, two events may constitute a scaled variable such as the time to process a transaction. In this example, one event signifies the start of a transaction and another event signifies the end of a transaction.
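The start and end example can be sketched in a few lines of Python. This is only an illustration: the `Event` record and the event names `"start"` and `"end"` are hypothetical, and a real system would need to decide what to do with transactions that never finish.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Event:
    transaction_id: str
    name: str          # hypothetical event names: "start" or "end"
    timestamp: float   # seconds since the epoch


def transaction_durations(events: Iterable[Event]) -> Dict[str, float]:
    """Pair each start event with its end event and return seconds elapsed."""
    starts: Dict[str, float] = {}
    durations: Dict[str, float] = {}
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.name == "start":
            starts[event.transaction_id] = event.timestamp
        elif event.name == "end" and event.transaction_id in starts:
            began = starts.pop(event.transaction_id)
            durations[event.transaction_id] = event.timestamp - began
    return durations
```

Transactions with a start but no end simply do not appear in the result, which is itself a useful signal to monitor.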
When collecting monitoring statistics there are roughly two ways to gather this data: pull or push. Pull is where a centralized mechanism goes through a list of endpoints and gathers data. Push is where each endpoint sends data to a centralized location. In both cases, the collection can be individual data points or a batch of data points.
When using the pull mechanism, a central repository must maintain a list of endpoints to retrieve data from. This retrieval can quickly reach a scaling limit with as few as 100 endpoints depending on the data set size and the frequency of data gathering. Techniques such as multi-threading the requests or a tiered data gathering approach can be utilized to increase scale, but these can also reach a scale limitation.
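To make the multi-threading technique concrete, here is a minimal sketch of a pull loop using Python's standard `concurrent.futures` thread pool. The `fetch` callable is an assumption: it stands in for whatever actually retrieves data from a single endpoint (an HTTP call, an SNMP query, and so on).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Dict, Iterable, Tuple


def pull_all(
    endpoints: Iterable[str],
    fetch: Callable[[str], Any],   # hypothetical: retrieves data from one endpoint
    max_workers: int = 16,
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Poll every endpoint concurrently; return (results, failures)."""
    results: Dict[str, Any] = {}
    failures: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, ep): ep for ep in endpoints}
        for future in as_completed(futures):
            endpoint = futures[future]
            try:
                results[endpoint] = future.result()
            except Exception as exc:   # one bad endpoint must not stop the sweep
                failures[endpoint] = str(exc)
    return results, failures
```

The thread pool only raises the ceiling. The endpoint list still lives in one place and a single process still has to finish each sweep within the polling interval, which is why a tiered approach eventually becomes attractive.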
Conversely, a push mechanism leaves the data gathering responsibility to the endpoint. In this case, the endpoints themselves are aware of their upstream centralized destination and take responsibility for delivering the goods. Techniques such as tiered aggregation may still apply, but failures still happen, so a backup centralization point is sometimes required.
Regardless of push versus pull, reliability cannot be ignored; for a push approach, a primary and a secondary centralization point are recommended. Scalability cannot be ignored either, so tiered centralization points are still recommended.
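A primary and secondary centralization point can be handled on the endpoint with a simple ordered fallback. The sketch below assumes a hypothetical `send(destination, payload)` callable that raises `OSError` (or a subclass such as `ConnectionError`) when delivery fails.

```python
from typing import Callable, Optional, Sequence


def push_with_fallback(
    payload: bytes,
    destinations: Sequence[str],          # e.g. [primary, secondary], in priority order
    send: Callable[[str, bytes], None],   # hypothetical transport function
) -> Optional[str]:
    """Try each centralization point in order; return the one that accepted the data."""
    for destination in destinations:
        try:
            send(destination, payload)
            return destination
        except OSError:
            continue   # this point is unreachable, try the next one
    return None        # nothing accepted the data; caller may buffer and retry
```

Returning `None` rather than raising leaves the decision to buffer, retry, or drop the data with the caller.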
In the case of a push approach, it is normally recommended to use a “heartbeat” check at the centralization point. This requires the centralization point to keep track of which endpoints have delivered data within a given window and to notify an operator if an endpoint has not checked in recently.
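A heartbeat check of this kind can be sketched as a small tracker that remembers when each expected endpoint last delivered data. The class and method names here are illustrative assumptions, and time is passed in explicitly to keep the example easy to test.

```python
from typing import Dict, Iterable, List, Optional


class HeartbeatTracker:
    """Tracks the last check-in time of every expected endpoint."""

    def __init__(self, expected: Iterable[str], window_seconds: float) -> None:
        self.window = window_seconds
        # None means the endpoint has never checked in.
        self.last_seen: Dict[str, Optional[float]] = {ep: None for ep in expected}

    def record(self, endpoint: str, now: float) -> None:
        """Call whenever data arrives from an endpoint."""
        self.last_seen[endpoint] = now

    def missing(self, now: float) -> List[str]:
        """Endpoints that have not delivered data within the window."""
        return sorted(
            ep for ep, seen in self.last_seen.items()
            if seen is None or now - seen > self.window
        )
```

Whatever `missing()` returns is the list to hand to the operator notification.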
Log aggregation is a common scenario where the push mechanism is typically utilized. In this scenario, the log entries are produced and centralized so frequently that pushing individual entries results in very high levels of overhead (on the network, the system, and elsewhere). Using batching techniques will alleviate the load placed on both an endpoint and centralization server by combining multiple entries into a single submission.
Further benefits of push include pre-processing of the data about to be centralized. Filtering and compression techniques can further reduce the load incurred with the data submission.
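Batching, filtering, and compression on the endpoint can be combined in a few lines. This sketch uses Python's standard `json` and `gzip` modules; the `send` callable, the batch size, and the `keep` filter are assumptions made for the example.

```python
import gzip
import json
from typing import Any, Callable, Dict, List


class BatchingPusher:
    """Buffers entries and submits them as one compressed payload."""

    def __init__(
        self,
        send: Callable[[bytes], None],                     # hypothetical transport
        max_entries: int = 100,
        keep: Callable[[Dict[str, Any]], bool] = lambda entry: True,
    ) -> None:
        self.send = send
        self.max_entries = max_entries
        self.keep = keep
        self.buffer: List[Dict[str, Any]] = []

    def add(self, entry: Dict[str, Any]) -> None:
        if not self.keep(entry):          # filter before the data leaves the endpoint
            return
        self.buffer.append(entry)
        if len(self.buffer) >= self.max_entries:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        payload = gzip.compress(json.dumps(self.buffer).encode("utf-8"))
        self.send(payload)                # one submission for the whole batch
        self.buffer = []
```

A production version would also flush on a timer so that a quiet endpoint does not hold entries indefinitely.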
There is no one perfect way to centralize data. There are dependencies on the network, the data set (size and contents) to consider, and storage requirements. One of the largest requirements of a scalable centralization strategy is to distribute the work as much as possible.
A tiered approach (discussed in the previous section) will provide just that mechanism for scaling. In this case, the centralized monitoring solution will rely on a set of tiered instances to manage the eventual endpoints. This allows each tiered instance to manage a smaller set than if the centralized monitoring solution attempted to reach a large number of endpoints on its own.
This tiering does not come without its own complexities. Configuring endpoints to use the correct tiered instance, configuring tiered instances to “own” the correct (and evenly distributed) set of endpoints, and validating that all endpoints are providing data all lead to a complex system to design and manage.
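One simple way to decide which tiered instance “owns” an endpoint is to hash the endpoint name. The sketch below is an assumption of mine rather than a prescribed method; it uses `hashlib` instead of Python's built-in `hash()` because the built-in is randomized per process for strings.

```python
import hashlib
from typing import Dict, List, Sequence


def assign_tier(endpoint: str, tiers: Sequence[str]) -> str:
    """Deterministically map an endpoint to one tiered instance."""
    digest = hashlib.sha256(endpoint.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(tiers)
    return tiers[index]


def build_assignments(endpoints: Sequence[str], tiers: Sequence[str]) -> Dict[str, List[str]]:
    """Return the set of endpoints owned by each tiered instance."""
    owned: Dict[str, List[str]] = {tier: [] for tier in tiers}
    for endpoint in endpoints:
        owned[assign_tier(endpoint, tiers)].append(endpoint)
    return owned
```

The distribution is only approximately even, and a plain modulo reassigns most endpoints whenever the list of tiered instances changes. That is part of the complexity described above.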
We must also consider the size of the centralized data set. Rarely is monitoring data kept in a “raw” state for long periods of time. Some systems do indeed have this requirement, but the vast majority do not. In cases where forensic analysis of the monitoring data requires individual data points, the raw data points must be stored.
In the case where monitoring data is helpful, but not required at precise fidelity, “rollups” can be performed by taking averages (or other formulas) of the data provided over longer periods of time. For instance, the wildly popular RRDtool¹ data storage solution provides a mechanism to summarize data over time.
Within an RRD file (the round-robin database behind RRDtool), the administrator configures round-robin archives (RRAs) that define how long data is stored and when it is summarized. For example, you may configure the raw data to be stored for a month, then rolled up to daily averages for half a year, then rolled up to weekly averages for a year, then disposed of entirely beyond one year.
While this example is very specific to RRDtool, the philosophy applies well beyond it.
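The rollup idea itself is independent of any tool. As a minimal sketch (this is not RRDtool's API, and the function name is my own), averaging raw samples into fixed-size time buckets looks like this:

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def rollup(samples: Iterable[Tuple[float, float]], bucket_seconds: int) -> List[Tuple[int, float]]:
    """Average (timestamp, value) samples into buckets of bucket_seconds."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in samples:
        bucket_start = int(timestamp // bucket_seconds) * bucket_seconds
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]
```

Calling `rollup(raw, 86400)` would produce daily averages, and rolling up those results again with a bucket of `7 * 86400` would approximate the weekly step. Swapping `mean` for `max` or `min` gives the "other formulas" mentioned earlier.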
Additional data can be collected alongside monitoring results to provide specific types of meta-data that quickly identify a result's relationship to the overall system.
Some of the more commonly found meta-data:
- Transaction ID — Provides a linkage between monitoring events across disparate systems. This allows for later insights into the flow of a transaction and typically includes historical evidence of previously accomplished steps in a flow. Example: which system a transaction is flowing, or has flowed, through.
- Marker — Identifies a related event that has been shown to correlate to the overall monitoring results. While this isn’t meta-data on an individual monitoring result, this does provide keen insights into correlation (and perhaps causation). Example: a software release date and time.
- Labels — Not exactly monitoring, but meta-data that will provide context to the monitoring result. This allows for later filtering.
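To show how this meta-data gets used later, here is a small sketch over monitoring results stored as plain dictionaries. The field names (`system`, `transaction_id`, `timestamp`, `labels`) are hypothetical and chosen only for this example.

```python
from typing import Any, Dict, List

Result = Dict[str, Any]


def systems_visited(results: List[Result], transaction_id: str) -> List[str]:
    """Follow one transaction ID across systems, in time order."""
    hops = [r for r in results if r.get("transaction_id") == transaction_id]
    return [r["system"] for r in sorted(hops, key=lambda r: r["timestamp"])]


def filter_by_labels(results: List[Result], **wanted: str) -> List[Result]:
    """Keep only results whose labels match every requested key/value."""
    return [
        r for r in results
        if all(r.get("labels", {}).get(key) == value for key, value in wanted.items())
    ]
```

For instance, `systems_visited(results, "t1")` answers "which systems has this transaction flowed through", and `filter_by_labels(results, region="eu")` narrows results by context.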
When collecting monitoring data there are various points within a system that can emit results (or be polled for them). Given the nature of blog posts, it is incredibly difficult to express the complexities of large systems or to identify every point of monitoring. Instead, I will focus on the most common points of monitoring data generation.
Some of the common points of monitoring, shown in the above diagram, are:
- User/System Request — A user or another system makes a request for your application. This can be through a user interface or a programmable interface (i.e. API). This information informs the usage patterns of your service’s consumers including velocity and functional hotspots.
- System Processing — Regardless of the requested functionality, there is always some amount of system processing. Capturing this information is similar to capturing the user or system request; it will inform you as to how your application is processing a given request.
- Transaction — Tracking a transaction (or sub-transaction) provides great detail into how long a transaction lives as well as how often it is required. This can provide insights into the duration and velocity of critical sections of your application.
- Transaction Flow — Similar to Transaction monitoring (described above), the flow represents a transaction that exists across multiple systems. These insights go beyond the simple functionality and can inform you as to how data flows as well as uncover unexpected behaviors across your system.
- Persistence — By far one of the most valuable insights for performance, monitoring the persistence (storing and retrieving) of data can bring to light valuable performance characteristics such as hotspots and data sharing.
- Integration — Exposing the reliance on functionality and systems outside of direct control, this monitoring helps define the strict (or loose) dependency on external functionality. This insight can show the velocity, cascading performance impact, and failure dependency (to name a few). This is most commonly found internally with microservices, and externally with nearly every connected service.
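Most of these points can be instrumented the same way: wrap the section of interest and emit its duration and outcome. The sketch below uses a Python context manager; the `emit` callable and the `point` and `name` labels are assumptions standing in for whatever your monitoring pipeline accepts.

```python
import time
from contextlib import contextmanager
from typing import Any, Callable, Dict, Iterator


@contextmanager
def monitored(
    point: str,                               # e.g. "persistence" or "integration"
    name: str,                                # e.g. the query or remote call being made
    emit: Callable[[Dict[str, Any]], None],   # hypothetical sink for monitoring data
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[None]:
    """Time a block of code and emit one monitoring result when it exits."""
    start = clock()
    ok = True
    try:
        yield
    except Exception:
        ok = False
        raise                                  # never swallow the application's error
    finally:
        emit({
            "point": point,
            "name": name,
            "ok": ok,
            "duration_seconds": clock() - start,
        })
```

Usage would look like `with monitored("persistence", "load_user", emit): ...` around a database call. Keeping the wrapper this small is in line with the earlier goal of making data generation lightweight for the implementer.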
System vs Server
The world is changing: what we monitor today is unlike what we monitored years ago. My most recent experience with monitoring solutions has moved well beyond the value of individual server monitoring to one where monitoring the system as a whole brings the greatest benefit.
Take the example of monitoring a single server. If the disk fills, CPU spikes, or memory runs low, what is the value of this data? In a small architecture on dedicated instances, this is incredibly valuable. But systems such as this are increasingly uncommon.
It is far more likely that a system is deployed in a public cloud where instances come and go, either intentionally (autoscaling) or not (instance failures). In these cases, the mindset shifts from caring about the individual instance to caring about the group as a whole.
Given the need to distribute applications and systems globally, there is less importance on the instance. Workloads can be (and are) shifted constantly for cost and performance reasons (to name a few). For this reason, monitoring the server instance is less important.
When an instance does have an issue (disk space, CPU contention, memory pressure), the response now is to simply report on this occurrence (gathering some forensics during the incident) and then replace the instance. Unless the replacement velocity is high (an alertable event), there is no need to involve an operator.
This new mindset allows for some fairly dynamic systems, and in cases where a system spans many thousands of instances, this is likely the only value that will be derived. An operator is not going to manually debug each of thousands of machines.
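Replacement velocity can be watched with a sliding window over replacement timestamps. This is a minimal sketch; the class name and the threshold and window values are assumptions to be tuned for your own system.

```python
from collections import deque
from typing import Deque


class ReplacementVelocity:
    """Flags when instances are being replaced faster than expected."""

    def __init__(self, max_replacements: int, window_seconds: float) -> None:
        self.max_replacements = max_replacements
        self.window = window_seconds
        self.events: Deque[float] = deque()

    def record(self, now: float) -> bool:
        """Record one replacement; return True if an operator should be alerted."""
        self.events.append(now)
        # Drop replacements that have aged out of the window.
        while self.events and now - self.events[0] > self.window:
            self.events.popleft()
        return len(self.events) > self.max_replacements
```

Individual replacements stay silent, and only a burst of them within the window becomes an alertable event.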
Nowhere has this been more obvious than in the containerization efforts of solutions such as Docker. Control planes (e.g. Kubernetes, Swarm, Mesos) provide schedulers to handle the dynamic scale (up and down) as well as the replacement of failed (or degraded) instances. For this reason, individual instance (server) monitoring is valued far less than monitoring the system as a whole (i.e. is the scheduler doing what is expected).
What and how you collect monitoring data is always custom to your solution. There is no one right way to do things. Consider how you will collect data, the growth of your application (expected or unexpected), and how you intend to use the data over time. Once this is defined you can begin to define the locations within a system to monitor…and what to monitor.
(1) RRDtool — https://en.wikipedia.org/wiki/RRDtool