How can failure help with SRE? Isn’t that something to be avoided as much as possible?
Not entirely. When something goes wrong with data management or the operational pipeline in an enterprise infrastructure, it helps to treat the failure as a critical part of the process, one that lets teams learn and evolve. This is part of the gradual culture transformation that helps users adopt new features and use new processes to their fullest effect.
This is where SRE, by allowing mistakes to happen, can help ensure they do not happen again. Teams are encouraged to create processes called “blameless postmortems” (sometimes called “faultless postmortems”). Also known as post-incident reviews, these are key analyses that should be built into SRE practices and can facilitate the SRE adoption journey. People are not held at fault for mistakes made or undesirable outcomes. Instead, when an attempt goes poorly or simply doesn’t work out, the scenario can be reviewed without punishing those involved, with the aim of finding gaps or weak spots in operational processes so they can be filled in and shored up.
This requires ongoing process monitoring, so that data is always being gathered and incidents can be detected and resolved more easily and quickly. Streamlined resolution frees up time and resources for SRE teams, which in turn allows for better monitoring and better resolution, creating a cycle that constantly improves on itself.
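As a rough illustration of what "always gathering data" can feed into, the sketch below computes average time to detect and time to resolve from a list of incident records. The `Incident` fields and the `summarize` function are hypothetical names chosen for this example, not part of any particular monitoring tool; a real team would pull these timestamps from its own incident tracker.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record shape; field names are assumptions for this sketch.
    started: datetime   # when the fault began
    detected: datetime  # when monitoring or a person noticed it
    resolved: datetime  # when service was restored


def mean_minutes(deltas):
    """Average a series of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def summarize(incidents):
    """Return average detection and resolution times, in minutes."""
    return {
        "mean_time_to_detect_min": mean_minutes(
            i.detected - i.started for i in incidents
        ),
        "mean_time_to_resolve_min": mean_minutes(
            i.resolved - i.detected for i in incidents
        ),
    }
```

Tracking these two numbers from one postmortem to the next is one simple way to see whether the improvement cycle described above is actually shortening detection and resolution.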