Site reliability engineers are responsible for maintaining system observability, and while detailed data collection tools can help, they can also hinder visibility if not used properly.
This was a major theme among presentations by expert site reliability engineers at this month’s SRECon. Namely, it is not about the amount of data collected; it’s about the extent to which it’s used to serve the business, keep systems running smoothly, and keep team members informed. Observability, a term that has supplanted IT monitoring in cloud-native environments, refers to a practice in which systems can be effectively interrogated to troubleshoot or prevent problems in real time, with an emphasis on overall user experience rather than the performance of individual system components.
Making good use of observability data starts with asking the right questions, aligned with the needs of the organization, according to presenters from SRE software vendor Blameless, who showed examples of the internal dashboards they use to track service reliability against business priorities.
“As SRE leaders, you can often be seen as the bearer of bad news, especially for management,” said Christina Tan, member of the Blameless strategy team. “Understand business needs [means] that instead of being viewed as a cost center, for SRE teams, you can show how you are contributing to business goals and company growth.”
The places within each organization where system reliability needs to be improved may seem endless, but SREs should prioritize the most important goals. Aligning observability data collection with specific goals will also help SREs present a more useful set of metrics to developers and business leaders.
“When companies invest in incident resolution, they may still have the same number of incidents, but the severity of the impact on customers will decrease significantly,” said Mindy Stevenson, director of engineering at Blameless, during the SRECon presentation. “And so maybe instead of the number of incidents, the severity of the incident is a better measure.”
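Stevenson's point about counting incidents versus weighing their severity can be illustrated with a minimal Python sketch. The incident records and the severity weights below are hypothetical, invented for illustration only; they are not data from Blameless or from the presentation.

```python
from collections import Counter

# Hypothetical incident records: (quarter, severity), where 1 is the most severe.
incidents = [
    ("Q1", 1), ("Q1", 1), ("Q1", 2), ("Q1", 3),
    ("Q2", 3), ("Q2", 3), ("Q2", 2), ("Q2", 4),
]

# Hypothetical weights: customer impact is assumed to grow sharply with severity.
SEVERITY_WEIGHT = {1: 10, 2: 5, 3: 2, 4: 1}


def summarize(records):
    """Return {quarter: (incident_count, weighted_severity_score)}."""
    counts = Counter()
    scores = Counter()
    for quarter, severity in records:
        counts[quarter] += 1
        scores[quarter] += SEVERITY_WEIGHT[severity]
    return {q: (counts[q], scores[q]) for q in sorted(counts)}


summary = summarize(incidents)
for quarter, (count, score) in summary.items():
    print(f"{quarter}: {count} incidents, weighted severity {score}")
```

In this made-up data set both quarters have four incidents, so a count-only dashboard would show no change, while the weighted score falls from 27 to 10.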
Observability starts with strong queries
The Blameless presenters divided the prioritization questions into four categories: change management, monitoring and detection, incident management, and continuous improvement. The size and maturity of the organization will dictate the questions SREs should ask when collecting data. For example, a mature enterprise SaaS company may focus on satisfying an existing customer base and minimizing risk, while a small, growth-stage startup may favor rapid change and respond to issues as they arise.
Prioritization can help, but SREs also need to be aware that the metrics they choose to focus on can have unintended consequences, Stevenson said.
For example, setting a lower goal for cycle time in software delivery – the time it takes a team to complete a task or project – may mean that a team feels compelled to move faster and to be less careful and diligent in code reviews.
“The potential for negative consequences or outcomes should not prevent us from creating measures,” Stevenson said. “[But] it is worth taking the time to think through each of these measures in detail, as they can have impacts.”
Once SREs have defined metrics and collected data, they can perform a gap analysis to determine which areas need the most improvement and demonstrate this to the business. While SREs should be able to show improvements and successes, the data can also help them flag where improvement is needed rather than obscure it, according to Stevenson.
“It creates transparency and a willingness to admit when things aren’t perfect,” she said. “When you have everyone in the organization – from engineering teams to the VP of sales – with access and visibility on the same dashboard, [it] creates a common understanding and the ability to have a conversation, with supporting data, about why we would focus on one area or another.”
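The gap analysis Stevenson describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four categories come from the presentation, but the metric names, targets and measured values are hypothetical, and expressing each gap as a fraction of its target is one possible choice, not the presenters' method.

```python
# Hypothetical targets and measurements, one metric per category.
# Every number here is invented for illustration.
METRICS = [
    # (category, metric, target, actual, lower_is_better)
    ("change management", "change failure rate (%)", 5.0, 12.0, True),
    ("monitoring and detection", "minutes to detect", 5.0, 8.0, True),
    ("incident management", "minutes to resolve", 60.0, 45.0, True),
    ("continuous improvement", "follow-up actions completed (%)", 90.0, 60.0, False),
]


def relative_gap(target, actual, lower_is_better):
    """Shortfall as a fraction of the target; positive means the target is missed."""
    shortfall = actual - target if lower_is_better else target - actual
    return shortfall / target


def rank_gaps(metrics):
    """Return (category, metric, gap) tuples, largest shortfall first."""
    gaps = [(cat, name, relative_gap(t, a, low)) for cat, name, t, a, low in metrics]
    return sorted(gaps, key=lambda row: row[2], reverse=True)


ranked = rank_gaps(METRICS)
for category, metric, gap in ranked:
    status = "misses target" if gap > 0 else "meets target"
    print(f"{category:26s} {metric:34s} {gap:+.0%} ({status})")
```

Normalizing by the target lets metrics with different units share one ranking, which is the kind of single list that a dashboard shared across engineering and sales could show.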
Campspot narrows the aperture on the observability spotlight
Another SRECon presenter compared the proliferation of observability tools to light pollution in the night sky: there can be too much light on problems in too many directions. For Campspot, a Grand Rapids, Michigan-based company that provides online reservation services for campsites and RV parks, standardizing on a distributed tracing tool from Honeycomb.io that all members of the engineering organization could use provided much-needed guidance.
“We were trying to get all of our Java application metrics into Prometheus, and it just wasn’t doing the job for us,” said Kristin Smith, DevOps Services Team Lead at Campspot. “It was hard to see how our growing app ecosystem was talking to each other, and we ran into issues with many new [college graduates] and bootcamp graduates who had never looked at application metrics before, and suddenly they had to learn PromQL on top of everything else.”
Honeycomb’s distributed tracing tool provided data on application relationships and the ability to track customers as they moved through the online reservation system. Additionally, the company has started sending all alerts about computer performance issues through PagerDuty.
“No more secret alerts in developer emails because they didn’t want to scare anyone off,” Smith said. “No, we need to know what’s going on.”
Campspot’s SREs also decided to set “pretty good” service level objectives (SLOs) when revamping the company’s observability instrumentation, although Smith said that if she had to do it again, she would have involved the company’s business leaders, especially sales managers, earlier in the process.
“I put together week two of this Honeycomb implementation, which was supposed to be our SLO workshop, and I was so excited,” she said. “All the sales people came into the room and said, ‘What do you mean we’re not talking availability!?'”
Smith’s intentions to prioritize metrics other than availability, including performance, customer experience and acceptable levels of failure within company systems, were good, she said, but could have been better explained to the sales team.
“If I had started with our sales team and said, ‘Hey, I know you guys don’t necessarily care how all the sausages are made, but we’re looking at moving to a system where we make deals with our customers telling them exactly what we want to give them for each of these workflows instead of just uptime,’ that team would have taken it and sold it to every customer before we had even signed a deal with Honeycomb,” Smith said.
Instead, Smith ran into a sort of organizational technical debt stemming from the way monitoring and alerting have been implicitly tied to safety, success and job security at many companies.
“We have to be careful when we talk about refusing to alert that we’re very, very, very clear that we’re not trying to go back to cloudiness,” she said. “We just try to lower it enough to focus on what we’re lighting [that which] is most important and not just about the things that are symptoms.”
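The per-workflow objectives Smith describes, as opposed to a single uptime number, can be sketched using the standard SRE notions of a service level indicator (SLI) and an error budget. This is a hedged illustration, not Campspot's or Honeycomb's implementation: the `WorkflowSLO` class, the workflow names, the targets and the event counts are all hypothetical, and a "good" event is assumed to mean a request that succeeded within an agreed latency threshold.

```python
from dataclasses import dataclass


@dataclass
class WorkflowSLO:
    """Hypothetical per-workflow objective; not a Honeycomb or Campspot API."""
    name: str
    target: float      # fraction of events that must be good, e.g. 0.99
    good_events: int   # assumed: requests that succeeded within a latency threshold
    total_events: int

    def __post_init__(self):
        if not 0 < self.target < 1:
            raise ValueError("target must be strictly between 0 and 1")
        if self.total_events <= 0 or not 0 <= self.good_events <= self.total_events:
            raise ValueError("event counts are inconsistent")

    @property
    def sli(self) -> float:
        """Measured fraction of good events."""
        return self.good_events / self.total_events

    @property
    def error_budget_remaining(self) -> float:
        """Fraction of the allowed failures still unused (negative if overspent)."""
        allowed_failures = (1.0 - self.target) * self.total_events
        actual_failures = self.total_events - self.good_events
        return 1.0 - actual_failures / allowed_failures


# Invented workflows and numbers, for illustration only.
search = WorkflowSLO("search campsites", target=0.99, good_events=9950, total_events=10000)
booking = WorkflowSLO("complete reservation", target=0.999, good_events=9980, total_events=10000)

for slo in (search, booking):
    print(f"{slo.name}: SLI {slo.sli:.2%}, budget remaining {slo.error_budget_remaining:+.0%}")
```

In this sketch the search workflow has used about half of its allowed failures, while the reservation workflow has overspent its budget even though its success rate is higher. That is the kind of distinction a single availability figure would hide.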
A data visualization call to action for SREs
Enterprise engineers — and the vendors who sell them observability products — need to improve data visualization, said Dan Shoop, senior SRE at location-based advertising firm GroundTruth in New York.
“It’s very popular for observability platform vendors to produce lots of graphs, but we need … multivariate observability visualizations,” Shoop said in an SRECon presentation. “In other words, including several different types of [high] cardinality data in a particular chart rather than having it all in separate cells.”
Other common pitfalls in data visualization design include the lack of common scales for multivariate data represented in the same graphs, which can create misleading assumptions; imprecise use of data such as percentages and averages; “chartjunk”, such as color shading that obscures what a chart is supposed to represent; truncated or torn graphics; and missing scales, keys, attributions, and other critical explanations. According to Shoop, accompanying words can be just as important as images for effective data visualization.
“Engineering and science is about repeatability,” Shoop said. “[Linking directly to an original chart for attribution] would allow other engineers to double-check our work but also perhaps steal a good model for reuse.”
To demonstrate this point, Shoop showed a graphic that another SRECon presenter had used in a previous presentation. The chart’s style was a pencil drawing, but it layered multiple data points on top of each other to create a more meaningful whole and included simple verbal descriptions, Shoop said.
“Sometimes, really, all you need is just to be able to express [things] as you can,” he said. “You don’t really need a lot of tools sometimes, other than, like, a pencil.”
Beth Pariseau, Senior Writer at TechTarget, is an award-winning veteran of IT journalism. She can be reached on Twitter @PariseauTT.