Closer Look: Observability

5 Min Read | April 16, 2020

 

In this article:

  • How observability and monitoring differ from and complement each other;
  • Why observability is gaining traction in modern IT environments;
  • How observability’s deeper focus on system behavior gives IT teams the context they need to solve new, unknown issues.

As enterprise IT systems have become more complex and distributed due to cloud infrastructure, containers, serverless technology, IoT, SDN, open source development tools, and an ever-growing footprint of applications and devices, the practice of performance monitoring has become far more nuanced. In these modern IT environments, traditional monitoring practices centered on known issues aren’t enough, which is why you will also frequently hear people talking about “observability.” The term originates in control theory and has been attributed to the work of electrical engineer Rudolf Kalman.

There are plenty of opinions on how observability differs from or complements monitoring. In this blog post, we described it as a key characteristic of modern IT operations: “Observability differs from monitoring, in that it focuses on the development of the application and rich instrumentation so that operators can ask arbitrary questions about how the software works.” One goal of observability is to discover the unknown unknowns: failure modes you did not anticipate.

In the video below, Nancy Gohring, Senior Analyst at 451 Research, describes observability as a smart way to manage cloud-native technologies: “The idea is to be able to flexibly dig into operations data collected about your systems in order to answer questions. It’s popular now in organizations which have adopted cloud native technologies that make your system more complex and dynamic, and where traditional monitoring doesn't work great.”


Adopting observability practices can help you ask and answer questions like:

  • How well are the different parts of your distributed application working?
  • Can you understand the internal functioning of complex and interconnected systems using a small set of signals?
  • Can you pinpoint the specific performance variations that can cause unexpected issues in your application user experience?
  • Do you have the right signals to easily debug and restore a decentralized application?
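
As a rough illustration of what “asking arbitrary questions” can look like in practice, the sketch below filters wide, high-cardinality request events to see which slice of traffic is driving latency. The event fields, values, and the `slice_latency` helper are all hypothetical, not any particular product's data model:

```python
from collections import defaultdict
from statistics import median

# Hypothetical wide events: one record per request, with many dimensions.
# Real systems emit these from instrumentation; the fields here are illustrative.
events = [
    {"endpoint": "/checkout", "region": "us-east", "version": "v2", "duration_ms": 820},
    {"endpoint": "/checkout", "region": "us-east", "version": "v1", "duration_ms": 95},
    {"endpoint": "/checkout", "region": "eu-west", "version": "v2", "duration_ms": 110},
    {"endpoint": "/search",   "region": "us-east", "version": "v2", "duration_ms": 60},
]

def slice_latency(events, *dimensions):
    """Group events by arbitrary dimensions and report median latency per slice."""
    groups = defaultdict(list)
    for e in events:
        key = tuple(e[d] for d in dimensions)
        groups[key].append(e["duration_ms"])
    return {key: median(vals) for key, vals in groups.items()}

# An ad-hoc question, decided after the data was collected:
# is slowness tied to a region, a version, or a combination of both?
print(slice_latency(events, "region", "version"))
```

The point of the sketch is that the question (group by region and version) was not baked into a dashboard ahead of time; rich, dimensional instrumentation lets you slice the same data any way an investigation requires.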

 

For one more perspective, Cindy Sridharan goes into great detail on the subject on Medium: “Monitoring is best suited to report the overall health of systems...and is best limited to key business and systems metrics derived from time-series based instrumentation, known failure modes as well as blackbox tests.” She goes on to explain that observability aims to provide highly granular insights into the behavior of systems and doesn’t necessarily have to be linked to an incident or user complaint.


Why do we need capabilities for observability right now?

As systems and applications have become more distributed and ephemeral due to cloud and software-defined infrastructure, we don’t always know if there’s a problem, much less what caused it:

  • Resources are changing frequently and behind the scenes in virtual and cloud environments; 
  • There are far more interdependencies between systems and components running across clouds and on-premises data centers;
  • APIs have enabled deep integrations between applications no matter where they live, creating a tangle of interconnected and fragile relationships; 
  • Visibility into all environments, all the time, is difficult to achieve.

At this stage of cloud computing maturity, IT and business executives have limited patience with frequent or long-lasting outages and application glitches that affect users for hours or even days. Monitoring and observability combined can help IT operations teams arrive at the root cause of incidents faster, and that is a critical factor in delivering business value.

To get started on incorporating observability in your IT management and monitoring practices:

  • Broaden the data types collected: Observability requires more than the traditional server and network metrics, such as CPU utilization and latency, that monitoring relies on. Include logs, traces, metrics, and alerts from every infrastructure component so that new questions can be asked of the data.
  • Adopt new tools: The complexity and fluidity of modern IT environments require a deeper understanding of system and network behavior than traditional monitoring systems with standard dashboards can deliver. You’ll need tools that can collect and combine data from many different on-premises, virtual, and cloud environments and rapidly correlate that data to arrive at new insights. Watch our recent webinar to learn how OpsRamp can work as the digital command center to orchestrate data insights and event management.
  • Develop new skills: IT operations personnel, especially SREs, will need skills in infrastructure automation, including creating self-service tools, along with strong problem-solving capabilities and investigative instincts. They should have experience with massive-scale deployments and DevOps, and understand complex architectures involving microservices, cluster management, containers, and cloud. Finally, the best observability pros will have sharp business acumen to map the big picture of IT incidents to distinct user and customer needs.
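
The “broaden the data types” and “correlate that data” steps above can be sketched together. In the toy example below, logs, traces, and metrics share a common trace ID so that one question can draw on all three data types at once; the record shapes and the `trace_id` field are assumptions for illustration, not any specific vendor’s format:

```python
# Minimal sketch of correlating heterogeneous telemetry by a shared trace ID.
# Real pipelines normalize data from logging, metrics, and tracing backends
# into a common schema first; these hand-written records stand in for that.
logs = [
    {"trace_id": "t-42", "level": "ERROR", "msg": "payment gateway timeout"},
    {"trace_id": "t-7",  "level": "INFO",  "msg": "request ok"},
]
traces = [
    {"trace_id": "t-42", "service": "checkout", "span": "charge_card", "duration_ms": 5100},
    {"trace_id": "t-7",  "service": "search",   "span": "query",       "duration_ms": 40},
]
metrics = [
    {"trace_id": "t-42", "cpu_pct": 38},
    {"trace_id": "t-7",  "cpu_pct": 35},
]

def correlate(trace_id):
    """Join all telemetry types on one trace ID to build incident context."""
    return {
        "logs":    [l for l in logs if l["trace_id"] == trace_id],
        "traces":  [t for t in traces if t["trace_id"] == trace_id],
        "metrics": [m for m in metrics if m["trace_id"] == trace_id],
    }

# Start from an error log, then pull in the slow span and host metrics behind it.
context = correlate("t-42")
print(context["traces"][0]["span"], context["traces"][0]["duration_ms"])
```

Joining on a shared identifier is what turns three separate data silos into a single narrative of an incident: the error message, the 5.1-second span that produced it, and the host conditions at the time.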
