Monitoring is the practice of understanding how software components run in remote environments. Observability (in a software context) is how well a software component can be understood by looking only at its external outputs.
By the end of this article you’ll understand what observability and monitoring are, why they’re important, and some strategies for implementing them.
What Is Monitoring / Observability?
Many in the cloud software engineering community make a distinction between monitoring and observability. Let’s take a moment to define them both and understand their differences and similarities.
Monitoring: Adding instrumentation to a software component so that bugs and issues can be diagnosed or operational work performed.
Observability: The ability to understand how a (software) system operates by reviewing its outputs.
The Three Pillars Of Monitoring
Monitoring discussions are often framed around the so-called “three pillars”.
At this point, I must note that the three pillars have often come under scrutiny for being too simplistic and for taking a tools-first approach. That said, regardless of the criticism, the three pillars are still a useful framework for understanding the different tools we have at our disposal to monitor our software systems.
So what are the “three pillars”?
- Logs — Logs are data emitted by a software component that provide in-depth insight into how a software system works.
- Metrics — Metrics are aggregated values that allow a software component to be understood at a high level.
- Traces — Traces are typically used to monitor a full request lifecycle, even as it hops between software components. Traces show us how long a request takes at each step in a process, and they help us debug bottlenecks and identify usage patterns.
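To make the three pillars more concrete, here is a minimal sketch that produces all three kinds of output using only the Python standard library. The names (`log`, `traced_step`, `handle_request`, `requests_total`) are hypothetical and chosen for illustration; a real system would more likely use a dedicated logging, metrics or tracing library.

```python
import json
import time
import uuid
from collections import Counter

metrics = Counter()  # Metrics: aggregated values, here simple counters
spans = []           # Traces: timed steps that share a single trace id


def log(message, **fields):
    """Log: a detailed record of one thing that happened."""
    record = {"message": message, **fields}
    print(json.dumps(record))
    return record


def traced_step(trace_id, name, func):
    """Trace: record how long one step of a request took."""
    start = time.perf_counter()
    try:
        return func()
    finally:
        spans.append({
            "trace_id": trace_id,
            "step": name,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })


def handle_request(user_id):
    """A hypothetical request handler that emits all three signals."""
    trace_id = str(uuid.uuid4())
    metrics["requests_total"] += 1
    user = traced_step(trace_id, "load_user", lambda: {"id": user_id})
    traced_step(trace_id, "render", lambda: f"hello {user['id']}")
    log("request handled", trace_id=trace_id, user_id=user_id)
    return trace_id
```

Calling `handle_request(42)` increments one counter, records two timed steps under the same trace id, and prints one JSON log line.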
Why Learn Monitoring?
At this point you might still be unclear about why monitoring is relevant and why, as a software engineer, you might need it. If so, let’s take a moment to go through a few reasons.
Monitoring skills are expected of software engineers — More companies are moving to the “you build it, you run it” model of software engineering, which dictates that the software engineers who build a system should also be the ones to monitor it. Under this model software engineers need to have an in-depth understanding of how their systems work, and how to debug them if and when needed.
Monitoring is a high demand skill — There is continuing and growing demand for software engineers to understand monitoring. Understanding how to instrument and debug services is a skill that will differentiate software engineers.
Monitoring helps us write better software — Running software in production is a unique experience. Our code fails in ways we could never imagine. Adding good monitoring helps us gather data about how our software runs and where it fails. Ultimately, by adding monitoring data we can write better software in the long run.
Adding Monitoring To An Existing Service
Now that we’ve discussed more of what monitoring is, and why it’s useful, let’s start to address the topic of how we implement it.
Often when we’re adding monitoring, we’re applying it to a service that already exists. If this is the case, there are a few places that we can start:
- Create a dashboard — First up, creating a dashboard can help us understand the high level metrics of the system: How many requests does it receive? How often does it fail? etc.
- Log common failures — Once we understand our high level metrics, we will want to understand where our service is failing. Adding logs to these failure points gives us more understanding of the problem areas in our application.
- Log unknown errors — Now that we are logging all known errors, we’ll also want to ensure that we’re capturing all of the unknown errors, the ones that slip through the cracks.
- Create a Runbook — Runbooks are documents that detail how a system is to be operated. As we go through our service, it’s useful to document queries that we run, and link to useful resources that might help others in future.
- Set up an alarm — Lastly, we will likely want to set up an alarm to ensure that our service is operating as expected.
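The logging and alarm steps above can be sketched roughly as follows. This is an illustration under stated assumptions rather than a definitive implementation: the `orders` logger name, the `OutOfStock` exception, the `handle` and `should_alarm` helpers, and the 5% threshold are all hypothetical, and in practice the alarm would normally live in your monitoring platform rather than in application code.

```python
import logging

logger = logging.getLogger("orders")  # hypothetical service name


class OutOfStock(Exception):
    """A known, expected failure mode (hypothetical)."""


def handle(order, process):
    """Run `process` and log known and unknown failures separately."""
    try:
        process(order)
        return "ok"
    except OutOfStock:
        # Known failure: we anticipated this case and log it explicitly.
        logger.warning("known failure: out of stock",
                       extra={"order_id": order["id"]})
        return "known_error"
    except Exception:
        # Unknown failure: a catch-all so nothing slips through the cracks.
        # logger.exception also records the stack trace.
        logger.exception("unknown failure", extra={"order_id": order["id"]})
        return "unknown_error"


def should_alarm(outcomes, threshold=0.05):
    """Return True when the share of failed requests exceeds the threshold."""
    if not outcomes:
        return False
    failures = sum(1 for outcome in outcomes if outcome != "ok")
    return failures / len(outcomes) > threshold
```

Alarming on an error rate rather than a raw error count is one possible design choice; it keeps the alarm meaningful as traffic grows or shrinks.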
One-Per-Service / Event Logging
When it comes to monitoring, or more specifically logging, one of the first questions you might have is: “How should I add logs? What format should they be in, and how should I structure them?”
After experimenting with different techniques myself, I eventually stumbled across a monitoring pattern that worked better than any other.
That pattern is “one-per-service” or “event logging”. The pattern is quite simple: you gather data into an object during the lifecycle of a request in a service. When the service either fails to respond or responds successfully, the event is emitted.
When you have all your log values bundled into a single event object, it’s easier to analyse and understand what’s going on in your service.
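As a minimal sketch of this pattern, the function below builds up a single event dictionary while handling a request and emits it exactly once, whether the request succeeds or fails. The function name `handle_with_event`, the field names (`service`, `path`, `outcome`, `duration_ms`) and the `emit` parameter are assumptions made for illustration, not a prescribed schema.

```python
import json
import time


def handle_with_event(request, process, emit=print):
    """Handle one request and emit one event describing it."""
    # Start the event with whatever we know up front.
    event = {"service": "example-service", "path": request.get("path")}
    start = time.perf_counter()
    try:
        result = process(request)
        event["outcome"] = "success"
        return result
    except Exception as exc:
        # Record the failure on the same event instead of a separate log line.
        event["outcome"] = "failure"
        event["error_type"] = type(exc).__name__
        event["error_message"] = str(exc)
        raise
    finally:
        # Emitted exactly once per request, on success or failure.
        event["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
        emit(json.dumps(event))
```

Because every value for a request ends up in one JSON object, a single query over these events can answer questions such as “which paths fail most often?” without joining multiple log lines together.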
Here is every article that I’ve written about Monitoring & Observability:
- You’re Alerting Wrong: The Why & How Of Setting An AWS Lambda Alarm Using Error Rate Percentages.
- How To Setup Monitoring / Observability On Existing Software (e.g. A Web API): A Practical 5 Step Guide.
- You’re Logging Wrong: What One-Per-Service (“Phat Event”) Logs Are and Why You Need Them.
- How To Get AWS Lambda Logs Into CloudWatch