Operations, as a discipline, is hard.

Not only is there the generally unsolved question of how to run systems well, but the best practices that have been found to work are highly context-dependent and far from widely adopted.

Observability is the cornerstone of Site Reliability Engineering (SRE). It provides real-time insight into system behaviour by collecting and processing telemetry data, giving you a better understanding of how your applications behave at runtime.


Metrics, logs & traces

Metrics are numeric representations of system state over a specified interval of time, usually expressed as a percentage of total usage of “something”, a count, or a rate.
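As a small sketch of what that looks like in practice, the snippet below records a utilisation percentage and a running total with the Prometheus Python client; the metric names and values are illustrative, not a recommended naming scheme.

    # A tiny sketch with the Prometheus Python client (pip install prometheus-client);
    # metric names and values are illustrative.
    from prometheus_client import Counter, Gauge, generate_latest

    cpu_utilisation = Gauge("node_cpu_utilisation_percent", "CPU in use, percent")
    requests_total = Counter("app_requests_total", "Requests handled since start")

    cpu_utilisation.set(63.5)   # a point-in-time percentage reading
    requests_total.inc()        # a running total, from which per-interval rates are derived

    # Print the metrics in the text exposition format a scraper would collect.
    print(generate_latest().decode())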

Traces represent the full path of a request from the parent process down to the sub-routines and children involved in it. They show causality and help visualise bottlenecks.
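A minimal sketch with the OpenTelemetry Python SDK shows the parent-child structure a trace captures; the service and span names are assumptions made up for the example.

    # A minimal sketch with the OpenTelemetry Python SDK (pip install opentelemetry-sdk);
    # service and span names are illustrative.
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    tracer = trace.get_tracer("checkout-service")

    # The parent span covers the whole request; child spans show where the time goes.
    with tracer.start_as_current_span("handle_request"):
        with tracer.start_as_current_span("query_database"):
            pass  # database work would happen here
        with tracer.start_as_current_span("call_payment_api"):
            pass  # the downstream call would happen here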

Logs are records of events that happen at discrete points in time on specific systems. The data is typically stored in log files, which come in three formats: plain text, structured (for example JSON), and binary.
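For illustration, the sketch below emits the same event as a plain-text line and as a structured JSON record, using only the Python standard library; the field names are an assumption, not a standard.

    # A small sketch of plain-text vs. structured (JSON) logging using only the
    # Python standard library; field names are illustrative.
    import json
    import logging
    from datetime import datetime, timezone

    class JsonFormatter(logging.Formatter):
        def format(self, record):
            return json.dumps({
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            })

    plain = logging.getLogger("plain")
    plain.addHandler(logging.StreamHandler())      # default: human-readable text

    structured = logging.getLogger("structured")
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())          # machine-parsable JSON lines
    structured.addHandler(handler)

    plain.warning("disk usage above 90% on node-3")
    structured.warning("disk usage above 90% on node-3")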

Achieving successful observability requires deploying appropriate Kubernetes monitoring tools and putting effective processes in place for collecting, storing and analysing these three primary outputs.


The actual challenge

Kubernetes doesn’t expose metrics, logs and trace data in the same way traditional apps and VMs do. Kubernetes tends to expose data “snapshots”: information captured at a specific point in a component’s life cycle.

In a system where each component within every cluster records different types of data in different formats at different speeds, it can be extremely challenging to “stitch” everything together and make sense of it.

Furthermore, Kubernetes does not centralise logs out of the box, so every app and cluster records data in its respective environment. Remember that you need to record not only the logs of your applications, but also the internal events of the Kubernetes control plane.
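To make the point concrete, here is a hedged sketch using the official Kubernetes Python client: logs are fetched pod by pod and events namespace by namespace, and shipping them to a central store is entirely up to you. The pod and namespace names are illustrative.

    # A sketch using the official Kubernetes Python client (pip install kubernetes).
    # It only shows that logs and events are fetched per pod / per namespace;
    # aggregation into a central store is left entirely to you.
    from kubernetes import client, config

    config.load_kube_config()          # or config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()

    # Application logs: retrieved one pod at a time.
    for pod in v1.list_namespaced_pod("default").items:
        logs = v1.read_namespaced_pod_log(pod.metadata.name, "default", tail_lines=10)
        print(f"--- {pod.metadata.name} ---\n{logs}")

    # Cluster events (scheduling, restarts, evictions): a separate API, also per namespace.
    for event in v1.list_namespaced_event("default").items:
        print(event.last_timestamp, event.reason, event.message)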


Combining monitoring & observability

Monitoring and observability are key parts of maintaining an efficient Kubernetes infrastructure.

  • Monitoring clarifies what is happening in the system.
  • Observability clarifies why the system behaves as it does.

So how do you choose from a range of possible solutions?

Broadly speaking, CTOs or Architects can start with a number of questions that are both business- and technology-related:

  1. What are the business goals: reducing costs, shortening time to market, or improving performance?
  2. Does your team have the capability to deploy and operate complex architectures with several observability solutions at once, or would you be better off using an “all-in-one” integrated framework such as OpenTelemetry?

You cannot collect a trace of every single transaction: doing so would overload the system and waste resources. Implementing an observable platform therefore requires judgement about what to sample, and at what rate.
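One common way to exercise that judgement is head-based sampling. The sketch below keeps a fixed fraction of traces using the OpenTelemetry Python SDK; the 10% ratio and service name are arbitrary illustrations, not recommendations.

    # A sketch of head-based sampling with the OpenTelemetry Python SDK;
    # the 10% ratio is an arbitrary illustration.
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
    from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

    # Record roughly 1 in 10 traces; the rest are dropped at the source,
    # so they never consume export bandwidth or backend storage.
    provider = TracerProvider(sampler=TraceIdRatioBased(0.1))
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer("orders-service")
    for _ in range(100):
        with tracer.start_as_current_span("process_order"):
            pass  # only about 10 of these spans will actually be exported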


"What" is important for you to know

Assuming all the above challenges are addressed early in your Monitoring & Observability infrastructure design, the question of WHAT to log, trace and monitor not only has a direct impact on your cost of implementation, processing and storage, but also gives you a complementary “bottom-up” view of the problem domain.

USE with Love: The four golden signals

To maintain a robust and efficient Kubernetes infrastructure, focus on the following key metrics:

  • Utilisation: Track the usage of resources to ensure they are being effectively utilised without overuse or underuse.
  • Saturation: Monitor the frequency and types of resource saturation. For example, CPU-bound bottlenecks differ from memory-bound bottlenecks, which are common in Kubernetes clusters.
  • Errors: Pay attention to the types and levels of errors occurring in the system.
  • Latency: Measure the latency of operations, keeping in mind that different types of processes have varying latency implications.

You care about all of these golden signals, whether your main applications are frontends that are sensitive to request-response bottlenecks, or data-processing pipelines and backends where data “freshness” and resource saturation matter most.
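As a concrete sketch, the four signals could be instrumented with the Prometheus Python client roughly as follows; the metric names, the request handler and the stand-in values are assumptions made for the example.

    # A rough sketch of the four signals with the Prometheus Python client
    # (pip install prometheus-client); names and values are illustrative.
    import time
    from prometheus_client import Counter, Gauge, Histogram, start_http_server

    # Utilisation: how much of a resource is in use right now.
    memory_utilisation = Gauge("app_memory_utilisation_percent", "Memory in use, percent")
    # Saturation: how much work is queued up waiting for the resource.
    queue_depth = Gauge("app_work_queue_depth", "Items waiting to be processed")
    # Errors: a counter, so alerting can be done on the error rate.
    errors_total = Counter("app_errors_total", "Failed requests")
    # Latency: a histogram, so percentiles (p95, p99) can be derived later.
    request_latency = Histogram("app_request_latency_seconds", "Request duration in seconds")

    def handle_request():
        start = time.perf_counter()
        try:
            pass  # real work (serving the request) would happen here
        except Exception:
            errors_total.inc()   # errors: count failures so a rate can be alerted on
            raise
        finally:
            request_latency.observe(time.perf_counter() - start)  # latency: feed the histogram

    if __name__ == "__main__":
        start_http_server(8000)            # expose /metrics for a Prometheus-style scraper
        while True:
            memory_utilisation.set(41.0)   # utilisation: stand-in for a real reading
            queue_depth.set(3)             # saturation: stand-in for a real queue length
            handle_request()
            time.sleep(1)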

Applying the golden signals

Different latency metrics have varying implications. For instance, a 10-second latency in an HTTP request-response cycle is problematic for end-users, indicating a significant issue. However, the same latency for a background batch process may not be problematic.
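As a toy illustration of that distinction, the same observed latency can be checked against per-workload budgets; the 300 ms and 30-minute figures below are made-up numbers, not guidance.

    # A toy sketch: the same 10-second latency is judged against per-workload
    # budgets; the budget values are made-up numbers.
    LATENCY_BUDGET_SECONDS = {
        "http_request": 0.3,    # interactive users notice anything slower
        "batch_job": 30 * 60,   # a nightly batch run has a much looser budget
    }

    def breaches_budget(workload: str, observed_seconds: float) -> bool:
        return observed_seconds > LATENCY_BUDGET_SECONDS[workload]

    print(breaches_budget("http_request", 10.0))  # True  -> worth paging someone
    print(breaches_budget("batch_job", 10.0))     # False -> nothing to worry about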

By focusing on utilisation, saturation, errors, and latency, you can effectively monitor and manage your Kubernetes infrastructure, ensuring optimal performance and reliability.