An Overview of Observability

TL;DR

  • Observability is the ability to understand what is going on in the inner workings of a system just by observing it from the outside.
  • Your software should explain itself and what is doing!
  • Pillars of observability are logs, metrics, traces, and events.
  • Logs are structured logging or non-structured textual data.
    • Used for auditing and debugging purposes.
    • Very expensive at scale.
    • Cannot be used for real-time computational purposes.
    • Hard to track across different and distributed processes.
    • You need know what to look for ahead of the time (know unknowns vs. unknown unknowns).
  • Metrics are time-series data (regular) with low cardinality.
    • Aggregated by time.
    • Used for real-time monitoring purposes.
    • Can take the distribution of data into account.
    • Enable service-level indicators (SLIs) and service-level objectives (SLOs).
    • CANNOT be broken down by high-cardinality dimensions (unique ids such user ids).
  • Traces are used for debugging and tracking requests across different processes and services.
    • Can be used for identifying performance bottlenecks.
    • Need to be sampled due to their very data-heavy nature.
    • Not optimized for aggregation.
    • Cannot precisely know about the distribution of data (detecting outliers).
  • Events are time-series (irregular) data.
    • Occur in temporal order, but the interval between occurrences are inconsistent and sporadic.
    • Used for reporting and alerting on important or critical events such as errors, crashes, etc.
  • Logs, metrics, and traces each prematurely optimize one thing and comprise another thing based on a premise upfront.
  • You do NOT want:
    • Writing duplicate data into three different places.
    • Copy-pasting IDs from tool to tool trying to track down a single problem!
    • Paying for three (four) different services doing almost the same thing!
  • You want:
    • One source of truth for your observability data.
    • Looking at high-level dashboards, spot anomalies, and zoom in to get detailed information as needed.
  • You are either throwing away data at ingestion time by aggregating or you are throwing away data after that by sampling.

Presentation