Charity Majors on Observability and Quality of Microservices

TL;DR

Observability refers to three different things: logs, metrics, and traces.
The problem with logs is that you have to know what to search for before you know what the problem is!
The problem with metrics is that they are aggregated by time, so you cannot break them down by high-cardinality dimensions (such as user ID).
Logs, metrics, traces, and events each prematurely optimize for one thing and compromise on another, based on a premise decided upfront.
You don’t want to write your observability data to many different places and copy-paste IDs from tool to tool trying to track down a single problem!
You want one source of truth and you want to be able to go from very high-level dashboards to very low-level data.
According to the control theory definition, observability is the ability to understand what is going on in the inner workings of a system just by observing it from the outside.
Libraries that you build into your code should give you insights from the inside out (the software should explain itself).
The total cost of observability should be 10 to 30 percent of the infrastructure cost.
You are either throwing away data at ingestion time by aggregating or you are throwing away data after that by sampling.
Observability can be incredibly cost-effective by using intelligent sampling.
Software engineers should write operable services and run them themselves!
Software engineers need to be on-call for their own systems. This supports them in building observable and scalable systems.
Every single alert you get should be actionable. Every time you get paged, your reaction should be "this is new, I don’t understand this" (and not "oh, that again")!
Ops should stop being gatekeepers and blocking people. They have to stop building a castle and start building a playground!
Every developer should be looking at prod every day. They should know what is normal, how to debug it, and how to get to a known state!
If management is not carving out enough project development time to get things fixed, no on-call situation will ever work!
SLOs (service-level objectives) define the quality of service that we agree to provide for users.
As long as you hit the SLO line, anything you do in engineering is fine! Everyone gets what they need, nobody feels micromanaged, and nobody feels completely abandoned!
SLOs help with defining how much time is enough for improving things!
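
To make the sampling point above concrete, here is a minimal sketch of one way "intelligent sampling" could work. It is not taken from the interview: the policy (keep every server error, keep roughly 1 in 20 of everything else), the event fields, and the function names are all assumptions chosen for illustration. The idea it shows is that each kept event records its sample rate, so counts can still be estimated after most of the boring traffic has been thrown away.

```python
import random


def sample_rate_for(event):
    """Hypothetical policy: keep every server error, ~1 in 20 of the rest."""
    return 1 if event["status"] >= 500 else 20


def sample(events, rng):
    """Keep each event with probability 1/rate and record the rate on it."""
    kept = []
    for event in events:
        rate = sample_rate_for(event)
        if rng.random() < 1.0 / rate:
            kept.append({**event, "sample_rate": rate})
    return kept


def estimated_count(kept, predicate):
    """Estimate the original count: each kept event stands for `sample_rate` events."""
    return sum(e["sample_rate"] for e in kept if predicate(e))


if __name__ == "__main__":
    rng = random.Random(42)
    # Illustrative traffic: 950 successful requests and 50 server errors.
    events = [{"user_id": i, "status": 200} for i in range(950)]
    events += [{"user_id": i, "status": 500} for i in range(50)]
    kept = sample(events, rng)
    print("stored events:", len(kept))
    print("estimated errors:", estimated_count(kept, lambda e: e["status"] >= 500))
    print("estimated total:", estimated_count(kept, lambda e: True))
```

Because errors are kept at a rate of 1, the error count is exact, while the total is only an estimate. The high-cardinality fields (here `user_id`) survive on every stored event, which is what aggregation at ingestion time would have discarded.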
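
The SLO points above can also be expressed as simple arithmetic. The sketch below uses the common "error budget" framing, which is an assumption on my part and not a term from the interview: the function name, the 99.9% target, and the request counts are hypothetical. It shows how an SLO turns "how much is enough?" into a number of failures you are allowed before reliability work has to take priority.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return the allowed failures, what is left of them, and whether the SLO holds.

    slo_target is the fraction of requests that must succeed, e.g. 0.999.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining": allowed_failures - failed_requests,
        "slo_met": failed_requests <= allowed_failures,
    }


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests against a 99.9% success target.
    report = error_budget(0.999, 1_000_000, 400)
    print(report)
```

With a 99.9% target over one million requests, roughly 1,000 failures are allowed. At 400 failures the SLO is met and about 600 remain in the budget, so there is room to ship; at 1,500 failures the budget is spent and the time should go to fixing things.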

WATCH HERE