Monitoring is reactive — you set thresholds and alert when they're breached. Observability is proactive — you instrument your system so you can answer arbitrary questions about its behaviour without deploying new code. Here's the difference in practice.
The three pillars
Metrics are numerical measurements over time — request rate, error rate, latency percentiles, resource utilisation. They're cheap to store and fast to query, but they tell you something is wrong, not why. Use Prometheus for collection and Grafana for visualisation. The four golden signals (latency, traffic, errors, saturation) should be on your primary dashboard.
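As a rough illustration of what the four golden signals measure, the sketch below derives them from a batch of raw request records using only the Python standard library. The Request type, the golden_signals function, and the CPU figures used for saturation are hypothetical names chosen for this example. In a real Prometheus setup, percentiles are usually estimated from histogram buckets rather than from raw samples.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code


def golden_signals(requests, window_s, cpu_used, cpu_capacity):
    """Summarise one observation window (needs at least two requests)."""
    durations = [r.duration_ms for r in requests]
    server_errors = sum(1 for r in requests if r.status >= 500)
    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 98 is p99.
    cuts = quantiles(durations, n=100, method="inclusive")
    return {
        "traffic_rps": len(requests) / window_s,
        "error_rate": server_errors / len(requests),
        "latency_p50_ms": cuts[49],
        "latency_p99_ms": cuts[98],
        # Saturation here is CPU only; memory, queues, or connections work too.
        "saturation": cpu_used / cpu_capacity,
    }
```

Reporting p50 alongside p99 is a deliberate choice: averages hide the slow tail that users actually notice.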
Logs are timestamped events. They answer "what happened?" Structured JSON logs are dramatically more useful than text strings because they can be filtered and aggregated by field. Every log entry should have: timestamp, level, service name, trace ID, and structured context fields. Ship them to a log aggregation system (Loki, CloudWatch, Datadog). Never log sensitive data such as passwords, tokens, or personal information.
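A minimal sketch of a structured logger with those fields, assuming Python's standard logging module. JsonFormatter, make_logger, and the "context" key passed through extra are names invented for this example, not part of any library.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,
            # trace_id and context arrive via the `extra` argument, if given.
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


def make_logger(service, stream=None):
    """Build a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = make_logger("checkout")
    log.info(
        "payment authorised",
        extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
               "context": {"order_id": "o-123", "amount_cents": 4200}},
    )
```

Keeping context as separate fields rather than interpolating it into the message is what makes the entries queryable later.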
Traces are the missing piece. A trace follows a single request through every service it touches, which makes it the most direct way to understand distributed system behaviour. OpenTelemetry (OTel) has become the standard for instrumentation. Instrument your services with the OTel SDK and send the data to a backend like Jaeger, Tempo, or Honeycomb.
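To make "follows a single request through every service" concrete, here is a sketch of how trace context can travel between services in a W3C traceparent header. It is a standard-library illustration of the idea, not the OTel SDK, which handles this propagation for you. SpanContext and the helper functions are hypothetical names.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanContext:
    trace_id: str  # 32 hex chars, shared by every span in the trace
    span_id: str   # 16 hex chars, unique to one span
    sampled: bool


def new_trace(sampled=True):
    """Start a trace at the edge of the system."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8), sampled)


def child_of(parent):
    """Create a span that belongs to the same trace as `parent`."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


def to_traceparent(ctx):
    """Serialise as a version-00 traceparent header value."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def from_traceparent(header):
    """Parse a version-00 traceparent header value."""
    version, trace_id, span_id, flags = header.split("-")
    if version != "00" or len(trace_id) != 32 or len(span_id) != 16:
        raise ValueError(f"unsupported traceparent: {header!r}")
    return SpanContext(trace_id, span_id, int(flags, 16) & 1 == 1)


if __name__ == "__main__":
    # Service A starts the trace and sends the header with its outgoing call.
    root = new_trace()
    header = to_traceparent(root)
    # Service B parses the header and continues the same trace.
    span_b = child_of(from_traceparent(header))
    print(header, span_b.trace_id == root.trace_id)
```

Because every span carries the same trace ID, the backend can stitch spans from different services into one request timeline.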
OpenTelemetry first
Use OpenTelemetry for all instrumentation, not vendor SDKs. OTel gives you vendor-agnostic instrumentation that can be routed to any backend, so you can switch from Jaeger to Honeycomb without re-instrumenting your application. It can also add trace IDs to your logs automatically, which lets you correlate logs and traces by trace ID and, where your backend supports exemplars, jump from a metric to a representative trace.
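The mechanism behind automatic trace IDs in logs can be sketched with the standard library alone: keep the active trace ID in a context variable and let a logging filter stamp it onto every record. OTel's logging integrations do this kind of injection for you; the names below (current_trace_id, TraceIdFilter, configure, handle_request) are assumptions made for this example.

```python
import contextvars
import logging

# Holds the trace ID of whatever request the current task is serving.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every record passing through."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True  # never drop the record


def configure(logger, stream=None):
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, trace_id):
    """Application code logs normally; the trace ID is added for it."""
    token = current_trace_id.set(trace_id)
    try:
        logger.info("charging card")
    finally:
        current_trace_id.reset(token)


if __name__ == "__main__":
    log = configure(logging.getLogger("payments"))
    handle_request(log, "4bf92f3577b34da6a3ce929d0e0e4736")
```

A context variable is used rather than a global so that concurrent requests in threads or asyncio tasks each see their own trace ID.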
SLOs, not just alerts
A Service Level Objective (SLO) is a reliability target over a time window, for example 99.9% of requests succeeding over 30 days. The error budget is what the target leaves over: the acceptable level of unreliability, here 0.1% of requests. Instead of alerting when error rate > 1%, alert when you're burning your error budget too fast to meet your SLO. This reduces alert fatigue and focuses attention on user-impacting issues. Use Sloth or the Grafana SLO feature to define and track SLOs.
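The arithmetic behind burn-rate alerting is short enough to sketch. Burn rate is the observed error rate divided by the error budget: a value of 1 spends the budget exactly over the SLO window, anything higher spends it early. The function names are hypothetical, and the 14.4 threshold is an assumption taken from a commonly cited multi-window scheme (it corresponds to spending 2% of a 30-day budget in one hour); tune it to your own SLO.

```python
def error_budget(slo_target):
    """Fraction of requests allowed to fail, e.g. 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(errors, total, slo_target):
    """How many times faster than 'exactly on budget' errors are occurring."""
    if total == 0:
        return 0.0
    return (errors / total) / error_budget(slo_target)


def should_page(long_window, short_window, slo_target, threshold=14.4):
    """Page only if both windows burn fast.

    Each window is an (errors, total) pair. The long window (say 1 hour)
    shows the burn is significant; the short window (say 5 minutes) shows
    it is still happening, so the alert clears soon after recovery.
    """
    return (
        burn_rate(*long_window, slo_target) >= threshold
        and burn_rate(*short_window, slo_target) >= threshold
    )


if __name__ == "__main__":
    # 99.9% SLO, 2% of requests failing in both windows: burn rate near 20.
    print(should_page((200, 10_000), (20, 1_000), slo_target=0.999))
```

Note that a 2% error rate pages under a 99.9% SLO but would be only a burn rate of 2 under a 99% SLO, which is why the alert is expressed against the budget rather than as a fixed error-rate threshold.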
The golden path
When an incident occurs, the workflow should be: alert fires → look at dashboard to identify affected service → look at traces for that service during the incident window → identify the slow or failing span → look at logs for that span → find the root cause. If you can't follow this path, you have an observability gap.
Cost management
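Observability data is expensive at scale. Sample traces: head-based sampling decides when a trace starts, tail-based sampling decides after it completes and can therefore keep the errors and slow requests. Set log retention policies, for example 30 days in hot storage and 1 year in cold storage. Use metric cardinality limits: every unique combination of label values is a separate time series in Prometheus, so high-cardinality labels (like user IDs) can multiply series counts until memory use and query times become unmanageable.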
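The two sampling strategies can be sketched as follows. This is an illustration of the decision logic only, not how any particular collector implements it. The function names, the span dictionaries with "duration_ms" and "error" keys, and the default thresholds are all assumptions for the example.

```python
import hashlib


def head_sample(trace_id, rate):
    """Decide up front, from the trace ID alone.

    Hashing the ID makes the decision deterministic, so every service
    that sees the same trace ID reaches the same verdict.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(rate * 2**64)


def tail_sample(trace_id, spans, slow_ms=1000.0, baseline_rate=0.01):
    """Decide after the trace has finished, using what actually happened."""
    if any(span["error"] for span in spans):
        return True  # always keep failures
    if max(span["duration_ms"] for span in spans) >= slow_ms:
        return True  # always keep slow requests
    # Keep a small random share of healthy traffic as a baseline.
    return head_sample(trace_id, baseline_rate)


if __name__ == "__main__":
    spans = [
        {"duration_ms": 12.0, "error": False},
        {"duration_ms": 1840.0, "error": False},
    ]
    print(tail_sample("4bf92f3577b34da6a3ce929d0e0e4736", spans))
```

Head sampling is cheap because unsampled spans are never recorded. Tail sampling keeps the interesting traces but requires buffering every span until the trace completes.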