2024-11-03
Telemetry budgets: when to shed spans instead of pride
By Ara Lim
Tags: Observability · Cost · SRE
OpenTelemetry makes instrumentation easy—sometimes too easy. Teams emit generous spans, then wonder why storage invoices spike after launch. We teach a weekly ritual: rank services by span volume, identify redundant attributes, and shed duplicates at the edge.
The ritual starts with a simple histogram of span names. Any span name that accounts for more than ten percent of total span volume without adding unique diagnostic value becomes a candidate for merge or drop. We keep exemplars for slow requests rather than full traces for every health check.
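As a rough illustration of the histogram step, here is a minimal Python sketch. It assumes span names have already been exported as a flat list of strings; the function names `span_name_shares` and `shed_candidates` are hypothetical, and the sample span names are made up. It only flags names above the ten percent threshold. Deciding whether a span has unique diagnostic value is still a human call.

```python
from collections import Counter


def span_name_shares(span_names):
    """Return each span name's share of total span volume (0.0 to 1.0)."""
    counts = Counter(span_names)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.items()}


def shed_candidates(span_names, threshold=0.10):
    """List (name, share) pairs strictly above the threshold, largest first.

    The 0.10 default mirrors the ten percent rule of thumb; everything
    returned here is only a candidate for review, not an automatic drop.
    """
    shares = span_name_shares(span_names)
    flagged = [(name, share) for name, share in shares.items() if share > threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Illustrative sample: health checks dominate the volume.
    sample = ["GET /healthz"] * 6 + ["POST /checkout"] * 3 + ["db.query"] * 1
    for name, share in shed_candidates(sample):
        print(f"{name}: {share:.0%} of spans")
```

In this sample the health check and checkout spans are flagged, while `db.query` sits exactly at ten percent and is not, because the comparison is strict.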
Second, we align shedding decisions with SLO reviews. If a service misses its budget, we treat shedding as a temporary measure that lasts only while the root causes are fixed. We document an expiry date on each shed rule so it does not become permanent darkness.
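One way to keep an expiry date attached to a shed rule is to store it in the rule itself and check it every time the rule is applied. The sketch below assumes a hypothetical `ShedRule` record and helper functions. It is not tied to any particular collector or sampling configuration, and the service and span names are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ShedRule:
    """A hypothetical shed rule that cannot exist without an expiry date."""
    service: str
    span_name: str
    reason: str
    expires_on: date  # last day the rule is allowed to drop spans

    def is_active(self, today: date) -> bool:
        return today <= self.expires_on


def should_drop(service, span_name, rules, today):
    """True if an unexpired rule matches this service and span name."""
    return any(
        rule.service == service
        and rule.span_name == span_name
        and rule.is_active(today)
        for rule in rules
    )


def expired_rules(rules, today):
    """Rules past their expiry date, to bring up at the next SLO review."""
    return [rule for rule in rules if not rule.is_active(today)]


if __name__ == "__main__":
    rules = [
        ShedRule("checkout", "GET /healthz", "over budget", date(2024, 11, 30)),
    ]
    print(should_drop("checkout", "GET /healthz", rules, date(2024, 11, 15)))
    print(should_drop("checkout", "GET /healthz", rules, date(2024, 12, 1)))
```

Because an expired rule simply stops matching, full-fidelity spans return by default, and `expired_rules` gives the review a list of rules to renew deliberately or delete.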
Third, we coach engineers to communicate shedding to support teams. Support should know which diagnostics temporarily lose granularity. A one-page addendum to the on-call guide prevents mystified escalations during incidents.