Technical Specification: Histogram and Summary Metrics in Foundation.Telemetry
Document Version: 1.0
Author: AI Assistant
Status: PROPOSED
1. Overview
This document specifies an enhancement to the Foundation.Services.TelemetryService to add native support for two powerful metric types: Histograms and Summaries. The current TelemetryService primarily supports counters and gauges, which are excellent for tracking event occurrences and point-in-time values. However, to understand the performance characteristics and distributions of critical operations, more advanced statistical metrics are required.
This enhancement is driven by the needs of the DSPEx project, where understanding the distribution of LLM latencies and evaluation scores is often more important than tracking the average. For example, a target such as "p99 latency for an LLM call stays under 5 seconds" is a typical SLO (Service Level Objective), and it cannot be monitored with a simple gauge or counter.
This proposal introduces Telemetry.emit_histogram/3 and Telemetry.emit_summary/3 to the public API and details the necessary internal changes to the TelemetryService to aggregate and expose this statistical data.
2. Problem Statement & Use Case
A DSPEx.Evaluate run completes on a dev set of 1,000 examples. The final average score is 0.85. This single number hides a crucial story:
- Did 85% of the examples score 1.0 and 15% score 0.0 (a bimodal, spiky distribution)?
- Or did most examples score between 0.8 and 0.9 (a tight, normal distribution)?
Answering this requires understanding the distribution of the scores. Similarly, when monitoring an LLM client:
- An average latency of 800ms might seem acceptable.
- However, if the p99 latency is 7,000ms, 1% of requests take 7 seconds or longer, which indicates a significant performance problem for the users who hit them.
The current TelemetryService cannot capture this distributional data. This enhancement will add that capability.
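To make the evaluation scenario concrete, the following sketch builds two hypothetical score lists that share the 0.85 average but have very different shapes. The module name ScoreDistributionExample and the sample data are invented for illustration and are not part of DSPEx or Foundation.

```elixir
defmodule ScoreDistributionExample do
  @moduledoc "Hypothetical helper: compares two score lists with the same mean."

  # Arithmetic mean, rounded to two decimals to hide float accumulation noise.
  def mean(scores), do: Float.round(Enum.sum(scores) / length(scores), 2)

  # Count how many times each distinct score occurs.
  def shape(scores), do: Enum.frequencies(scores)
end

# 85% perfect scores and 15% zeros (bimodal).
bimodal = List.duplicate(1.0, 850) ++ List.duplicate(0.0, 150)
# Half the examples at 0.8 and half at 0.9 (tight).
tight = List.duplicate(0.8, 500) ++ List.duplicate(0.9, 500)

IO.inspect(ScoreDistributionExample.mean(bimodal), label: "bimodal mean")
IO.inspect(ScoreDistributionExample.mean(tight), label: "tight mean")
IO.inspect(ScoreDistributionExample.shape(bimodal), label: "bimodal shape")
IO.inspect(ScoreDistributionExample.shape(tight), label: "tight shape")
```

Both means come out as 0.85, while the frequency maps differ completely. A counter or gauge can only report the first of these two views.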
3. Proposed API Changes
The public API in Foundation.Telemetry will be extended with two new functions.
3.1. emit_histogram(event_name, value, metadata)
This function is used to record observations that will be aggregated into histogram buckets. Histograms are ideal for understanding the distribution of values, such as request latencies.
New Signature:
# in foundation/telemetry.ex
@spec emit_histogram(event_name :: [atom()], value :: number(), metadata :: map()) :: :ok
- event_name: The hierarchical name of the metric.
- value: The numerical observation to record (e.g., a response time in milliseconds).
- metadata: Additional tags for filtering/grouping (e.g., %{provider: :openai, model: "gpt-4o"}).
Example Usage (from DSPEx):
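# Inside the LM client, after a successful API call.
# Assumes `alias Foundation.Telemetry` and that `started_at` was captured with
# System.monotonic_time(:millisecond) just before the request was sent.
duration_ms = System.monotonic_time(:millisecond) - started_at

Telemetry.emit_histogram(
  [:dspex, :lm, :request_latency_ms],
  duration_ms,
  %{provider: :openai, model: "gpt-4o-mini"}
)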
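One way to avoid repeating the timing boilerplate at every call site is a small wrapper that measures a function and reports its duration. This is only a sketch: LatencyExample.timed/4 is a hypothetical helper, and the emit function is passed in as an argument so the snippet runs without Foundation. In a real client it would presumably be the proposed Telemetry.emit_histogram/3.

```elixir
defmodule LatencyExample do
  @moduledoc "Hypothetical wrapper: times a function and emits the duration."

  # `emit` is any 3-arity function (event_name, value, metadata) returning :ok.
  def timed(event_name, metadata, emit, fun) when is_function(emit, 3) and is_function(fun, 0) do
    started_at = System.monotonic_time(:millisecond)
    result = fun.()
    duration_ms = System.monotonic_time(:millisecond) - started_at

    :ok = emit.(event_name, duration_ms, metadata)
    result
  end
end

# Stand-in for the proposed emit_histogram/3: prints the observation.
emit = fn event_name, value, metadata ->
  IO.inspect({event_name, value, metadata}, label: "histogram observation")
  :ok
end

LatencyExample.timed([:dspex, :lm, :request_latency_ms], %{provider: :openai}, emit, fn ->
  Process.sleep(20)
  :response
end)
```

The wrapper returns the wrapped function's result unchanged, so it can be dropped around an existing call without altering the caller's control flow.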
3.2. emit_summary(event_name, value, metadata) (Alternative/Optional)
Summaries are similar to histograms but are calculated on the client side and are more focused on quantiles. While histograms are generally more flexible on the backend, a summary can be useful for pre-aggregated quantile data. We propose adding it for completeness.
New Signature:
# in foundation/telemetry.ex
@spec emit_summary(event_name :: [atom()], value :: number(), metadata :: map()) :: :ok
Note: The internal implementation will likely be very similar to emit_histogram, but the aggregation logic might differ, focusing on calculating and storing specific quantiles (e.g., p50, p90, p99). For simplicity, we will focus the rest of this document on the histogram implementation, as it is the more versatile of the two.
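For comparison, a summary-style aggregation could compute quantiles directly from the raw observations rather than from buckets. The sketch below uses the nearest-rank method over an in-memory list. SummaryExample and its functions are hypothetical, and a real TelemetryService would need a bounded window or a streaming estimator instead of keeping every sample.

```elixir
defmodule SummaryExample do
  @moduledoc "Hypothetical sketch of client-side quantiles over raw samples."

  # Nearest-rank quantile: the value at 1-based position ceil(q * n)
  # in the sorted samples.
  def quantile([_ | _] = samples, q) when is_number(q) and q > 0 and q <= 1 do
    sorted = Enum.sort(samples)
    rank = ceil(q * length(sorted))
    Enum.at(sorted, rank - 1)
  end

  # Pre-aggregated view in the spirit of a summary metric.
  def summarize(samples) do
    %{
      count: length(samples),
      sum: Enum.sum(samples),
      p50: quantile(samples, 0.50),
      p90: quantile(samples, 0.90),
      p99: quantile(samples, 0.99)
    }
  end
end

IO.inspect(SummaryExample.summarize(Enum.to_list(1..10)))
```

Unlike the bucketed histogram, this returns actual observed values, at the cost of storing and sorting the samples.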
4. Internal Implementation Details
This enhancement requires significant changes to the TelemetryService GenServer and its internal state.
4.1. TelemetryService State Enhancement
The metrics map within the server_state will now need to differentiate between metric types.
New server_state Structure:
# in foundation/services/telemetry_service.ex
@type server_state :: %{
  metrics: %{
    # Example structure
    [:dspex, :lm, :requests_total] => %{type: :counter, value: 1234, ...},
    [:dspex, :system, :active_workers] => %{type: :gauge, value: 10, ...},
    [:dspex, :lm, :request_latency_ms] => %{
      type: :histogram,
      buckets: %{100 => 5, 250 => 20, 500 => 50, 1000 => 15, :infinity => 2},
      sum: 123450.0,
      count: 92,
      ...
    }
  },
  # ... other state fields
}
- A :type key is added to each metric entry.
- For :histogram, the value is a map containing:
  - buckets: A map where keys are the upper bound of a bucket (e.g., 100 for 0-100ms) and values are the count of observations in that bucket. An :infinity key catches all values above the highest defined bucket.
  - sum: The sum of all observed values.
  - count: The total number of observations.
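A useful property of this layout is that the bucket counts always add up to count, because each observation lands in exactly one bucket. The check below is a hypothetical sanity helper (HistogramShape.consistent?/1 is not part of Foundation), applied to the example values from the state structure.

```elixir
defmodule HistogramShape do
  @moduledoc "Hypothetical invariant check for the proposed histogram entry."

  # True when the per-bucket counts sum to the total observation count.
  def consistent?(%{type: :histogram, buckets: buckets, count: count}) do
    buckets |> Map.values() |> Enum.sum() == count
  end
end

example = %{
  type: :histogram,
  buckets: %{100 => 5, 250 => 20, 500 => 50, 1000 => 15, :infinity => 2},
  sum: 123_450.0,
  count: 92
}

IO.inspect(HistogramShape.consistent?(example), label: "buckets sum to count")
```

A check like this could be useful in tests of the aggregation code, since a mismatch would point to an observation being dropped or double-counted.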
4.2. New handle_cast Clause for Histograms
The TelemetryService needs to handle the new event type.
# in foundation/services/telemetry_service.ex

# Modify the record_metric function to handle different types
defp record_metric(event_name, %{histogram: value} = measurements, metadata, state) do
  # This is a histogram observation
  update_histogram(event_name, value, measurements, metadata, state)
end

# ... other record_metric clauses for :counter, :gauge ...

defp update_histogram(event_name, value, _measurements, metadata, %{metrics: metrics} = state) do
  # Define default histogram buckets or get them from config
  default_buckets = [100, 250, 500, 1000, 2500, 5000, 10000]

  # Find the first bucket whose upper bound covers the value; values above
  # the highest bound fall into the :infinity bucket
  bucket_key = Enum.find(default_buckets, :infinity, fn upper_bound -> value <= upper_bound end)

  # Get the current histogram or create a new one
  current_histogram =
    Map.get(metrics, event_name, %{
      type: :histogram,
      buckets: %{},
      sum: 0.0,
      count: 0,
      timestamp: System.monotonic_time(),
      metadata: metadata
    })

  # Update the histogram's state
  new_histogram =
    current_histogram
    |> Map.update!(:buckets, &Map.update(&1, bucket_key, 1, fn count -> count + 1 end))
    |> Map.update!(:sum, &(&1 + value))
    |> Map.update!(:count, &(&1 + 1))
    |> Map.put(:timestamp, System.monotonic_time())

  %{state | metrics: Map.put(metrics, event_name, new_histogram)}
end
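The update logic can be exercised outside the GenServer to see how observations are assigned to buckets. The following is a standalone sketch that mirrors update_histogram under simplifying assumptions: HistogramSketch is a hypothetical module, it drops the timestamp and metadata fields, and it takes the bucket bounds as an argument.

```elixir
defmodule HistogramSketch do
  @moduledoc "Hypothetical standalone version of the bucket update logic."

  @default_buckets [100, 250, 500, 1000, 2500, 5000, 10000]

  def new, do: %{type: :histogram, buckets: %{}, sum: 0.0, count: 0}

  # Record one observation; `bounds` must be sorted in ascending order.
  def observe(hist, value, bounds \\ @default_buckets) when is_number(value) do
    # First bound that covers the value, or :infinity for overflow.
    key = Enum.find(bounds, :infinity, fn upper_bound -> value <= upper_bound end)

    hist
    |> Map.update!(:buckets, &Map.update(&1, key, 1, fn n -> n + 1 end))
    |> Map.update!(:sum, &(&1 + value))
    |> Map.update!(:count, &(&1 + 1))
  end
end

hist =
  Enum.reduce([42, 100, 101, 900, 20_000], HistogramSketch.new(), fn value, acc ->
    HistogramSketch.observe(acc, value)
  end)

IO.inspect(hist.buckets, label: "buckets")
IO.inspect({hist.count, hist.sum}, label: "count and sum")
```

Note that 100 lands in the 100 bucket because the comparison is inclusive, while 20,000 exceeds every bound and is counted under :infinity.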
4.3. Exposing Aggregated Statistics via get_metrics
The get_metrics function should not return the raw bucket data. Instead, it should calculate and return meaningful statistics.
# in foundation/services/telemetry_service.ex
def handle_call(:get_metrics, _from, %{metrics: metrics} = state) do
  # Transform raw metrics into consumable statistics
  stats =
    metrics
    |> Enum.map(fn {event_name, metric_data} ->
      case metric_data.type do
        :histogram ->
          {event_name, calculate_histogram_stats(metric_data)}

        # ... other types
        _ ->
          {event_name, metric_data}
      end
    end)
    |> transform_to_nested_structure() # Your existing function

  {:reply, {:ok, stats}, state}
end
# Elixir has no early return, so the empty case gets its own clause.
# This avoids dividing by zero below.
defp calculate_histogram_stats(%{count: 0}), do: return_empty_stats()

defp calculate_histogram_stats(histogram_data) do
  # This is a complex function. A real implementation might use a library
  # like `:statistics` or a dedicated HDR Histogram implementation.
  # For this spec, we'll outline the logic.

  # Extract data for calculation
  buckets = histogram_data.buckets
  total_count = histogram_data.count

  # 1. Calculate Average
  avg = histogram_data.sum / total_count

  # 2. Calculate Percentiles (p50, p90, p99)
  # This requires iterating through the buckets in ascending order. Numbers
  # sort before atoms in Erlang term order, so :infinity comes last.
  sorted_buckets = Enum.sort(buckets)
  p50_rank = total_count * 0.50
  p90_rank = total_count * 0.90
  p99_rank = total_count * 0.99
  p50 = find_percentile_value(sorted_buckets, p50_rank)
  p90 = find_percentile_value(sorted_buckets, p90_rank)
  p99 = find_percentile_value(sorted_buckets, p99_rank)

  %{
    type: :histogram,
    count: total_count,
    sum: histogram_data.sum,
    avg: avg,
    p50: p50,
    p90: p90,
    p99: p99
    # raw_buckets: buckets # Optionally include for debugging
  }
end
# Helper to find percentile value from bucketed data.
# The result is the upper bound of the bucket that contains the rank, so it is
# an upper estimate and may be :infinity when the rank falls in the overflow bucket.
defp find_percentile_value(sorted_buckets, rank) do
  Enum.reduce_while(sorted_buckets, 0, fn {upper_bound, count}, cumulative_count ->
    new_cumulative = cumulative_count + count

    if new_cumulative >= rank do
      # The rank falls within this bucket
      {:halt, upper_bound}
    else
      {:cont, new_cumulative}
    end
  end)
end
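Applying this logic to the example histogram from section 4.1 shows both what the estimate looks like and where it breaks down. The module below is a self-contained sketch (PercentileSketch is a hypothetical name) that repeats the bucket walk so it can run on its own.

```elixir
defmodule PercentileSketch do
  @moduledoc "Hypothetical standalone version of the percentile bucket walk."

  # Returns the upper bound of the bucket that contains the q-th quantile.
  def percentile(%{buckets: buckets, count: count}, q) when count > 0 do
    rank = count * q

    buckets
    # Numeric bounds sort ascending; the :infinity atom sorts after all numbers.
    |> Enum.sort()
    |> Enum.reduce_while(0, fn {upper_bound, n}, seen ->
      if seen + n >= rank, do: {:halt, upper_bound}, else: {:cont, seen + n}
    end)
  end
end

hist = %{
  buckets: %{100 => 5, 250 => 20, 500 => 50, 1000 => 15, :infinity => 2},
  count: 92
}

IO.inspect(PercentileSketch.percentile(hist, 0.50), label: "p50") # 500
IO.inspect(PercentileSketch.percentile(hist, 0.90), label: "p90") # 1000
IO.inspect(PercentileSketch.percentile(hist, 0.99), label: "p99") # :infinity
```

With 92 observations the p99 rank is 91.08, which falls among the two overflow observations, so the result is :infinity rather than a number. This suggests the highest bucket bound should sit above the latencies the SLO cares about, or that callers should treat :infinity as "above the largest configured bound".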
5. Configuration
The default histogram buckets should be configurable via the foundation application environment.
Example config.exs:
config :foundation, :telemetry,
  default_histogram_buckets: [50, 100, 250, 500, 1000, 5000]
The TelemetryService will read this configuration during init/1 to use for all histogram metrics.
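Reading the setting could look roughly like the sketch below. BucketConfig and its fallback list are hypothetical, and the sketch assumes the :telemetry entry is a keyword list as in the example config. The only calls relied on are Application.get_env/3 and Keyword.get/3 from the standard library. Sorting and de-duplicating the bounds is an added safeguard, since the bucket lookup assumes ascending order.

```elixir
defmodule BucketConfig do
  @moduledoc "Hypothetical reader for the default histogram bucket setting."

  # Used when no buckets are configured (assumed defaults for this sketch).
  @fallback_buckets [100, 250, 500, 1000, 2500, 5000, 10000]

  # Intended to be called once, e.g. from TelemetryService.init/1.
  def histogram_buckets do
    :foundation
    |> Application.get_env(:telemetry, [])
    |> Keyword.get(:default_histogram_buckets, @fallback_buckets)
    |> Enum.sort()
    |> Enum.dedup()
  end
end

# Simulate what config.exs would set, then read it back.
Application.put_env(:foundation, :telemetry, default_histogram_buckets: [500, 50, 100, 100])
IO.inspect(BucketConfig.histogram_buckets(), label: "configured buckets")
```

Storing the resolved list in the server state at startup would let update_histogram use it instead of a hard-coded default.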
6. Conclusion
Adding native support for histogram and summary metrics to the Foundation.Telemetry service is a critical enhancement for building production-grade AI systems. It provides the necessary observability to move beyond simple averages and understand the true performance distribution of key operations.
For DSPEx, this means:
- SLO Monitoring: We can now set and monitor SLOs on p99 latency for LLM calls.
- Performance Debugging: We can easily identify if a small number of outliers are skewing the average latency.
- Optimizer Analysis: We can analyze the full distribution of scores produced by an optimizer, not just the single best score, leading to more robust and reliable optimization.
This feature significantly elevates the operational maturity of any system built on foundation, making it an essential addition to the library.