# Foundation Telemetry Sampling Guide
This guide explains how to use telemetry sampling to maintain observability in high-throughput systems without overwhelming your monitoring infrastructure.
## Overview
Telemetry sampling allows you to:
- Reduce telemetry overhead in high-volume scenarios
- Maintain statistical accuracy while processing fewer events
- Dynamically adjust sampling based on system load
- Apply different strategies for different event types
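
For a first look at the flow, here is a minimal end-to-end sketch using the Sampler APIs demonstrated throughout this guide (the event name and measurements are placeholders):

```elixir
# Start the sampler (normally under your supervision tree)
{:ok, _pid} = Foundation.Telemetry.Sampler.start_link()

# Configure one event to be sampled at 25%
Foundation.Telemetry.Sampler.configure_event(
  [:my_app, :demo, :event],
  strategy: :random,
  rate: 0.25
)

# Emit through the sampler; roughly one in four calls reaches :telemetry
Foundation.Telemetry.Sampler.execute(
  [:my_app, :demo, :event],
  %{duration_ms: 12},
  %{source: "overview"}
)
```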
## When to Use Sampling
Consider sampling when:
- Your system processes thousands of events per second
- Telemetry overhead becomes significant (>1% of processing time)
- Your monitoring infrastructure struggles with data volume
- You need to control monitoring costs
## Sampling Strategies

### 1. Random Sampling
Randomly samples a percentage of events.
```elixir
Foundation.Telemetry.Sampler.configure_event(
  [:my_app, :high_volume, :event],
  strategy: :random,
  rate: 0.1  # Sample 10% of events
)
```
**Use when:** You want a representative sample of all events.

**Pros:** Simple, unbiased, predictable overhead reduction.

**Cons:** May miss rare events.
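To reason about the rare-event trade-off: with rate `r`, the probability that all `k` occurrences of an event are dropped is `(1 - r)^k`. For example:

```elixir
# At 10% sampling, an event that occurs 20 times has roughly a 12%
# chance of never being sampled at all: (1 - 0.1)^20
miss_probability = :math.pow(1 - 0.1, 20)
# ≈ 0.12
```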
### 2. Rate Limiting
Limits events to a maximum per second.
```elixir
Foundation.Telemetry.Sampler.configure_event(
  [:my_app, :api, :request],
  strategy: :rate_limited,
  max_per_second: 1000
)
```
**Use when:** You need to cap absolute telemetry volume.

**Pros:** Predictable maximum load on monitoring systems.

**Cons:** May miss bursts of activity.
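Conceptually, a per-second rate limiter can be as simple as a counter keyed by the current second. The sketch below is illustrative, not Foundation's implementation; it assumes an ETS set table created elsewhere with `:ets.new(:limiter, [:set, :public, :named_table])`:

```elixir
defmodule RateLimitSketch do
  # Allow an event while this second's counter is under the cap.
  def allow?(event, max_per_second) do
    second = System.system_time(:second)
    key = {event, second}

    # Atomically increment, inserting {key, 0} if the key is new
    count = :ets.update_counter(:limiter, key, 1, {key, 0})
    count <= max_per_second
  end
end
```

A real implementation would also clean up keys for past seconds so the table does not grow without bound.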
### 3. Adaptive Sampling
Dynamically adjusts sampling rate based on volume.
```elixir
Foundation.Telemetry.Sampler.configure_event(
  [:my_app, :cache, :operation],
  strategy: :adaptive,
  rate: 0.5,  # Initial rate
  adaptive_config: %{
    target_rate: 100,  # Target events per second
    adjustment_interval: 5000,
    increase_factor: 1.1,
    decrease_factor: 0.9
  }
)
```
**Use when:** Load varies significantly over time.

**Pros:** Maintains consistent telemetry volume.

**Cons:** More complex, and the sampling rate changes over time.
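Conceptually, the adaptive strategy is a feedback loop: at each `adjustment_interval`, compare the observed event rate against `target_rate` and nudge the sampling rate by the configured factors. A minimal illustrative sketch (not Foundation's actual code):

```elixir
# Nudge the sampling rate toward the target volume, clamped to (0, 1].
defp adjust_rate(rate, observed_per_second, config) do
  cond do
    observed_per_second > config.target_rate ->
      max(rate * config.decrease_factor, 0.0001)

    observed_per_second < config.target_rate ->
      min(rate * config.increase_factor, 1.0)

    true ->
      rate
  end
end
```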
### 4. Reservoir Sampling
Maintains a fixed-size sample over time.
```elixir
Foundation.Telemetry.Sampler.configure_event(
  [:my_app, :user, :action],
  strategy: :reservoir,
  reservoir_size: 1000  # Maintain a sample of 1,000 events
)
```
**Use when:** You need a fixed sample size regardless of volume.

**Pros:** Guaranteed sample size, good for statistical analysis.

**Cons:** Events already in the reservoir are displaced as new ones arrive.
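For intuition, classic reservoir sampling (Algorithm R) keeps every event seen so far in the sample with equal probability. An illustrative sketch, independent of Foundation's internals:

```elixir
defmodule ReservoirSketch do
  # Fold a stream of events into a fixed-size uniform random sample.
  def sample(events, size) do
    events
    |> Enum.with_index(1)
    |> Enum.reduce([], fn {event, n}, reservoir ->
      cond do
        # Fill the reservoir with the first `size` events
        n <= size ->
          [event | reservoir]

        # Afterwards, keep event n with probability size/n,
        # replacing a uniformly chosen existing slot
        :rand.uniform(n) <= size ->
          List.replace_at(reservoir, :rand.uniform(size) - 1, event)

        true ->
          reservoir
      end
    end)
  end
end
```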
### 5. Always/Never
Simple on/off strategies.
```elixir
# Always sample errors
Foundation.Telemetry.Sampler.configure_event(
  [:my_app, :error],
  strategy: :always
)

# Never sample debug events in production
Foundation.Telemetry.Sampler.configure_event(
  [:my_app, :debug],
  strategy: :never
)
```
## Configuration

### Application Config
Configure default sampling in your config files:
```elixir
# config/config.exs
config :foundation, :telemetry_sampling,
  enabled: true,
  default_strategy: :random,
  default_rate: 1.0,  # 100% by default
  event_configs: [
    # High-volume events
    {[:my_app, :cache, :get], strategy: :random, rate: 0.01},
    {[:my_app, :http, :request], strategy: :random, rate: 0.05},

    # Rate limit background jobs
    {[:my_app, :job, :processed], strategy: :rate_limited, max_per_second: 100},

    # Always sample errors and warnings
    {[:my_app, :error], strategy: :always},
    {[:my_app, :warning], strategy: :always},

    # Adaptive sampling for variable load
    {[:my_app, :api, :graphql], strategy: :adaptive, adaptive_config: %{target_rate: 500}}
  ]

# config/prod.exs
config :foundation, :telemetry_sampling,
  enabled: true,
  default_rate: 0.1  # More aggressive sampling in production
```
### Runtime Configuration
Adjust sampling dynamically:
```elixir
# Increase sampling during debugging
Foundation.Telemetry.Sampler.configure_event(
  [:my_app, :suspicious, :activity],
  strategy: :random,
  rate: 1.0  # Temporarily sample everything
)

# Check current configuration
stats = Foundation.Telemetry.Sampler.get_stats()
IO.inspect(stats.event_stats)
```
## Using Sampled Events

### Direct Usage
```elixir
# Only emits if sampled
Foundation.Telemetry.Sampler.execute(
  [:my_app, :event],
  %{duration: 100},
  %{user_id: 123}
)

# Check whether the event should be sampled before doing expensive work
if Foundation.Telemetry.Sampler.should_sample?([:my_app, :event]) do
  # Do expensive telemetry processing
  :telemetry.execute([:my_app, :event], measurements, metadata)
end
```
### Using SampledEvents Module
```elixir
defmodule MyApp.Service do
  use Foundation.Telemetry.SampledEvents, prefix: [:my_app, :service]

  def process_request(request) do
    # Automatic sampling
    span :process_request, %{request_id: request.id} do
      # Your code here
      do_processing(request)
    end
  end

  def handle_event(event) do
    # Conditional emission
    emit_if event.important?, :important_event,
      %{timestamp: System.system_time()},
      %{event_type: event.type}

    # Deduplication: emit at most once per five minutes
    emit_once_per :duplicate_warning, :timer.minutes(5),
      %{count: event.duplicate_count},
      %{original_id: event.id}
  end
end
```
## Best Practices

### 1. Sample by Importance
```elixir
# High sampling for errors and anomalies
config :foundation, :telemetry_sampling,
  event_configs: [
    {[:my_app, :error], strategy: :always},
    {[:my_app, :warning], strategy: :random, rate: 0.5},
    {[:my_app, :info], strategy: :random, rate: 0.1},
    {[:my_app, :debug], strategy: :random, rate: 0.01}
  ]
```
### 2. Consider Event Correlation
When sampling related events, use consistent sampling decisions:
```elixir
defmodule MyApp.RequestHandler do
  def handle_request(request) do
    # Decide once for the entire request
    should_sample =
      Foundation.Telemetry.Sampler.should_sample?(
        [:my_app, :request],
        %{request_id: request.id}
      )

    if should_sample do
      emit_start_event(request)
      result = process_request(request)
      emit_stop_event(request, result)
      result
    else
      process_request(request)
    end
  end
end
```
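
If you need the same decision across processes or services, you can also derive it deterministically from a correlation key rather than calling the sampler, so every event tied to one request shares the outcome. A hypothetical helper (assuming request IDs are well distributed):

```elixir
defmodule MyApp.DeterministicSampling do
  @precision 1_000_000

  # Stable decision: the same request_id always samples the same way.
  def sample?(request_id, rate) do
    :erlang.phash2(request_id, @precision) < trunc(rate * @precision)
  end
end

# Usage: MyApp.DeterministicSampling.sample?(request.id, 0.1)
```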
### 3. Monitor Sampling Effectiveness
```elixir
defmodule MyApp.Monitoring do
  require Logger

  def check_sampling_health do
    stats = Foundation.Telemetry.Sampler.get_stats()

    for {event, event_stats} <- stats.event_stats do
      if event_stats.total > 10_000 and event_stats.sampling_rate_percent > 50 do
        Logger.warning("""
        High-volume event #{inspect(event)} has a high sampling rate.
        Consider reducing from #{event_stats.sampling_rate_percent}%.
        """)
      end
    end
  end
end
```
### 4. Test with Sampling
Ensure your tests work with sampling:
```elixir
defmodule MyApp.Test do
  use ExUnit.Case

  setup do
    # Disable sampling in tests for predictability
    original = Application.get_env(:foundation, :telemetry_sampling)
    Application.put_env(:foundation, :telemetry_sampling, enabled: false)

    on_exit(fn ->
      Application.put_env(:foundation, :telemetry_sampling, original)
    end)
  end
end
```
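
With sampling disabled as above, you can assert on emitted events deterministically. One option is the `telemetry_test` helper that ships with the `:telemetry` library (v1.1+); the event name and `MyApp.Job.run/0` below are placeholders for your own code:

```elixir
test "emits the processed event" do
  ref =
    :telemetry_test.attach_event_handlers(self(), [[:my_app, :job, :processed]])

  MyApp.Job.run()

  # telemetry_test sends {event, ref, measurements, metadata} to self()
  assert_receive {[:my_app, :job, :processed], ^ref, %{duration: _}, %{}}
end
```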
## Performance Considerations

### Overhead Measurement
```elixir
defmodule MyApp.TelemetryBenchmark do
  @iterations 100_000

  def measure_sampling_overhead do
    # Baseline: raw :telemetry.execute with no sampling
    {no_sampling_time, _} =
      :timer.tc(fn ->
        for _ <- 1..@iterations do
          :telemetry.execute([:test, :event], %{value: 1}, %{})
        end
      end)

    # With sampling at 10%
    {:ok, _} = Foundation.Telemetry.Sampler.start_link()

    Foundation.Telemetry.Sampler.configure_event(
      [:test, :event],
      strategy: :random,
      rate: 0.1
    )

    {sampling_time, _} =
      :timer.tc(fn ->
        for _ <- 1..@iterations do
          Foundation.Telemetry.Sampler.execute([:test, :event], %{value: 1}, %{})
        end
      end)

    IO.puts("No sampling: #{no_sampling_time}μs")
    IO.puts("With 10% sampling: #{sampling_time}μs")

    # Only ~10% of events reach :telemetry.execute, so subtract that
    # share of the baseline, then divide by the iteration count.
    overhead = (sampling_time - no_sampling_time * 0.1) / @iterations
    IO.puts("Sampler overhead: #{overhead}μs per event")
  end
end
```
### Memory Usage
The sampler maintains counters and rate limiters in memory:
- ~100 bytes per unique event type
- Rate limiters reset every second
- Reservoir sampling stores N events in memory
## Troubleshooting

### Events Not Being Sampled
If events are not being sampled as expected:

1. Check whether sampling is enabled:

   ```elixir
   stats = Foundation.Telemetry.Sampler.get_stats()
   IO.inspect(stats.enabled)
   ```

2. Verify the event's configuration:

   ```elixir
   Application.get_env(:foundation, :telemetry_sampling)[:event_configs]
   |> Enum.find(fn {event, _} -> event == [:my_app, :event] end)
   ```

3. Check the sampling statistics:

   ```elixir
   stats = Foundation.Telemetry.Sampler.get_stats()
   IO.inspect(stats.event_stats[[:my_app, :event]])
   ```
### High Memory Usage

If the sampler uses too much memory:

- Reduce reservoir sizes
- Limit the number of unique events
- Use more aggressive sampling rates
- Reset statistics periodically, as sketched below

```elixir
# In a GenServer: schedule the first reset (e.g. from init/1)...
Process.send_after(self(), :reset_sampler, :timer.hours(1))

# ...then clear stats and re-schedule every hour
def handle_info(:reset_sampler, state) do
  Foundation.Telemetry.Sampler.reset_stats()
  Process.send_after(self(), :reset_sampler, :timer.hours(1))
  {:noreply, state}
end
```
## Integration with Monitoring Systems

### Adjusting Metrics for Sampling
When using sampling, adjust your metrics:
```elixir
defmodule MyApp.Metrics do
  def request_rate(sampled_count, sampling_rate) do
    # Extrapolate the true rate from the observed sample
    sampled_count / sampling_rate
  end

  def percentile_latency(sampled_latencies) do
    # Percentiles remain accurate with uniform random sampling
    calculate_percentile(sampled_latencies, 0.95)
  end

  # Nearest-rank percentile over a list of numbers
  defp calculate_percentile([], _p), do: nil

  defp calculate_percentile(values, p) do
    sorted = Enum.sort(values)
    index = max(ceil(p * length(sorted)) - 1, 0)
    Enum.at(sorted, index)
  end
end
```
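
For example, 120 sampled requests per second at a 5% sampling rate extrapolates to 2,400 actual requests per second:

```elixir
iex> MyApp.Metrics.request_rate(120, 0.05)
2400.0
```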
### Grafana Dashboard Adjustments
```promql
# Adjust for a 10% sampling rate
rate(http_requests_total[5m]) * 10

# Percentiles remain accurate
histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket[5m])
)
```
## Example: Production Configuration
```elixir
# config/prod.exs
config :foundation, :telemetry_sampling,
  enabled: true,
  default_strategy: :random,
  default_rate: 0.1,
  event_configs: [
    # Critical business metrics - higher sampling
    {[:api, :payment, :processed], strategy: :random, rate: 1.0},
    {[:api, :user, :signup], strategy: :random, rate: 1.0},

    # High-volume, low-value events
    {[:cache, :hit], strategy: :random, rate: 0.001},
    {[:cache, :miss], strategy: :random, rate: 0.01},

    # Rate limit background noise
    {[:health, :check], strategy: :rate_limited, max_per_second: 10},
    {[:metrics, :collected], strategy: :rate_limited, max_per_second: 100},

    # Errors and warnings - always capture
    {[:error], strategy: :always},
    {[:warning], strategy: :random, rate: 0.5},

    # Adaptive for variable load
    {[:api, :request], strategy: :adaptive, adaptive_config: %{target_rate: 1000}},

    # Debug events - mostly disabled
    {[:debug], strategy: :random, rate: 0.0001}
  ]
```
This configuration balances observability with performance, ensuring critical events are captured while preventing telemetry overload.