โ† Back to Test old20250712

TELEMETRY

Documentation for TELEMETRY from the Foundation repository.

Foundation Telemetry Architecture - Production-Grade Event-Driven Infrastructure

Executive Summary

The Foundation Telemetry System provides a comprehensive, event-driven infrastructure for observability, test synchronization, and system coordination. It bridges the gap between minimal telemetry wrappers and production-grade observability platforms while maintaining the elegant test synchronization patterns that eliminate Process.sleep anti-patterns.

๐ŸŽฏ Core Philosophy: Event-Driven Everything

The system should emit events for everything it does, making all operations observable and all async operations testable through event-driven synchronization rather than arbitrary delays.

Key Principles

  1. Every Operation Emits Events - All services, processes, and operations emit structured telemetry events
  2. Tests Subscribe to Events - Replace Process.sleep/1 with event-driven assertions
  3. Production Observability - Comprehensive metrics collection, alerting, and dashboards
  4. Deterministic Testing - Tests wait for specific events, not arbitrary time periods
  5. Service Integration - All Foundation services emit operational telemetry automatically

๐Ÿ—๏ธ System Architecture

Layer 1: Core Telemetry Infrastructure

Foundation.Telemetry (Enhanced)

Current: Basic :telemetry wrapper
Enhanced: Production-ready telemetry orchestrator

New Capabilities:

  • Structured metric types (counter, gauge, histogram, summary)
  • Event routing and filtering
  • Handler lifecycle management
  • Event correlation and causality tracking
  • Performance optimization with handler pooling

Foundation.Services.TelemetryService

Purpose: Centralized telemetry collection and aggregation service
Responsibilities:

  • Metric storage with configurable retention
  • Real-time metric aggregation
  • Alert evaluation and firing
  • Handler registration and health monitoring
  • VM metrics integration
  • Cross-service event correlation

Layer 2: Service Integration Telemetry

Automatic Service Instrumentation

Every Foundation service automatically emits operational telemetry:

Foundation.Services.CircuitBreaker:

# Automatic events
[:foundation, :circuit_breaker, :call_start] - %{service_id: id, timeout: ms}
[:foundation, :circuit_breaker, :call_success] - %{service_id: id, duration: ms}  
[:foundation, :circuit_breaker, :call_failure] - %{service_id: id, error: reason}
[:foundation, :circuit_breaker, :state_change] - %{service_id: id, from: state, to: state}
[:foundation, :circuit_breaker, :health_check] - %{service_id: id, healthy: boolean}

Foundation.Services.RetryService:

[:foundation, :retry, :attempt_start] - %{attempt: n, total_attempts: max}
[:foundation, :retry, :attempt_success] - %{attempt: n, duration: ms}
[:foundation, :retry, :attempt_failure] - %{attempt: n, error: reason}
[:foundation, :retry, :backoff] - %{delay: ms, strategy: atom}
[:foundation, :retry, :exhausted] - %{total_attempts: n, final_error: reason}

Foundation.Services.ConnectionManager:

[:foundation, :connection, :pool_created] - %{pool_id: id, size: n}
[:foundation, :connection, :connection_acquired] - %{pool_id: id, duration: ms}
[:foundation, :connection, :connection_released] - %{pool_id: id}
[:foundation, :connection, :connection_failed] - %{pool_id: id, error: reason}
[:foundation, :connection, :pool_health] - %{pool_id: id, active: n, idle: n}

Health Check Integration

[:foundation, :service, :health_check] - %{service: name, status: atom, metrics: map}
[:foundation, :service, :started] - %{service: name, pid: pid}
[:foundation, :service, :stopped] - %{service: name, reason: atom}
[:foundation, :service, :restarted] - %{service: name, restart_count: n}

Layer 3: MABEAM Multi-Agent Telemetry

Agent Lifecycle Events

[:foundation, :mabeam, :agent, :registered] - %{agent_id: id, capabilities: list}
[:foundation, :mabeam, :agent, :unregistered] - %{agent_id: id, reason: atom}
[:foundation, :mabeam, :agent, :health_check] - %{agent_id: id, healthy: boolean, metrics: map}

Coordination Protocol Events

[:foundation, :mabeam, :coordination, :started] - %{session_id: id, agents: list}
[:foundation, :mabeam, :coordination, :message_sent] - %{from: id, to: id, type: atom}
[:foundation, :mabeam, :coordination, :message_received] - %{from: id, to: id, duration: ms}
[:foundation, :mabeam, :coordination, :completed] - %{session_id: id, result: atom, duration: ms}
[:foundation, :mabeam, :coordination, :failed] - %{session_id: id, error: reason}

Layer 4: Advanced Analytics & Alerting

Foundation.Telemetry.Analytics

Real-time Metric Processing:

  • Time-series data collection
  • Moving averages and percentiles
  • Anomaly detection algorithms
  • Performance trend analysis
  • Cross-service correlation

Foundation.Telemetry.Alerting

Configurable Alert System:

  • Rule-based alert definitions
  • Escalation policies
  • Alert correlation and deduplication
  • Integration with external systems (PagerDuty, Slack)

๐Ÿงช Test Philosophy: Event-Driven Testing

The Problem with Process.sleep/1

Anti-Pattern:

# WRONG: Guessing how long an operation takes
CircuitBreaker.call(service_id, fn -> raise "fail" end)
Process.sleep(50)  # Hope state changed
{:ok, status} = CircuitBreaker.get_status(service_id)
assert status == :open

Problems:

  • Flaky: May not be long enough under load
  • Slow: Forces unnecessary waiting even when operation completes quickly
  • Masks Issues: Hides race conditions instead of solving them
  • Unreliable: System load affects test reliability

Event-Driven Solution

Correct Pattern:

# RIGHT: Wait for the specific event that indicates completion
test "circuit breaker opens on failure" do
  # Subscribe to state change events
  assert_telemetry_event [:foundation, :circuit_breaker, :state_change],
    %{service_id: "test_service", to: :open} do
    # Trigger the operation
    CircuitBreaker.call("test_service", fn -> raise "fail" end)
  end
  
  # Now we know the state changed, verify it
  {:ok, :open} = CircuitBreaker.get_status("test_service")
end

Benefits:

  • Deterministic: Test completes as soon as event occurs
  • Fast: No arbitrary delays
  • Reliable: Works under any system load
  • Clear: Test expresses exactly what it’s waiting for
  • Reveals Issues: Real race conditions become apparent

Comprehensive Test Helpers

Foundation.TelemetryTestHelpers

defmodule Foundation.TelemetryTestHelpers do
  @moduledoc """
  Comprehensive helpers for event-driven testing.
  Replaces Process.sleep with deterministic event-based synchronization.
  """

  # Basic event assertion  
  defmacro assert_telemetry_event(event_pattern, metadata_pattern, do: block)
  defmacro refute_telemetry_event(event_pattern, metadata_pattern, do: block)
  
  # Multiple event coordination
  defmacro assert_telemetry_sequence(event_sequence, do: block)
  defmacro assert_telemetry_any(event_patterns, do: block)
  
  # Service lifecycle coordination
  def wait_for_service_ready(service_name, timeout \\ 5000)
  def wait_for_service_health(service_name, expected_status, timeout \\ 5000)
  def wait_for_process_restart(process_name, timeout \\ 5000)
  
  # Agent coordination  
  def wait_for_agent_registration(agent_id, timeout \\ 5000)
  def wait_for_coordination_completion(session_id, timeout \\ 5000)
  
  # Resource coordination
  def wait_for_resource_cleanup(resource_type, timeout \\ 5000)
  def wait_for_connection_pool_ready(pool_id, timeout \\ 5000)
  
  # Performance-based coordination
  def wait_for_metric_threshold(metric_name, threshold, timeout \\ 5000)
  def wait_for_health_improvement(service_name, timeout \\ 5000)
end

Usage Examples

Circuit Breaker Testing

test "circuit breaker failure threshold" do
  service_id = "test_service"
  
  # Test that 3 failures open the circuit
  assert_telemetry_event [:foundation, :circuit_breaker, :state_change],
    %{service_id: ^service_id, to: :open} do
    
    # Generate exactly 3 failures
    for _ <- 1..3 do
      CircuitBreaker.call(service_id, fn -> raise "fail" end)
    end
  end
  
  # Verify circuit is open
  {:ok, :open} = CircuitBreaker.get_status(service_id)
end

Service Restart Testing

test "connection manager recovers from crash" do
  manager_pid = Process.whereis(Foundation.Services.ConnectionManager)
  
  # Wait for restart event, not arbitrary time
  assert_telemetry_event [:foundation, :service, :restarted],
    %{service: Foundation.Services.ConnectionManager} do
    
    Process.exit(manager_pid, :kill)
  end
  
  # Service is guaranteed to be restarted now
  new_pid = Process.whereis(Foundation.Services.ConnectionManager)
  assert is_pid(new_pid)
  assert new_pid != manager_pid
end

Multi-Agent Coordination Testing

test "agent coordination completes successfully" do
  agents = ["agent_1", "agent_2", "agent_3"]
  
  # Wait for coordination completion, not sleep
  assert_telemetry_event [:foundation, :mabeam, :coordination, :completed],
    %{result: :success} do
    
    MABEAM.Core.coordinate_agents(agents, :test_task)
  end
  
  # Coordination is guaranteed complete
  for agent_id <- agents do
    {:ok, :idle} = MABEAM.AgentRegistry.get_agent_status(agent_id)
  end
end

Complex Event Sequencing

test "retry service exhausts attempts then circuit breaker opens" do
  service_id = "flaky_service"
  
  # Wait for the complete sequence of events
  assert_telemetry_sequence [
    {[:foundation, :retry, :exhausted], %{total_attempts: 3}},
    {[:foundation, :circuit_breaker, :state_change], %{to: :open}}
  ] do
    # Trigger the chain of events
    CircuitBreaker.call(service_id, fn -> 
      RetryService.retry(fn -> raise "persistent failure" end, max_attempts: 3)
    end)
  end
  
  # Both systems are in expected final state
  {:ok, :open} = CircuitBreaker.get_status(service_id)
end

๐Ÿ”ง Implementation Strategy

Phase 1: Enhanced Core Infrastructure (Week 1)

  • Enhance Foundation.Telemetry with structured metrics
  • Implement Foundation.Services.TelemetryService
  • Add VM metrics integration
  • Create comprehensive test helpers

Phase 2: Service Integration (Week 2)

  • Add telemetry to all Foundation services
  • Implement automatic health check telemetry
  • Create service-specific event patterns
  • Migrate critical tests to event-driven patterns

Phase 3: MABEAM Integration (Week 3)

  • Add agent lifecycle telemetry
  • Implement coordination protocol events
  • Create multi-agent test helpers
  • Performance monitoring for agent systems

Phase 4: Advanced Analytics (Week 4)

  • Real-time metric aggregation
  • Alert configuration system
  • Performance trend analysis
  • Cross-service correlation

Phase 5: Process.sleep Elimination (Week 5)

  • Systematic replacement across all test files
  • Validation of event-driven patterns
  • Performance measurement and optimization
  • Documentation and training

๐Ÿ“Š Event Taxonomy

Standard Event Structure

# Event Name Pattern: [:foundation, :service, :operation]
# Metadata Pattern: %{standard_fields + operation_specific_fields}

%{
  # Standard fields (always present)
  timestamp: DateTime.utc_now(),
  duration: 1_234,  # microseconds (when applicable)
  node: Node.self(),
  
  # Context fields (when applicable)  
  service: Foundation.Services.ServiceName,
  process_id: self(),
  correlation_id: "uuid",
  
  # Operation-specific fields
  # ... varies by operation
}

Critical Events for Test Synchronization

Service Lifecycle

  • [:foundation, :service, :started] - Service initialized and ready
  • [:foundation, :service, :health_check] - Periodic health status
  • [:foundation, :service, :restarted] - Service recovered from crash

Async Operations

  • [:foundation, :async, :started] - Async operation initiated
  • [:foundation, :async, :completed] - Async operation finished
  • [:foundation, :async, :failed] - Async operation failed

Resource Management

  • [:foundation, :resource, :acquired] - Resource allocation completed
  • [:foundation, :resource, :released] - Resource cleanup completed
  • [:foundation, :resource, :exhausted] - Resource pool full

๐ŸŽฏ Success Metrics

Test Suite Performance

  • Target: 90% reduction in Process.sleep usage
  • Expected: 50%+ faster test execution
  • Quality: Zero flaky tests due to timing issues

Production Observability

  • Coverage: 100% of services emit operational telemetry
  • Alerting: Comprehensive alert coverage for all critical operations
  • Performance: Sub-millisecond telemetry event processing

Developer Experience

  • Clarity: Tests clearly express what they’re waiting for
  • Reliability: Tests pass consistently under load
  • Speed: Faster feedback loops during development

๐Ÿ”„ Migration from Process.sleep

Systematic Replacement Strategy

  1. Identify the Async Operation: What is the test actually waiting for?
  2. Find the Completion Event: What telemetry event indicates completion?
  3. Replace with Event Assertion: Use assert_telemetry_event instead of sleep
  4. Verify Deterministic Behavior: Test passes reliably under load

Common Patterns

Sleep PatternEvent-Driven Replacement
Process.sleep(50) after process killassert_telemetry_event [:foundation, :service, :restarted]
Process.sleep(100) after async taskassert_telemetry_event [:foundation, :async, :completed]
Process.sleep(200) after circuit breakerassert_telemetry_event [:foundation, :circuit_breaker, :state_change]
Process.sleep(30) after signal emitassert_telemetry_event [:jido, :signal, :processed]

๐Ÿ›ก๏ธ Edge Cases and Fallbacks

When Event-Driven Testing Isn’t Suitable

  1. Testing the Telemetry System Itself:

    # When testing telemetry event emission, you can't wait for the event you're testing
    test "telemetry events are emitted correctly" do
      capture_telemetry [:foundation, :test, :event] do
        Foundation.Telemetry.emit([:foundation, :test, :event], %{data: "test"})
      end
    end
    
  2. External System Integration:

    # When waiting for external systems that don't emit our telemetry
    test "external API responds" do
      # Acceptable use of minimal delay for external systems
      :timer.sleep(100)  # Much better than Process.sleep(100)
      response = ExternalAPI.check_status()
      assert response.status == :ready
    end
    
  3. Degraded Mode Testing:

    # When testing system behavior with telemetry disabled
    test "system works without telemetry" do
      :telemetry.detach_all()
      # Limited use of polling acceptable here
      wait_for(fn -> Service.ready?() end, 1000)
    end
    
  4. Time-Based Business Logic:

    # When testing actual time-dependent behavior
    test "rate limiter resets after interval" do
      # Here the time delay IS the feature being tested
      :timer.sleep(1100)  # Rate limit window is 1000ms
      assert RateLimiter.allow?(bucket_id)
    end
    

Fallback Strategies

  1. Hybrid Approach: Event-driven primary, polling fallback
  2. Timeout Configuration: Reasonable timeouts for event assertions
  3. Circuit Breaker Pattern: Graceful degradation when telemetry unavailable
  4. Test Environment Detection: Different strategies for different environments

๐ŸŽ‰ Conclusion

The Foundation Telemetry System transforms the codebase from guess-based async testing to deterministic event-driven coordination. By making every operation observable and every async operation testable through events, we achieve:

  • Fast, Reliable Tests - No more flaky timing issues
  • Production Observability - Comprehensive system monitoring
  • Clear Intent - Tests express exactly what they’re verifying
  • Future-Proof Architecture - Event-driven systems scale naturally

The key insight is that the system should tell us when it’s ready rather than us guessing when it might be ready. This philosophy transforms both testing and production operations into deterministic, observable, and maintainable patterns.


Foundation Telemetry Architecture
Version 1.0 - Production-Grade Event-Driven Infrastructure
Focus: Eliminate Process.sleep, Enable Observable Systems