TELEMETRY

Documentation for TELEMETRY from the Foundation repository.

Foundation Telemetry Architecture - Production-Grade Event-Driven Infrastructure

Executive Summary

The Foundation Telemetry System provides a comprehensive, event-driven infrastructure for observability, test synchronization, and system coordination. It bridges the gap between minimal telemetry wrappers and production-grade observability platforms while maintaining the elegant test synchronization patterns that eliminate Process.sleep anti-patterns.

🎯 Core Philosophy: Event-Driven Everything

The system should emit events for everything it does, making all operations observable and all async operations testable through event-driven synchronization rather than arbitrary delays.

Key Principles

Every Operation Emits Events - All services, processes, and operations emit structured telemetry events
Tests Subscribe to Events - Replace Process.sleep/1 with event-driven assertions
Production Observability - Comprehensive metrics collection, alerting, and dashboards
Deterministic Testing - Tests wait for specific events, not arbitrary time periods
Service Integration - All Foundation services emit operational telemetry automatically

🏗️ System Architecture

Layer 1: Core Telemetry Infrastructure

Foundation.Telemetry (Enhanced)

Current: Basic :telemetry wrapper
Enhanced: Production-ready telemetry orchestrator

New Capabilities:

Structured metric types (counter, gauge, histogram, summary)
Event routing and filtering
Handler lifecycle management
Event correlation and causality tracking
Performance optimization with handler pooling

Foundation.Services.TelemetryService

Purpose: Centralized telemetry collection and aggregation service
Responsibilities:

Metric storage with configurable retention
Real-time metric aggregation
Alert evaluation and firing
Handler registration and health monitoring
VM metrics integration
Cross-service event correlation

Layer 2: Service Integration Telemetry

Automatic Service Instrumentation

Every Foundation service automatically emits operational telemetry:

Foundation.Services.CircuitBreaker:

# Automatic events
[:foundation, :circuit_breaker, :call_start] - %{service_id: id, timeout: ms}
[:foundation, :circuit_breaker, :call_success] - %{service_id: id, duration: ms}  
[:foundation, :circuit_breaker, :call_failure] - %{service_id: id, error: reason}
[:foundation, :circuit_breaker, :state_change] - %{service_id: id, from: state, to: state}
[:foundation, :circuit_breaker, :health_check] - %{service_id: id, healthy: boolean}

Foundation.Services.RetryService:

[:foundation, :retry, :attempt_start] - %{attempt: n, total_attempts: max}
[:foundation, :retry, :attempt_success] - %{attempt: n, duration: ms}
[:foundation, :retry, :attempt_failure] - %{attempt: n, error: reason}
[:foundation, :retry, :backoff] - %{delay: ms, strategy: atom}
[:foundation, :retry, :exhausted] - %{total_attempts: n, final_error: reason}

Foundation.Services.ConnectionManager:

[:foundation, :connection, :pool_created] - %{pool_id: id, size: n}
[:foundation, :connection, :connection_acquired] - %{pool_id: id, duration: ms}
[:foundation, :connection, :connection_released] - %{pool_id: id}
[:foundation, :connection, :connection_failed] - %{pool_id: id, error: reason}
[:foundation, :connection, :pool_health] - %{pool_id: id, active: n, idle: n}

Health Check Integration

[:foundation, :service, :health_check] - %{service: name, status: atom, metrics: map}
[:foundation, :service, :started] - %{service: name, pid: pid}
[:foundation, :service, :stopped] - %{service: name, reason: atom}
[:foundation, :service, :restarted] - %{service: name, restart_count: n}

Layer 3: MABEAM Multi-Agent Telemetry

Agent Lifecycle Events

[:foundation, :mabeam, :agent, :registered] - %{agent_id: id, capabilities: list}
[:foundation, :mabeam, :agent, :unregistered] - %{agent_id: id, reason: atom}
[:foundation, :mabeam, :agent, :health_check] - %{agent_id: id, healthy: boolean, metrics: map}

Coordination Protocol Events

[:foundation, :mabeam, :coordination, :started] - %{session_id: id, agents: list}
[:foundation, :mabeam, :coordination, :message_sent] - %{from: id, to: id, type: atom}
[:foundation, :mabeam, :coordination, :message_received] - %{from: id, to: id, duration: ms}
[:foundation, :mabeam, :coordination, :completed] - %{session_id: id, result: atom, duration: ms}
[:foundation, :mabeam, :coordination, :failed] - %{session_id: id, error: reason}

Layer 4: Advanced Analytics & Alerting

Foundation.Telemetry.Analytics

Real-time Metric Processing:

Time-series data collection
Moving averages and percentiles
Anomaly detection algorithms
Performance trend analysis
Cross-service correlation

Foundation.Telemetry.Alerting

Configurable Alert System:

Rule-based alert definitions
Escalation policies
Alert correlation and deduplication
Integration with external systems (PagerDuty, Slack)

🧪 Test Philosophy: Event-Driven Testing

The Problem with `Process.sleep/1`

Anti-Pattern:

# WRONG: Guessing how long an operation takes
CircuitBreaker.call(service_id, fn -> raise "fail" end)
Process.sleep(50)  # Hope state changed
{:ok, status} = CircuitBreaker.get_status(service_id)
assert status == :open

Problems:

Flaky: May not be long enough under load
Slow: Forces unnecessary waiting even when operation completes quickly
Masks Issues: Hides race conditions instead of solving them
Unreliable: System load affects test reliability

Event-Driven Solution

Correct Pattern:

# RIGHT: Wait for the specific event that indicates completion
test "circuit breaker opens on failure" do
  # Subscribe to state change events
  assert_telemetry_event [:foundation, :circuit_breaker, :state_change],
    %{service_id: "test_service", to: :open} do
    # Trigger the operation
    CircuitBreaker.call("test_service", fn -> raise "fail" end)
  end
  
  # Now we know the state changed, verify it
  {:ok, :open} = CircuitBreaker.get_status("test_service")
end

Benefits:

Deterministic: Test completes as soon as event occurs
Fast: No arbitrary delays
Reliable: Works under any system load
Clear: Test expresses exactly what it’s waiting for
Reveals Issues: Real race conditions become apparent

Comprehensive Test Helpers

Foundation.TelemetryTestHelpers

defmodule Foundation.TelemetryTestHelpers do
  @moduledoc """
  Comprehensive helpers for event-driven testing.
  Replaces Process.sleep with deterministic event-based synchronization.
  """

  # Basic event assertion  
  defmacro assert_telemetry_event(event_pattern, metadata_pattern, do: block)
  defmacro refute_telemetry_event(event_pattern, metadata_pattern, do: block)
  
  # Multiple event coordination
  defmacro assert_telemetry_sequence(event_sequence, do: block)
  defmacro assert_telemetry_any(event_patterns, do: block)
  
  # Service lifecycle coordination
  def wait_for_service_ready(service_name, timeout \\ 5000)
  def wait_for_service_health(service_name, expected_status, timeout \\ 5000)
  def wait_for_process_restart(process_name, timeout \\ 5000)
  
  # Agent coordination  
  def wait_for_agent_registration(agent_id, timeout \\ 5000)
  def wait_for_coordination_completion(session_id, timeout \\ 5000)
  
  # Resource coordination
  def wait_for_resource_cleanup(resource_type, timeout \\ 5000)
  def wait_for_connection_pool_ready(pool_id, timeout \\ 5000)
  
  # Performance-based coordination
  def wait_for_metric_threshold(metric_name, threshold, timeout \\ 5000)
  def wait_for_health_improvement(service_name, timeout \\ 5000)
end

Usage Examples

Circuit Breaker Testing

test "circuit breaker failure threshold" do
  service_id = "test_service"
  
  # Test that 3 failures open the circuit
  assert_telemetry_event [:foundation, :circuit_breaker, :state_change],
    %{service_id: ^service_id, to: :open} do
    
    # Generate exactly 3 failures
    for _ <- 1..3 do
      CircuitBreaker.call(service_id, fn -> raise "fail" end)
    end
  end
  
  # Verify circuit is open
  {:ok, :open} = CircuitBreaker.get_status(service_id)
end

Service Restart Testing

test "connection manager recovers from crash" do
  manager_pid = Process.whereis(Foundation.Services.ConnectionManager)
  
  # Wait for restart event, not arbitrary time
  assert_telemetry_event [:foundation, :service, :restarted],
    %{service: Foundation.Services.ConnectionManager} do
    
    Process.exit(manager_pid, :kill)
  end
  
  # Service is guaranteed to be restarted now
  new_pid = Process.whereis(Foundation.Services.ConnectionManager)
  assert is_pid(new_pid)
  assert new_pid != manager_pid
end

Multi-Agent Coordination Testing

test "agent coordination completes successfully" do
  agents = ["agent_1", "agent_2", "agent_3"]
  
  # Wait for coordination completion, not sleep
  assert_telemetry_event [:foundation, :mabeam, :coordination, :completed],
    %{result: :success} do
    
    MABEAM.Core.coordinate_agents(agents, :test_task)
  end
  
  # Coordination is guaranteed complete
  for agent_id <- agents do
    {:ok, :idle} = MABEAM.AgentRegistry.get_agent_status(agent_id)
  end
end

Complex Event Sequencing

test "retry service exhausts attempts then circuit breaker opens" do
  service_id = "flaky_service"
  
  # Wait for the complete sequence of events
  assert_telemetry_sequence [
    {[:foundation, :retry, :exhausted], %{total_attempts: 3}},
    {[:foundation, :circuit_breaker, :state_change], %{to: :open}}
  ] do
    # Trigger the chain of events
    CircuitBreaker.call(service_id, fn -> 
      RetryService.retry(fn -> raise "persistent failure" end, max_attempts: 3)
    end)
  end
  
  # Both systems are in expected final state
  {:ok, :open} = CircuitBreaker.get_status(service_id)
end

🔧 Implementation Strategy

Phase 1: Enhanced Core Infrastructure (Week 1)

Enhance Foundation.Telemetry with structured metrics
Implement Foundation.Services.TelemetryService
Add VM metrics integration
Create comprehensive test helpers

Phase 2: Service Integration (Week 2)

Add telemetry to all Foundation services
Implement automatic health check telemetry
Create service-specific event patterns
Migrate critical tests to event-driven patterns

Phase 3: MABEAM Integration (Week 3)

Add agent lifecycle telemetry
Implement coordination protocol events
Create multi-agent test helpers
Performance monitoring for agent systems

Phase 4: Advanced Analytics (Week 4)

Real-time metric aggregation
Alert configuration system
Performance trend analysis
Cross-service correlation

Phase 5: Process.sleep Elimination (Week 5)

Systematic replacement across all test files
Validation of event-driven patterns
Performance measurement and optimization
Documentation and training

📊 Event Taxonomy

Standard Event Structure

# Event Name Pattern: [:foundation, :service, :operation]
# Metadata Pattern: %{standard_fields + operation_specific_fields}

%{
  # Standard fields (always present)
  timestamp: DateTime.utc_now(),
  duration: 1_234,  # microseconds (when applicable)
  node: Node.self(),
  
  # Context fields (when applicable)  
  service: Foundation.Services.ServiceName,
  process_id: self(),
  correlation_id: "uuid",
  
  # Operation-specific fields
  # ... varies by operation
}

Critical Events for Test Synchronization

Service Lifecycle

[:foundation, :service, :started] - Service initialized and ready
[:foundation, :service, :health_check] - Periodic health status
[:foundation, :service, :restarted] - Service recovered from crash

Async Operations

[:foundation, :async, :started] - Async operation initiated
[:foundation, :async, :completed] - Async operation finished
[:foundation, :async, :failed] - Async operation failed

Resource Management

[:foundation, :resource, :acquired] - Resource allocation completed
[:foundation, :resource, :released] - Resource cleanup completed
[:foundation, :resource, :exhausted] - Resource pool full

🎯 Success Metrics

Test Suite Performance

Target: 90% reduction in Process.sleep usage
Expected: 50%+ faster test execution
Quality: Zero flaky tests due to timing issues

Production Observability

Coverage: 100% of services emit operational telemetry
Alerting: Comprehensive alert coverage for all critical operations
Performance: Sub-millisecond telemetry event processing

Developer Experience

Clarity: Tests clearly express what they’re waiting for
Reliability: Tests pass consistently under load
Speed: Faster feedback loops during development

🔄 Migration from Process.sleep

Systematic Replacement Strategy

Identify the Async Operation: What is the test actually waiting for?
Find the Completion Event: What telemetry event indicates completion?
Replace with Event Assertion: Use assert_telemetry_event instead of sleep
Verify Deterministic Behavior: Test passes reliably under load

Common Patterns

Sleep Pattern	Event-Driven Replacement
`Process.sleep(50)` after process kill	`assert_telemetry_event [:foundation, :service, :restarted]`
`Process.sleep(100)` after async task	`assert_telemetry_event [:foundation, :async, :completed]`
`Process.sleep(200)` after circuit breaker	`assert_telemetry_event [:foundation, :circuit_breaker, :state_change]`
`Process.sleep(30)` after signal emit	`assert_telemetry_event [:jido, :signal, :processed]`

🛡️ Edge Cases and Fallbacks

When Event-Driven Testing Isn’t Suitable

Testing the Telemetry System Itself:

# When testing telemetry event emission, you can't wait for the event you're testing
test "telemetry events are emitted correctly" do
  capture_telemetry [:foundation, :test, :event] do
    Foundation.Telemetry.emit([:foundation, :test, :event], %{data: "test"})
  end
end

External System Integration:

# When waiting for external systems that don't emit our telemetry
test "external API responds" do
  # Acceptable use of minimal delay for external systems
  :timer.sleep(100)  # Much better than Process.sleep(100)
  response = ExternalAPI.check_status()
  assert response.status == :ready
end

Degraded Mode Testing:

# When testing system behavior with telemetry disabled
test "system works without telemetry" do
  :telemetry.detach_all()
  # Limited use of polling acceptable here
  wait_for(fn -> Service.ready?() end, 1000)
end

Time-Based Business Logic:

# When testing actual time-dependent behavior
test "rate limiter resets after interval" do
  # Here the time delay IS the feature being tested
  :timer.sleep(1100)  # Rate limit window is 1000ms
  assert RateLimiter.allow?(bucket_id)
end

Fallback Strategies

Hybrid Approach: Event-driven primary, polling fallback
Timeout Configuration: Reasonable timeouts for event assertions
Circuit Breaker Pattern: Graceful degradation when telemetry unavailable
Test Environment Detection: Different strategies for different environments

🎉 Conclusion

The Foundation Telemetry System transforms the codebase from guess-based async testing to deterministic event-driven coordination. By making every operation observable and every async operation testable through events, we achieve:

Fast, Reliable Tests - No more flaky timing issues
Production Observability - Comprehensive system monitoring
Clear Intent - Tests express exactly what they’re verifying
Future-Proof Architecture - Event-driven systems scale naturally

The key insight is that the system should tell us when it’s ready rather than us guessing when it might be ready. This philosophy transforms both testing and production operations into deterministic, observable, and maintainable patterns.

Foundation Telemetry Architecture
Version 1.0 - Production-Grade Event-Driven Infrastructure
Focus: Eliminate Process.sleep, Enable Observable Systems