Foundation Telemetry Architecture - Production-Grade Event-Driven Infrastructure
Executive Summary
The Foundation Telemetry System provides a comprehensive, event-driven infrastructure for observability, test synchronization, and system coordination. It bridges the gap between minimal telemetry wrappers and production-grade observability platforms while maintaining the elegant test synchronization patterns that eliminate Process.sleep
anti-patterns.
๐ฏ Core Philosophy: Event-Driven Everything
The system should emit events for everything it does, making all operations observable and all async operations testable through event-driven synchronization rather than arbitrary delays.
Key Principles
- Every Operation Emits Events - All services, processes, and operations emit structured telemetry events
- Tests Subscribe to Events - Replace
Process.sleep/1
with event-driven assertions - Production Observability - Comprehensive metrics collection, alerting, and dashboards
- Deterministic Testing - Tests wait for specific events, not arbitrary time periods
- Service Integration - All Foundation services emit operational telemetry automatically
๐๏ธ System Architecture
Layer 1: Core Telemetry Infrastructure
Foundation.Telemetry (Enhanced)
Current: Basic :telemetry
wrapper
Enhanced: Production-ready telemetry orchestrator
New Capabilities:
- Structured metric types (counter, gauge, histogram, summary)
- Event routing and filtering
- Handler lifecycle management
- Event correlation and causality tracking
- Performance optimization with handler pooling
Foundation.Services.TelemetryService
Purpose: Centralized telemetry collection and aggregation service
Responsibilities:
- Metric storage with configurable retention
- Real-time metric aggregation
- Alert evaluation and firing
- Handler registration and health monitoring
- VM metrics integration
- Cross-service event correlation
Layer 2: Service Integration Telemetry
Automatic Service Instrumentation
Every Foundation service automatically emits operational telemetry:
Foundation.Services.CircuitBreaker:
# Automatic events
[:foundation, :circuit_breaker, :call_start] - %{service_id: id, timeout: ms}
[:foundation, :circuit_breaker, :call_success] - %{service_id: id, duration: ms}
[:foundation, :circuit_breaker, :call_failure] - %{service_id: id, error: reason}
[:foundation, :circuit_breaker, :state_change] - %{service_id: id, from: state, to: state}
[:foundation, :circuit_breaker, :health_check] - %{service_id: id, healthy: boolean}
Foundation.Services.RetryService:
[:foundation, :retry, :attempt_start] - %{attempt: n, total_attempts: max}
[:foundation, :retry, :attempt_success] - %{attempt: n, duration: ms}
[:foundation, :retry, :attempt_failure] - %{attempt: n, error: reason}
[:foundation, :retry, :backoff] - %{delay: ms, strategy: atom}
[:foundation, :retry, :exhausted] - %{total_attempts: n, final_error: reason}
Foundation.Services.ConnectionManager:
[:foundation, :connection, :pool_created] - %{pool_id: id, size: n}
[:foundation, :connection, :connection_acquired] - %{pool_id: id, duration: ms}
[:foundation, :connection, :connection_released] - %{pool_id: id}
[:foundation, :connection, :connection_failed] - %{pool_id: id, error: reason}
[:foundation, :connection, :pool_health] - %{pool_id: id, active: n, idle: n}
Health Check Integration
[:foundation, :service, :health_check] - %{service: name, status: atom, metrics: map}
[:foundation, :service, :started] - %{service: name, pid: pid}
[:foundation, :service, :stopped] - %{service: name, reason: atom}
[:foundation, :service, :restarted] - %{service: name, restart_count: n}
Layer 3: MABEAM Multi-Agent Telemetry
Agent Lifecycle Events
[:foundation, :mabeam, :agent, :registered] - %{agent_id: id, capabilities: list}
[:foundation, :mabeam, :agent, :unregistered] - %{agent_id: id, reason: atom}
[:foundation, :mabeam, :agent, :health_check] - %{agent_id: id, healthy: boolean, metrics: map}
Coordination Protocol Events
[:foundation, :mabeam, :coordination, :started] - %{session_id: id, agents: list}
[:foundation, :mabeam, :coordination, :message_sent] - %{from: id, to: id, type: atom}
[:foundation, :mabeam, :coordination, :message_received] - %{from: id, to: id, duration: ms}
[:foundation, :mabeam, :coordination, :completed] - %{session_id: id, result: atom, duration: ms}
[:foundation, :mabeam, :coordination, :failed] - %{session_id: id, error: reason}
Layer 4: Advanced Analytics & Alerting
Foundation.Telemetry.Analytics
Real-time Metric Processing:
- Time-series data collection
- Moving averages and percentiles
- Anomaly detection algorithms
- Performance trend analysis
- Cross-service correlation
Foundation.Telemetry.Alerting
Configurable Alert System:
- Rule-based alert definitions
- Escalation policies
- Alert correlation and deduplication
- Integration with external systems (PagerDuty, Slack)
๐งช Test Philosophy: Event-Driven Testing
The Problem with Process.sleep/1
Anti-Pattern:
# WRONG: Guessing how long an operation takes
CircuitBreaker.call(service_id, fn -> raise "fail" end)
Process.sleep(50) # Hope state changed
{:ok, status} = CircuitBreaker.get_status(service_id)
assert status == :open
Problems:
- Flaky: May not be long enough under load
- Slow: Forces unnecessary waiting even when operation completes quickly
- Masks Issues: Hides race conditions instead of solving them
- Unreliable: System load affects test reliability
Event-Driven Solution
Correct Pattern:
# RIGHT: Wait for the specific event that indicates completion
test "circuit breaker opens on failure" do
# Subscribe to state change events
assert_telemetry_event [:foundation, :circuit_breaker, :state_change],
%{service_id: "test_service", to: :open} do
# Trigger the operation
CircuitBreaker.call("test_service", fn -> raise "fail" end)
end
# Now we know the state changed, verify it
{:ok, :open} = CircuitBreaker.get_status("test_service")
end
Benefits:
- Deterministic: Test completes as soon as event occurs
- Fast: No arbitrary delays
- Reliable: Works under any system load
- Clear: Test expresses exactly what it’s waiting for
- Reveals Issues: Real race conditions become apparent
Comprehensive Test Helpers
Foundation.TelemetryTestHelpers
defmodule Foundation.TelemetryTestHelpers do
@moduledoc """
Comprehensive helpers for event-driven testing.
Replaces Process.sleep with deterministic event-based synchronization.
"""
# Basic event assertion
defmacro assert_telemetry_event(event_pattern, metadata_pattern, do: block)
defmacro refute_telemetry_event(event_pattern, metadata_pattern, do: block)
# Multiple event coordination
defmacro assert_telemetry_sequence(event_sequence, do: block)
defmacro assert_telemetry_any(event_patterns, do: block)
# Service lifecycle coordination
def wait_for_service_ready(service_name, timeout \\ 5000)
def wait_for_service_health(service_name, expected_status, timeout \\ 5000)
def wait_for_process_restart(process_name, timeout \\ 5000)
# Agent coordination
def wait_for_agent_registration(agent_id, timeout \\ 5000)
def wait_for_coordination_completion(session_id, timeout \\ 5000)
# Resource coordination
def wait_for_resource_cleanup(resource_type, timeout \\ 5000)
def wait_for_connection_pool_ready(pool_id, timeout \\ 5000)
# Performance-based coordination
def wait_for_metric_threshold(metric_name, threshold, timeout \\ 5000)
def wait_for_health_improvement(service_name, timeout \\ 5000)
end
Usage Examples
Circuit Breaker Testing
test "circuit breaker failure threshold" do
service_id = "test_service"
# Test that 3 failures open the circuit
assert_telemetry_event [:foundation, :circuit_breaker, :state_change],
%{service_id: ^service_id, to: :open} do
# Generate exactly 3 failures
for _ <- 1..3 do
CircuitBreaker.call(service_id, fn -> raise "fail" end)
end
end
# Verify circuit is open
{:ok, :open} = CircuitBreaker.get_status(service_id)
end
Service Restart Testing
test "connection manager recovers from crash" do
manager_pid = Process.whereis(Foundation.Services.ConnectionManager)
# Wait for restart event, not arbitrary time
assert_telemetry_event [:foundation, :service, :restarted],
%{service: Foundation.Services.ConnectionManager} do
Process.exit(manager_pid, :kill)
end
# Service is guaranteed to be restarted now
new_pid = Process.whereis(Foundation.Services.ConnectionManager)
assert is_pid(new_pid)
assert new_pid != manager_pid
end
Multi-Agent Coordination Testing
test "agent coordination completes successfully" do
agents = ["agent_1", "agent_2", "agent_3"]
# Wait for coordination completion, not sleep
assert_telemetry_event [:foundation, :mabeam, :coordination, :completed],
%{result: :success} do
MABEAM.Core.coordinate_agents(agents, :test_task)
end
# Coordination is guaranteed complete
for agent_id <- agents do
{:ok, :idle} = MABEAM.AgentRegistry.get_agent_status(agent_id)
end
end
Complex Event Sequencing
test "retry service exhausts attempts then circuit breaker opens" do
service_id = "flaky_service"
# Wait for the complete sequence of events
assert_telemetry_sequence [
{[:foundation, :retry, :exhausted], %{total_attempts: 3}},
{[:foundation, :circuit_breaker, :state_change], %{to: :open}}
] do
# Trigger the chain of events
CircuitBreaker.call(service_id, fn ->
RetryService.retry(fn -> raise "persistent failure" end, max_attempts: 3)
end)
end
# Both systems are in expected final state
{:ok, :open} = CircuitBreaker.get_status(service_id)
end
๐ง Implementation Strategy
Phase 1: Enhanced Core Infrastructure (Week 1)
- Enhance
Foundation.Telemetry
with structured metrics - Implement
Foundation.Services.TelemetryService
- Add VM metrics integration
- Create comprehensive test helpers
Phase 2: Service Integration (Week 2)
- Add telemetry to all Foundation services
- Implement automatic health check telemetry
- Create service-specific event patterns
- Migrate critical tests to event-driven patterns
Phase 3: MABEAM Integration (Week 3)
- Add agent lifecycle telemetry
- Implement coordination protocol events
- Create multi-agent test helpers
- Performance monitoring for agent systems
Phase 4: Advanced Analytics (Week 4)
- Real-time metric aggregation
- Alert configuration system
- Performance trend analysis
- Cross-service correlation
Phase 5: Process.sleep Elimination (Week 5)
- Systematic replacement across all test files
- Validation of event-driven patterns
- Performance measurement and optimization
- Documentation and training
๐ Event Taxonomy
Standard Event Structure
# Event Name Pattern: [:foundation, :service, :operation]
# Metadata Pattern: %{standard_fields + operation_specific_fields}
%{
# Standard fields (always present)
timestamp: DateTime.utc_now(),
duration: 1_234, # microseconds (when applicable)
node: Node.self(),
# Context fields (when applicable)
service: Foundation.Services.ServiceName,
process_id: self(),
correlation_id: "uuid",
# Operation-specific fields
# ... varies by operation
}
Critical Events for Test Synchronization
Service Lifecycle
[:foundation, :service, :started]
- Service initialized and ready[:foundation, :service, :health_check]
- Periodic health status[:foundation, :service, :restarted]
- Service recovered from crash
Async Operations
[:foundation, :async, :started]
- Async operation initiated[:foundation, :async, :completed]
- Async operation finished[:foundation, :async, :failed]
- Async operation failed
Resource Management
[:foundation, :resource, :acquired]
- Resource allocation completed[:foundation, :resource, :released]
- Resource cleanup completed[:foundation, :resource, :exhausted]
- Resource pool full
๐ฏ Success Metrics
Test Suite Performance
- Target: 90% reduction in
Process.sleep
usage - Expected: 50%+ faster test execution
- Quality: Zero flaky tests due to timing issues
Production Observability
- Coverage: 100% of services emit operational telemetry
- Alerting: Comprehensive alert coverage for all critical operations
- Performance: Sub-millisecond telemetry event processing
Developer Experience
- Clarity: Tests clearly express what they’re waiting for
- Reliability: Tests pass consistently under load
- Speed: Faster feedback loops during development
๐ Migration from Process.sleep
Systematic Replacement Strategy
- Identify the Async Operation: What is the test actually waiting for?
- Find the Completion Event: What telemetry event indicates completion?
- Replace with Event Assertion: Use
assert_telemetry_event
instead of sleep - Verify Deterministic Behavior: Test passes reliably under load
Common Patterns
Sleep Pattern | Event-Driven Replacement |
---|---|
Process.sleep(50) after process kill | assert_telemetry_event [:foundation, :service, :restarted] |
Process.sleep(100) after async task | assert_telemetry_event [:foundation, :async, :completed] |
Process.sleep(200) after circuit breaker | assert_telemetry_event [:foundation, :circuit_breaker, :state_change] |
Process.sleep(30) after signal emit | assert_telemetry_event [:jido, :signal, :processed] |
๐ก๏ธ Edge Cases and Fallbacks
When Event-Driven Testing Isn’t Suitable
Testing the Telemetry System Itself:
# When testing telemetry event emission, you can't wait for the event you're testing test "telemetry events are emitted correctly" do capture_telemetry [:foundation, :test, :event] do Foundation.Telemetry.emit([:foundation, :test, :event], %{data: "test"}) end end
External System Integration:
# When waiting for external systems that don't emit our telemetry test "external API responds" do # Acceptable use of minimal delay for external systems :timer.sleep(100) # Much better than Process.sleep(100) response = ExternalAPI.check_status() assert response.status == :ready end
Degraded Mode Testing:
# When testing system behavior with telemetry disabled test "system works without telemetry" do :telemetry.detach_all() # Limited use of polling acceptable here wait_for(fn -> Service.ready?() end, 1000) end
Time-Based Business Logic:
# When testing actual time-dependent behavior test "rate limiter resets after interval" do # Here the time delay IS the feature being tested :timer.sleep(1100) # Rate limit window is 1000ms assert RateLimiter.allow?(bucket_id) end
Fallback Strategies
- Hybrid Approach: Event-driven primary, polling fallback
- Timeout Configuration: Reasonable timeouts for event assertions
- Circuit Breaker Pattern: Graceful degradation when telemetry unavailable
- Test Environment Detection: Different strategies for different environments
๐ Conclusion
The Foundation Telemetry System transforms the codebase from guess-based async testing to deterministic event-driven coordination. By making every operation observable and every async operation testable through events, we achieve:
- Fast, Reliable Tests - No more flaky timing issues
- Production Observability - Comprehensive system monitoring
- Clear Intent - Tests express exactly what they’re verifying
- Future-Proof Architecture - Event-driven systems scale naturally
The key insight is that the system should tell us when it’s ready rather than us guessing when it might be ready. This philosophy transforms both testing and production operations into deterministic, observable, and maintainable patterns.
Foundation Telemetry Architecture
Version 1.0 - Production-Grade Event-Driven Infrastructure
Focus: Eliminate Process.sleep, Enable Observable Systems