TEST PHILOSOPHY

Documentation for TEST_PHILOSOPHY from the Foundation repository.

Foundation Test Philosophy - Event-Driven Deterministic Testing

Core Philosophy: The System Should Tell Us When It’s Ready

Fundamental Principle: Tests should wait for the system to explicitly signal completion rather than guessing how long operations take.

“A test that uses Process.sleep/1 is a test that doesn’t understand what it’s waiting for.”

The Anti-Pattern: Process.sleep-Based Testing

What We’re Fighting Against

Sleep-Based Testing represents a fundamental misunderstanding of asynchronous systems:

# ANTI-PATTERN: Guessing completion times
test "service restarts after crash" do
  pid = Process.whereis(MyService)
  Process.exit(pid, :kill)
  Process.sleep(200)  # 🚨 GUESS: "200ms should be enough"
  
  new_pid = Process.whereis(MyService)
  assert new_pid != pid
end

Problems with this approach:

  • Unreliable: Works on developer machine, fails in CI under load
  • Slow: Forces unnecessary waiting even when the operation completes in 5ms
  • Flaky: Different system loads produce different timing
  • Masks Issues: Hides race conditions instead of exposing them
  • Non-Deterministic: Same test can pass or fail based on timing luck

The Cognitive Failure

Sleep-based testing indicates we don’t understand our own system:

  • We don’t know when the operation actually completes
  • We don’t know what signals indicate completion
  • We don’t trust our system’s ability to communicate its state
  • We resort to “magical thinking” about timing

The Solution: Event-Driven Testing

Principle 1: Observable Operations

Every operation should emit events that indicate its lifecycle:

# CORRECT: Event-driven testing
test "service restarts after crash" do
  pid = Process.whereis(MyService)
  
  # Wait for the explicit restart event
  assert_telemetry_event [:foundation, :service, :restarted],
    %{service: MyService, new_pid: new_pid} do
    Process.exit(pid, :kill)
  end
  
  # Guaranteed: service has restarted
  assert new_pid != pid
  assert Process.alive?(new_pid)
end

Benefits:

  • Deterministic: Completes as soon as event occurs
  • Fast: No arbitrary delays
  • Reliable: Works under any system load
  • Clear: Expresses exactly what we’re waiting for
  • Exposes Issues: Real race conditions become apparent

Principle 2: System Self-Description

The system should describe its own state changes rather than tests guessing:

# System emits: [:foundation, :circuit_breaker, :state_change]
# Test listens: assert_telemetry_event [:foundation, :circuit_breaker, :state_change]

# System emits: [:foundation, :async, :completed] 
# Test listens: assert_telemetry_event [:foundation, :async, :completed]

# System emits: [:foundation, :resource, :cleanup_finished]
# Test listens: assert_telemetry_event [:foundation, :resource, :cleanup_finished]
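The assert_telemetry_event helper used throughout this document is assumed rather than shown. The do-block form in the examples would be implemented as a macro; a function-based sketch, assuming only ExUnit and the :telemetry library (module and helper names are illustrative), conveys the mechanism:

defmodule Foundation.TelemetryHelpers do
  import ExUnit.Assertions

  # Attach a one-off handler, run the operation, then block until the event
  # arrives or `timeout` ms elapse. Returns the event metadata.
  def assert_telemetry_event(event, operation, timeout \\ 1_000) do
    handler_id = {:assert_telemetry_event, make_ref()}
    test_pid = self()

    :telemetry.attach(handler_id, event, fn ^event, measurements, metadata, _config ->
      send(test_pid, {:telemetry_event, event, measurements, metadata})
    end, nil)

    try do
      operation.()
      assert_receive {:telemetry_event, ^event, _measurements, metadata}, timeout
      metadata
    after
      :telemetry.detach(handler_id)
    end
  end
end

A test would call it as assert_telemetry_event([:foundation, :service, :restarted], fn -> Process.exit(pid, :kill) end); the do-block syntax shown in the examples would be a thin macro over exactly this function.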

Principle 3: Explicit Dependencies

Tests should explicitly wait for their dependencies rather than hoping timing works out:

# WRONG: Hope both operations complete in time
test "complex workflow" do
  start_async_operation_1()
  start_async_operation_2()  
  Process.sleep(500)  # 🚨 Hope both finished
  verify_combined_result()
end

# RIGHT: Wait for explicit completion of each dependency
test "complex workflow" do
  assert_telemetry_event [:app, :operation_1, :completed], %{} do
    start_async_operation_1()
  end
  
  assert_telemetry_event [:app, :operation_2, :completed], %{} do
    start_async_operation_2()
  end
  
  # Both operations guaranteed complete
  verify_combined_result()
end

Testing Categories and Strategies

Category 1: Standard Async Operations (95% of cases)

Rule: Use event-driven coordination

Examples:

  • Service startup/shutdown
  • Process restarts
  • Background task completion
  • State machine transitions
  • Resource allocation/cleanup
  • Circuit breaker state changes
  • Agent coordination

Pattern:

test "operation completes correctly" do
  assert_telemetry_event [:system, :operation, :completed], expected_metadata do
    trigger_async_operation()
  end
  
  verify_final_state()
end

Category 2: Testing Telemetry System Itself (2% of cases)

Rule: Use telemetry capture patterns

Challenge: We can’t wait on events when event emission itself is under test, so we capture everything emitted and inspect it

Pattern:

test "system emits correct events" do
  events = capture_telemetry [:system, :operation] do
    perform_operation()
  end
  
  assert length(events) == 2
  assert Enum.any?(events, fn {event, _metadata} ->
    event == [:system, :operation, :started]
  end)
end
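Like assert_telemetry_event, capture_telemetry is an assumed helper. A minimal sketch, with one caveat: :telemetry handlers attach to exact event names, so this version takes the full list of expected events rather than the prefix shown above:

# Attach one handler for every event of interest, run the operation, then
# drain the mailbox and return the captured {event, metadata} pairs.
def capture_telemetry(events, operation) do
  handler_id = {:capture_telemetry, make_ref()}
  test_pid = self()

  :telemetry.attach_many(handler_id, events, fn event, _measurements, metadata, _config ->
    send(test_pid, {:captured, event, metadata})
  end, nil)

  try do
    operation.()
    collect_captured([])
  after
    :telemetry.detach(handler_id)
  end
end

# Handlers run synchronously in the emitting process, so events emitted by
# synchronous code are already in the mailbox here. For async emitters,
# replace `after 0` with a bounded timeout.
defp collect_captured(acc) do
  receive do
    {:captured, event, metadata} -> collect_captured([{event, metadata} | acc])
  after
    0 -> Enum.reverse(acc)
  end
end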

Category 3: External System Integration (2% of cases)

Rule: Limited use of minimal delays acceptable

Challenge: External systems don’t emit our telemetry events

Pattern:

test "external API integration" do
  trigger_external_call()
  
  # Acceptable: minimal, bounded delay for an external system we don't control
  :timer.sleep(100)  # Note: equivalent to Process.sleep/1; document it as an exception
  
  # Or better: Polling with timeout
  wait_for(fn -> ExternalSystem.ready?() end, 2000)
end
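wait_for/2 is not a standard library function; a minimal polling sketch might look like this:

# Poll `condition` every `interval` ms until it returns a truthy value,
# failing once `timeout` ms have elapsed. Bounded polling is the sanctioned
# fallback when there is no event to observe.
def wait_for(condition, timeout \\ 2_000, interval \\ 25) do
  deadline = System.monotonic_time(:millisecond) + timeout
  do_wait_for(condition, deadline, interval)
end

defp do_wait_for(condition, deadline, interval) do
  cond do
    condition.() ->
      :ok

    System.monotonic_time(:millisecond) >= deadline ->
      {:error, :timeout}

    true ->
      Process.sleep(interval)
      do_wait_for(condition, deadline, interval)
  end
end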

Category 4: Time-Based Business Logic (1% of cases)

Rule: Delay IS the feature being tested

Examples: Rate limiting, timeouts, scheduled operations

Pattern:

test "rate limiter resets after window" do
  # Fill the rate limit bucket
  fill_rate_limit_bucket()
  assert :denied = RateLimiter.check(user_id)
  
  # Wait for reset window (this IS the feature)  
  :timer.sleep(rate_limit_window_ms + 10)
  
  # Should be reset now
  assert :allowed = RateLimiter.check(user_id)
end

Advanced Testing Patterns

Pattern 1: Event Sequences

For complex workflows with multiple async steps:

test "distributed agent coordination" do
  agents = ["agent_1", "agent_2", "agent_3"]
  
  # Wait for entire sequence of coordination events
  assert_telemetry_sequence [
    {[:mabeam, :coordination, :started], %{agents: ^agents}},
    {[:mabeam, :coordination, :sync_point], %{phase: 1}},
    {[:mabeam, :coordination, :sync_point], %{phase: 2}},
    {[:mabeam, :coordination, :completed], %{result: :success}}
  ] do
    MABEAM.coordinate_agents(agents, :complex_task)
  end
  
  # All agents guaranteed to be in final state
  for agent <- agents do
    assert {:ok, :idle} = MABEAM.get_agent_status(agent)
  end
end
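assert_telemetry_sequence is another assumed helper. The pin patterns (^agents) would require a macro; a function-based sketch that compares metadata by equality still shows the core idea:

# Attach one handler for all expected events, run the operation, then require
# each event to arrive in the declared order within `timeout` ms.
def assert_telemetry_sequence(expected, operation, timeout \\ 1_000) do
  handler_id = {:assert_telemetry_sequence, make_ref()}
  test_pid = self()
  event_names = expected |> Enum.map(fn {event, _meta} -> event end) |> Enum.uniq()

  :telemetry.attach_many(handler_id, event_names, fn event, _measurements, metadata, _config ->
    send(test_pid, {:telemetry_event, event, metadata})
  end, nil)

  try do
    operation.()

    for {event, expected_meta} <- expected do
      # Receiving in mailbox order enforces the declared sequence
      assert_receive {:telemetry_event, received_event, metadata}, timeout
      assert received_event == event

      for {key, value} <- expected_meta do
        assert metadata[key] == value
      end
    end
  after
    :telemetry.detach(handler_id)
  end
end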

Pattern 2: Conditional Events

For operations that may succeed or fail:

test "circuit breaker opens on threshold" do
  service_id = "test_service"
  
  # Wait for either success or circuit opening
  assert_telemetry_any [
    {[:foundation, :circuit_breaker, :state_change], %{to: :open}},
    {[:foundation, :circuit_breaker, :call_success], %{}}
  ] do
    # Generate load that should trip circuit breaker
    generate_load(service_id, failure_rate: 0.8)
  end
  
  # React based on which event occurred
  case CircuitBreaker.get_status(service_id) do
    {:ok, :open} -> assert_circuit_breaker_behavior()
    {:ok, :closed} -> assert_successful_operation()
  end
end
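assert_telemetry_any follows the same shape: attach to every candidate event and succeed on whichever arrives first. A sketch:

# Wait for the first of several candidate events and return the winning
# {event, metadata} pair so the test can branch on what actually happened.
def assert_telemetry_any(candidates, operation, timeout \\ 1_000) do
  handler_id = {:assert_telemetry_any, make_ref()}
  test_pid = self()
  event_names = Enum.map(candidates, fn {event, _meta} -> event end)

  :telemetry.attach_many(handler_id, event_names, fn event, _measurements, metadata, _config ->
    send(test_pid, {:telemetry_event, event, metadata})
  end, nil)

  try do
    operation.()
    assert_receive {:telemetry_event, event, metadata}, timeout
    {event, metadata}
  after
    :telemetry.detach(handler_id)
  end
end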

Pattern 3: Performance-Based Coordination

For operations where completion is indicated by performance metrics:

test "system reaches steady state performance" do
  start_system_under_load()
  
  # Wait for performance to stabilize
  wait_for_metric_threshold("response_time_p95", 100, timeout: 30_000)
  wait_for_metric_threshold("error_rate", 0.01, timeout: 30_000)
  
  # System guaranteed to be in steady state
  verify_steady_state_behavior()
end
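wait_for_metric_threshold is assumed as well. Reusing the wait_for/3 polling sketch from Category 3, and a hypothetical Metrics.get/1 lookup into whatever metrics store the system uses, it could be as simple as:

# Poll until the named metric drops to or below `threshold`, failing at
# `timeout`. Metrics.get/1 is a hypothetical lookup, not a real API.
def wait_for_metric_threshold(metric_name, threshold, opts \\ []) do
  timeout = Keyword.get(opts, :timeout, 30_000)

  wait_for(
    fn ->
      case Metrics.get(metric_name) do
        {:ok, value} -> value <= threshold
        _other -> false
      end
    end,
    timeout
  )
end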

Error Patterns and Solutions

Error Pattern 1: “It works on my machine”

Symptom: Tests pass locally but fail in CI

Root Cause: Sleep durations tuned for developer machine performance

Solution: Replace with event-driven coordination

# WRONG: Tuned for local machine
Process.sleep(50)  # Works locally, fails in CI

# RIGHT: Works everywhere
assert_telemetry_event [:system, :ready], %{}

Error Pattern 2: “Flaky tests”

Symptom: Tests sometimes pass, sometimes fail

Root Cause: Race conditions masked by arbitrary delays

Solution: Expose and fix the race condition

# WRONG: Masks race condition
async_operation()
Process.sleep(100)  # Sometimes not enough
verify_result()

# RIGHT: Exposes race condition, forces proper fix
assert_telemetry_event [:system, :operation, :completed], %{} do
  async_operation()
end
verify_result()  # If this fails, there's a real bug to fix

Error Pattern 3: “Slow test suite”

Symptom: Test suite takes minutes to run

Root Cause: Cumulative sleep delays

Solution: Event-driven testing eliminates unnecessary waiting

# WRONG: 2000ms of forced waiting per test
Process.sleep(2000)

# RIGHT: Completes in 5-50ms typically  
assert_telemetry_event [:system, :ready], %{}

Implementation Guidelines

Guideline 1: Start with Event Design

Before writing the test, design the events:

  1. What operation am I testing?
  2. What events should this operation emit?
  3. What metadata indicates successful completion?
  4. What events indicate failure modes?

Guideline 2: Test-Driven Event Design

Let test requirements drive telemetry design:

# If the test needs this event...
assert_telemetry_event [:myapp, :user, :registered], %{user_id: user_id}

# Then the implementation must emit it
def register_user(user_params) do
  # ... registration logic ...
  
  :telemetry.execute([:myapp, :user, :registered], %{}, %{
    user_id: user.id,
    timestamp: DateTime.utc_now()
  })
  
  {:ok, user}
end

Guideline 3: Fail Fast on Missing Events

Make missing events obvious:

# Use reasonable timeouts that fail fast
assert_telemetry_event [:system, :ready], %{}, timeout: 1000

# Better to fail fast than wait forever
# If event doesn't occur in 1 second, something is wrong

Guideline 4: Comprehensive Event Coverage

Emit events for all significant operations (a sketch follows the list):

  • Process startup/shutdown
  • State transitions
  • Resource allocation/deallocation
  • External system calls
  • Error conditions
  • Performance milestones
  • Configuration changes

Metrics and Success Criteria

Test Suite Health Metrics

  • Sleep Usage: Target 0 instances of Process.sleep/1 in tests
  • Flaky Test Rate: Target <0.1% flaky test failures
  • Test Suite Speed: Target 50%+ faster execution
  • Coverage: 95%+ of async operations have event-driven tests

Code Quality Metrics

  • Event Coverage: All services emit operational telemetry
  • Test Clarity: Tests clearly express what they’re waiting for
  • Determinism: Tests pass reliably under load
  • Maintainability: New developers can understand test intent

Anti-Pattern Detection

Code Review Checklist

  • No Process.sleep/1 in test files (exceptions documented)
  • Async operations have corresponding telemetry events
  • Tests use assert_telemetry_event instead of arbitrary delays
  • Event-driven patterns used for all async coordination
  • Fallback patterns documented for edge cases

Automated Detection

  • Credo Rule: Ban Process.sleep/1 in test files
  • CI Check: Fail builds that contain sleep-based tests
  • Metrics: Track event-driven test adoption rate
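A custom Credo check for the first item might look like the following; this is a sketch based on Credo's documented custom-check shape, so details may need adjusting for your Credo version:

defmodule Foundation.Checks.NoSleepInTests do
  use Credo.Check,
    base_priority: :high,
    category: :warning,
    explanations: [check: "Tests must not use Process.sleep/1; use event-driven helpers."]

  @impl true
  def run(source_file, params) do
    issue_meta = IssueMeta.for(source_file, params)
    Credo.Code.prewalk(source_file, &traverse(&1, &2, issue_meta))
  end

  # Flag remote calls of the form Process.sleep(...)
  defp traverse({{:., _, [{:__aliases__, _, [:Process]}, :sleep]}, meta, _args} = ast, issues, issue_meta) do
    issue = format_issue(issue_meta, message: "Avoid Process.sleep/1 in tests", line_no: meta[:line])
    {ast, [issue | issues]}
  end

  defp traverse(ast, issues, _issue_meta), do: {ast, issues}
end

Scoping the check to test files would happen in .credo.exs, e.g. {Foundation.Checks.NoSleepInTests, files: %{included: ["test/"]}} (per-check file scoping is supported in recent Credo versions).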

Philosophy Summary

Core Beliefs

  1. Systems should be self-describing - Operations emit events describing their lifecycle
  2. Tests should be deterministic - Wait for specific conditions, not arbitrary time
  3. Async operations need async coordination - Event-driven patterns for async testing
  4. Race conditions should be exposed, not hidden - Proper synchronization over timing luck
  5. Fast feedback loops - Tests complete as soon as conditions are met

The Goal

Transform testing from “guessing game” to “conversation with the system”:

  • Instead of guessing timing, listen to system events
  • Instead of hoping for completion, wait for explicit signals
  • Instead of masking race conditions, expose and fix them
  • Instead of slow, unreliable tests, fast, deterministic verification

When we achieve this, testing becomes a conversation between the test and the system, where the system explicitly tells us when it’s ready for the next assertion. This creates tests that are fast, reliable, clear, and maintainable.

