JULY 1 2025 SLEEP TEST FIXES 01

Documentation for JULY_1_2025_SLEEP_TEST_FIXES_01 from the Foundation repository.

Sleep Test Fixes Strategy - July 1, 2025

Executive Summary

Following the successful fix of an intermittent test failure in agent_registry_test.exs, this document outlines a systematic approach to fixing the remaining 106 sleep instances across the Foundation test suite.

Context and References

Required Reading

test/SLEEPFIXES.md - Comprehensive 85+ sleep migration completed June 30, 2025
- Documents patterns for replacing Process.sleep with deterministic helpers
- Shows successful migration of task_agent_test.exs as gold standard
- Provides phase-by-phase migration strategy
test/THE_TELEMETRY_MANUAL.md - Foundation telemetry system documentation
- Event-driven testing without timing hacks
- Telemetry event structure and categories
- Test helper macros for event assertions
test/support/async_test_helpers.ex - Async testing utilities
- wait_for/3 function for polling conditions
- Acceptable replacement for Process.sleep in complex scenarios

Problem Analysis

Current State

106 instances of sleep patterns remain after June 30 migration
These represent either:
1. New tests added after the migration
2. Tests missed in the original sweep
3. Legitimate timing requirements (rare)

Sleep Anti-Pattern Categories

Process Lifecycle Synchronization (Most Common)

# BAD: Arbitrary wait for process death/restart
Process.exit(pid, :kill)
Process.sleep(100)
assert new_pid = Process.whereis(Name)

Async Operation Completion

# BAD: Fixed delay for background work
Task.async(fn -> do_work() end)
Process.sleep(500)
assert work_completed?()

State Change Verification

# BAD: Hope state changed after delay
CircuitBreaker.trip(service)
Process.sleep(50)
assert CircuitBreaker.open?(service)

Fix Strategy

Pattern 1: Telemetry-Based Synchronization (PREFERRED)

When to use: When the system emits telemetry events for state changes

Example: Agent registry cleanup (from today’s fix)

# GOOD: Wait for specific telemetry event
ref = make_ref()
:telemetry.attach(
  "test-#{inspect(ref)}",
  [:foundation, :mabeam, :registry, :agent_down],
  fn _event, _measurements, metadata, config ->
    if metadata.agent_id == agent_id do
      send(config.test_pid, {:agent_cleaned_up, metadata})
    end
  end,
  %{test_pid: self()}
)

Process.exit(agent_pid, :kill)
assert_receive {:agent_cleaned_up, _metadata}, 5000
:telemetry.detach("test-#{inspect(ref)}")

Pattern 2: State Polling with wait_for

When to use: When no telemetry events available but state is queryable

Example: Service restart detection

# GOOD: Poll for state change
import Foundation.AsyncTestHelpers

old_pid = Process.whereis(ServiceName)
Process.exit(old_pid, :kill)

new_pid = wait_for(fn ->
  case Process.whereis(ServiceName) do
    nil -> nil
    pid when pid != old_pid -> pid
    _ -> nil
  end
end, 5000)

assert is_pid(new_pid)
assert new_pid != old_pid

Pattern 3: Synchronous Task Completion

When to use: For Task-based async operations

Example: Background work completion

# GOOD: Use Task.await for synchronous completion
task = Task.async(fn -> expensive_computation() end)
result = Task.await(task, 10_000)
assert {:ok, _} = result

Pattern 4: Message-Based Synchronization

When to use: When you control both sides of communication

Example: Process coordination

# GOOD: Use messages for synchronization
test_pid = self()
worker = spawn(fn ->
  do_setup()
  send(test_pid, :ready)
  receive do
    :continue -> do_work()
  end
end)

assert_receive :ready, 1000
send(worker, :continue)

Implementation Plan

Phase 1: Critical Infrastructure Tests (HIGH PRIORITY)

Files with timing-sensitive operations that affect system stability:

circuit_breaker_test.exs - State transitions need telemetry events
resource_manager_test.exs - Resource lifecycle events
cache_telemetry_test.exs - Cache operation events

Phase 2: Integration Tests (MEDIUM PRIORITY)

Files testing cross-component interactions:

integration_validation_test.exs - Service coordination
supervision_crash_recovery_test.exs - OTP supervision
signal_routing_test.exs - Message delivery

Phase 3: Agent/Sensor Tests (LOW PRIORITY)

Files with agent lifecycle testing:

foundation_agent_test.exs - Agent state changes
system_health_sensor_test.exs - Periodic measurements
task_agent_test.exs - Already partially migrated

Key Principles

No Arbitrary Delays: Every wait must be for a specific, observable condition
Use System Events: Prefer telemetry/message-based synchronization
Bounded Waits: All async operations must have reasonable timeouts
Clear Intent: The wait condition should document what we’re waiting for
Test Isolation: Each test should clean up its telemetry handlers

Common Pitfalls to Avoid

Don’t Replace Sleep with Sleep: Using :timer.sleep is just as bad
Don’t Poll Too Frequently: Use reasonable intervals (10-50ms)
Don’t Ignore Timeouts: Failed waits should fail the test
Don’t Leak Handlers: Always detach telemetry handlers in test cleanup

Success Metrics

Zero sleep instances in test files (except async_test_helpers.ex internals)
Reduced test execution time by eliminating unnecessary delays
Zero intermittent failures from timing issues
Clear test intent through explicit wait conditions

Example Migration

Before (Intermittent Failure)

test "automatically removes agent when process dies" do
  short_lived = spawn(fn -> :timer.sleep(10) end)
  :ok = Foundation.register("agent", short_lived, metadata, registry)
  :timer.sleep(50)  # Hope cleanup happened
  assert :error = Foundation.lookup("agent", registry)
end

After (Deterministic)

test "automatically removes agent when process dies" do
  test_pid = self()
  short_lived = spawn(fn ->
    send(test_pid, :ready)
    receive do: (:exit -> :ok)
  end)
  
  assert_receive :ready
  :ok = Foundation.register("agent", short_lived, metadata, registry)
  
  # Set up telemetry handler for cleanup event
  ref = make_ref()
  :telemetry.attach(
    "test-#{inspect(ref)}",
    [:foundation, :mabeam, :registry, :agent_down],
    fn _, _, metadata, config ->
      if metadata.agent_id == "agent" do
        send(config.test_pid, :agent_cleaned_up)
      end
    end,
    %{test_pid: test_pid}
  )
  
  Process.exit(short_lived, :kill)
  assert_receive :agent_cleaned_up, 5000
  assert :error = Foundation.lookup("agent", registry)
  
  :telemetry.detach("test-#{inspect(ref)}")
end

Next Steps

Audit: Run detailed analysis of all 106 sleep instances
Categorize: Group by pattern type and priority
Migrate: Fix files in priority order using appropriate patterns
Validate: Ensure all tests pass reliably under load
Prevent: Add CI checks to prevent new sleep usage

Concrete Examples from Current Codebase

Example 1: Circuit Breaker State Transition (circuit_breaker_test.exs)

# CURRENT (line 120)
# Wait for recovery timeout (150ms as configured)
:timer.sleep(150)
:ok = CircuitBreaker.reset(service_id)

# SHOULD BE:
# Wait for circuit breaker timeout event
ref = make_ref()
:telemetry.attach(
  "test-cb-#{inspect(ref)}",
  [:foundation, :circuit_breaker, :recovery_timeout],
  fn _, _, %{service_id: id}, config ->
    if id == service_id do
      send(config.test_pid, :recovery_timeout_reached)
    end
  end,
  %{test_pid: self()}
)

assert_receive :recovery_timeout_reached, 1000
:ok = CircuitBreaker.reset(service_id)
:telemetry.detach("test-cb-#{inspect(ref)}")

Example 2: Process Leak Detection (supervision_crash_recovery_test.exs)

# CURRENT (line ~25)
on_exit(fn ->
  # Wait for cleanup (minimal delay)
  :timer.sleep(20)
  # Verify no process leaks
  final_process_count = :erlang.system_info(:process_count)
end)

# SHOULD BE:
on_exit(fn ->
  # Wait for all supervised processes to terminate
  import Foundation.AsyncTestHelpers
  
  wait_for(fn ->
    current_count = :erlang.system_info(:process_count)
    if current_count <= initial_process_count + tolerance do
      true
    else
      nil
    end
  end, 1000)
end)

Example 3: Batch Processing Completion

# CURRENT (various files)
TaskPoolManager.execute_batch(:pool, items, fn i ->
  :timer.sleep(10)  # Simulate work
  i * 10
end)

# SHOULD BE:
# For tests, use synchronous execution or Task.await
task = Task.async(fn ->
  TaskPoolManager.execute_batch(:pool, items, fn i ->
    # Actual work without sleep
    i * 10
  end)
end)
results = Task.await(task, 5000)
assert length(results) == length(items)

File-Specific Recommendations

High-Priority Files (System Critical)

circuit_breaker_test.exs
- Add telemetry events for state transitions
- Replace recovery timeout sleep with event
resource_manager_test.exs
- Use resource acquisition/release events
- Poll resource availability state
cache_telemetry_test.exs
- Already uses telemetry, just needs sleep removal
- Use existing cache hit/miss events

Medium-Priority Files (Integration Tests)

supervision_crash_recovery_test.exs
- Multiple process lifecycle sleeps
- Use process monitoring and wait_for
integration_validation_test.exs
- Service restart verification sleeps
- Use Process.whereis polling pattern
signal_routing_test.exs
- Message delivery confirmation
- Use receive with timeout

Conclusion

The successful migration of agent_registry_test.exs demonstrates that all sleep-based synchronization can be replaced with deterministic patterns. By following the strategies outlined in this document and leveraging Foundation’s telemetry system, we can eliminate the remaining 106 instances and achieve a fully deterministic test suite.