Sleep Test Fixes Phase 2 - Comprehensive Plan
Overview
Following successful Phase 1 fixes (19 instances), we need to systematically fix the remaining ~65 sleep instances across the Foundation test suite.
Current Status
Phase 1 Achievements:
- Fixed 19 sleep instances
- Established proven patterns
- All affected tests now passing
- Uncovered and resolved fundamental architectural issues
Remaining Work:
- ~65 sleep instances across multiple test files
- Various patterns requiring different solutions
- Some legitimate uses (TTL testing) to preserve
Categorized Fix Plan
Category 1: Process Lifecycle Management (HIGH PRIORITY)
Pattern: Tests waiting for process death/restart
Solution: Use process monitoring and wait_for
Files to fix:
- supervision_crash_recovery_test.exs - Multiple process restart waits
- resource_leak_detection_test.exs - Process cleanup verification
- foundation_agent_test.exs - Agent lifecycle testing
Example fix:
# BEFORE
Process.exit(pid, :kill)
Process.sleep(50)
# AFTER
ref = Process.monitor(pid)
Process.exit(pid, :kill)
assert_receive {:DOWN, ^ref, :process, ^pid, _}, 5000
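The monitor pattern above confirms that the old process died. Supervision tests usually also need to know when the replacement is running. The following is a minimal sketch of one way to do that, assuming the supervised child is registered under a name. `RestartHelper` and `kill_and_await_restart/2` are hypothetical names for illustration, not existing Foundation helpers.

```elixir
defmodule RestartHelper do
  # Hypothetical helper (not part of Foundation): kills the process registered
  # under `name`, waits for its :DOWN message, then polls until the supervisor
  # has registered a replacement pid under the same name.
  def kill_and_await_restart(name, timeout \\ 5_000) do
    old_pid = Process.whereis(name)
    ref = Process.monitor(old_pid)
    Process.exit(old_pid, :kill)

    receive do
      {:DOWN, ^ref, :process, ^old_pid, _reason} -> :ok
    after
      timeout -> raise "#{inspect(name)} did not exit within #{timeout}ms"
    end

    deadline = System.monotonic_time(:millisecond) + timeout
    await_new_pid(name, old_pid, deadline)
  end

  defp await_new_pid(name, old_pid, deadline) do
    case Process.whereis(name) do
      pid when is_pid(pid) and pid != old_pid ->
        pid

      _ ->
        if System.monotonic_time(:millisecond) >= deadline do
          raise "#{inspect(name)} was not restarted in time"
        else
          # Short poll interval; the loop ends as soon as the new pid appears.
          Process.sleep(5)
          await_new_pid(name, old_pid, deadline)
        end
    end
  end
end
```

The short sleep inside the polling loop only sets the retry interval. The wait ends as soon as the condition holds, so the test does not pay a fixed delay.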
Category 2: State Change Verification (HIGH PRIORITY)
Pattern: Waiting for GenServer state updates
Solution: Use wait_for with state polling
Files to fix:
- circuit_breaker_test.exs - State transition verification (already partially fixed)
- rate_limiter_test.exs - Rate limit window updates
- cache_test.exs - Cache operation verification
Example fix:
# BEFORE
GenServer.cast(server, :update)
Process.sleep(100)
state = GenServer.call(server, :get_state)
# AFTER
GenServer.cast(server, :update)
wait_for(fn ->
  case GenServer.call(server, :get_state) do
    %{updated: true} -> true
    _ -> nil
  end
end)
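For readers unfamiliar with the polling approach, here is a rough sketch of how a helper like wait_for could work. It is an assumption, not the actual Foundation.AsyncTestHelpers.wait_for/3 implementation. The module name `PollingSketch`, the argument order, and the defaults are hypothetical. It treats nil and false as "not ready yet", which matches the example above.

```elixir
defmodule PollingSketch do
  # Hypothetical stand-in for a polling helper; the real
  # Foundation.AsyncTestHelpers.wait_for/3 may differ in signature and defaults.
  # Calls `check_fun` until it returns a truthy value, then returns that value.
  def wait_for(check_fun, timeout \\ 1_000, interval \\ 10) do
    deadline = System.monotonic_time(:millisecond) + timeout
    poll(check_fun, deadline, interval)
  end

  defp poll(check_fun, deadline, interval) do
    case check_fun.() do
      result when result in [nil, false] ->
        if System.monotonic_time(:millisecond) >= deadline do
          raise "wait_for timed out"
        else
          Process.sleep(interval)
          poll(check_fun, deadline, interval)
        end

      result ->
        result
    end
  end
end
```

Unlike a fixed Process.sleep(100), this returns as soon as the condition holds and fails loudly once the timeout passes.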
Category 3: Telemetry Event Synchronization (MEDIUM PRIORITY)
Pattern: Waiting for telemetry events
Solution: Use telemetry event handlers with assert_receive
Files to fix:
- cache_telemetry_test.exs - Cache events (1 legitimate TTL sleep to keep)
- telemetry_test.exs - General telemetry testing
- span_test.exs - Span timing verification
Example fix:
# BEFORE
perform_operation()
Process.sleep(50)
assert telemetry_received?()
# AFTER
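handler_id = "test-#{inspect(make_ref())}"
:telemetry.attach(
  handler_id,
  [:foundation, :operation, :complete],
  # Handler arguments: event name, measurements, metadata, config
  fn _event, _measurements, metadata, config ->
    send(config.test_pid, {:event_received, metadata})
  end,
  %{test_pid: self()}
)
# Register cleanup first so the handler is detached even if the assertion fails
on_exit(fn -> :telemetry.detach(handler_id) end)
perform_operation()
assert_receive {:event_received, _}, 5000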
Category 4: Background Work Simulation (LOW PRIORITY)
Pattern: Simulating work with sleep
Solution: Remove sleep or use actual computation
Files to fix:
- batch_operations_test.exs - Work simulation
- performance_benchmark_test.exs - Load testing
- load_test_test.exs - Performance testing
Example fix:
# BEFORE
fn item ->
  Process.sleep(10) # Simulate work
  item * 2
end
# AFTER
fn item ->
  # Do actual computation or just return
  item * 2
end
Category 5: Test Process Management (LOW PRIORITY)
Pattern: Spawning processes with sleep(:infinity)
Solution: Usually fine, but can use receive blocks
Files to fix:
- serial_operations_test.exs - Test process spawning
- atomic_transaction_test.exs - Transaction testing
Note: spawn(fn -> :timer.sleep(:infinity) end) is generally acceptable for keeping test processes alive.
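Where a receive block is preferred over an infinite sleep, one possible shape is sketched below. `IdleProcess` is a hypothetical helper name, and the `:stop` message is an assumed convention. The advantage is that the test can shut the process down explicitly and confirm that it exited.

```elixir
defmodule IdleProcess do
  # Hypothetical helper: spawns a process that blocks in `receive`
  # until it is sent :stop, instead of sleeping forever.
  def start do
    spawn(fn ->
      receive do
        :stop -> :ok
      end
    end)
  end

  # Asks the process to stop and waits for confirmation via a monitor.
  def stop(pid, timeout \\ 1_000) do
    ref = Process.monitor(pid)
    send(pid, :stop)

    receive do
      {:DOWN, ^ref, :process, ^pid, _reason} -> :ok
    after
      timeout -> {:error, :timeout}
    end
  end
end
```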
Category 6: Legitimate Time-Based Testing (PRESERVE)
Pattern: Testing actual time-based features
Solution: Keep minimal sleep with clear documentation
Files to preserve:
- cache_telemetry_test.exs - TTL expiry testing (15ms)
- sampler_test.exs - Time-based sampling
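As an illustration of "minimal sleep with clear documentation", the sketch below shows how a preserved sleep could be annotated. `TtlSketch` is a hypothetical, deliberately tiny TTL store written only for this example. It is not the Foundation cache, and the 25 ms sleep is an assumed margin over the 15 ms TTL.

```elixir
defmodule TtlSketch do
  # Hypothetical minimal TTL store, used only to illustrate a documented sleep.
  def new, do: Agent.start_link(fn -> %{} end)

  def put(store, key, value, ttl_ms) do
    expires_at = System.monotonic_time(:millisecond) + ttl_ms
    Agent.update(store, &Map.put(&1, key, {value, expires_at}))
  end

  def get(store, key) do
    now = System.monotonic_time(:millisecond)

    Agent.get(store, fn entries ->
      case entries do
        %{^key => {value, expires_at}} when expires_at > now -> {:ok, value}
        _ -> :miss
      end
    end)
  end
end

{:ok, store} = TtlSketch.new()
TtlSketch.put(store, :session, "token", 15)

# LEGITIMATE SLEEP: the behavior under test is expiry after real elapsed time,
# so there is no event to synchronize on. 25 ms slightly exceeds the 15 ms TTL.
Process.sleep(25)
:miss = TtlSketch.get(store, :session)
```

The comment states why the sleep cannot be replaced and how its duration relates to the TTL, which is what a reviewer or an automated check needs to see.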
Implementation Strategy
Phase 2A: Critical Infrastructure (Days 1-2)
- Fix all process lifecycle sleeps in supervision tests
- Fix state change verification in core services
- Run full test suite after each file
Phase 2B: Service Layer (Days 3-4)
- Fix telemetry synchronization
- Fix rate limiting and cache tests
- Verify no performance regression
Phase 2C: Cleanup (Day 5)
- Remove work simulation sleeps
- Document legitimate time-based tests
- Add CI checks to prevent regression
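The CI check in Phase 2C could take many forms. One minimal sketch, under stated assumptions, is a scanner that flags sleep calls lacking a justification comment. The module name `SleepAudit`, the `# sleep-ok:` marker, and the default glob are all hypothetical conventions introduced here for illustration.

```elixir
defmodule SleepAudit do
  # Hypothetical CI helper: flags Process.sleep/:timer.sleep calls that do not
  # carry the (assumed) "# sleep-ok:" justification marker on the same line.
  @allow_marker "# sleep-ok:"

  defp sleep_call, do: ~r/(Process\.sleep|:timer\.sleep)\(/

  # Returns [{line_number, trimmed_line}] for each unjustified sleep in `source`.
  def violations(source) when is_binary(source) do
    source
    |> String.split("\n")
    |> Enum.with_index(1)
    |> Enum.filter(fn {line, _n} ->
      Regex.match?(sleep_call(), line) and not String.contains?(line, @allow_marker)
    end)
    |> Enum.map(fn {line, n} -> {n, String.trim(line)} end)
  end

  # Scans every file matching `glob` and returns [{path, line_number, line}].
  def scan(glob \\ "test/**/*_test.exs") do
    for path <- Path.wildcard(glob),
        {n, line} <- violations(File.read!(path)),
        do: {path, n, line}
  end
end
```

A CI step could fail the build whenever `SleepAudit.scan()` returns a non-empty list. Note that this simple version would also flag the `:timer.sleep(:infinity)` calls from Category 5, so those lines would need the marker or an exclusion rule.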
Success Metrics
- Zero sleep instances (except documented legitimate uses)
- All tests passing reliably
- Test execution time reduced by >50%
- No intermittent failures in CI
- Clear documentation for any remaining time-based tests
Tools and Patterns
Available Helpers:
- Foundation.AsyncTestHelpers.wait_for/3
- assert_telemetry_event macro
- Process monitoring patterns
- Task.await for synchronous completion
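To illustrate the Task.await item, the sketch below runs several functions concurrently and blocks until all have finished, so no sleep is needed before asserting on the results. `TaskSketch` and `run_all/2` are hypothetical names. `Task.async/1` and `Task.await_many/2` are standard library functions.

```elixir
defmodule TaskSketch do
  # Hypothetical helper: runs each zero-arity function in its own task and
  # waits for all of them, returning results in input order.
  # Exits if any task has not finished within `timeout` milliseconds.
  def run_all(funs, timeout \\ 5_000) do
    funs
    |> Enum.map(&Task.async/1)
    |> Task.await_many(timeout)
  end
end
```

Because the call returns only after every task has replied, assertions placed after it see the completed results.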
Key Principles:
- Deterministic over time-based - Always prefer explicit synchronization
- Event-driven testing - Use telemetry and messages
- State verification - Poll for expected state changes
- Document exceptions - Any remaining sleep must have clear justification
Risk Mitigation
- Test each fix - Run affected test file after changes
- Preserve semantics - Ensure test still validates intended behavior
- Watch for cascading failures - Some tests may depend on timing
- Keep audit trail - Document all changes in worklog
Next Steps
- Start with supervision_crash_recovery_test.exs (highest sleep count)
- Apply patterns systematically
- Run tests frequently
- Update worklog with progress
- Create PR after each major category
This plan provides a clear path to eliminating the remaining sleep instances while maintaining test reliability and improving overall test suite performance.