Of course. I can see why you’re frustrated. The heavy use of Process.sleep
in a test suite is a significant red flag that indicates deeper issues with concurrency management and test design. It leads to slow, flaky, and unreliable tests.
Let’s break down the problems comprehensively and create a clear, actionable plan to fix them.
Comprehensive Overview of Design Flaws
The core problem is a lack of deterministic synchronization in asynchronous tests. Instead of waiting for a specific state or event to occur, the tests are “guessing” how long an operation will take and sleeping for that duration. This is fundamentally flawed.
Here are the specific design flaws this creates:
1. Over-reliance on Process.sleep
for Asynchronous Testing
This is the most critical and visible flaw. Process.sleep
is used as a blunt instrument to wait for asynchronous operations to complete, which is a major anti-pattern.
Why it’s a problem:
- Flakiness: If the system is under heavy load or running on a slower machine, the sleep duration may not be long enough, causing the test to fail.
- Slowness: The test suite is artificially slowed down by these fixed waits. A
Process.sleep(200)
call forces the test to wait even if the operation finished in 5ms. Cumulatively, this adds minutes to the test run. - Masking Race Conditions: It hides, rather than solves, underlying race conditions. The code isn’t guaranteed to be correct; it just happens to work most of the time on the developer’s machine.
Code Examples:
foundation/infrastructure/circuit_breaker_test.exs
:Process.sleep(50)
is used to “Give fuse time to register the failures.” The test should be waiting for the circuit breaker’s state to change, not for a fixed time.foundation/services/connection_manager_test.exs
:Process.sleep(200)
is used to “Wait a moment for restart” after killing a process. The test should deterministically wait for the supervisor to restart the child.jido_foundation/mabeam_coordination_test.exs
:Process.sleep(50)
is used to “Allow message processing.” The test should wait for a confirmation message from the agent or check its state directly.
2. Inconsistent Test Isolation and Setup
The codebase has a mix of testing patterns, indicating a transition period that was never completed.
- The Good: The
support/unified_test_foundation.ex
andsupport/test_isolation.ex
modules are excellent. They provide a robust framework for creating fully isolated test environments with their own supervisors, registries, and signal buses. This is the correct way to write concurrent tests in OTP. - The Bad: Many tests (
circuit_breaker_test.exs
,rate_limiter_test.exs
, etc.) do not use this foundation. They rely on global processes (Process.whereis(Foundation.Infrastructure.CircuitBreaker)
) or manually manage setup and teardown, which is prone to contamination and race conditions between tests running in parallel.
3. Lack of Deterministic Event and State Verification
Asynchronous systems communicate through messages and state changes. The tests rarely use these mechanisms for verification.
- Instead of
Process.sleep(200)
to wait for a restart: The test should monitor the process and useassert_receive {:DOWN, ...}
to confirm the old process is dead, then pollProcess.whereis
until the new process appears. Thesupport/async_test_helpers.ex
file already contains await_for/3
function for this exact purpose. - Instead of
Process.sleep(50)
to wait for a GenServer to process a message: The test should either useGenServer.call
(which is synchronous) to trigger the action and verify the state in the reply, or if usingcast
, send a subsequentcall
to a “sync” function that confirms the state. - Instead of
Process.sleep(50)
to wait for a telemetry event: The tests should use:telemetry.attach
andassert_receive
to wait for the specific event to be emitted. This is both faster and 100% reliable. Theassert_telemetry_event
macro inasync_test_helpers.ex
is perfect for this.
A Plan to Fix The Codebase
The good news is that the tools to fix this are already in your repository. The problem is inconsistent application. The plan is to systematically refactor the tests to use the modern, robust patterns available in support/
.
Phase 1: Eradicate Process.sleep
with Deterministic Helpers
This will provide the biggest and most immediate improvement in test suite speed and reliability.
Action: Go through every test file and replace every Process.sleep
call with a deterministic alternative.
Waiting for a process to die/restart:
- Before:
Process.exit(manager_pid, :kill) Process.sleep(200) // Hope it restarted new_manager_pid = Process.whereis(Foundation.Services.ConnectionManager)
- After (using
wait_for
from your helpers):import Foundation.AsyncTestHelpers Process.exit(manager_pid, :kill) // Deterministically wait for the supervisor to restart the process new_manager_pid = wait_for(fn -> case Process.whereis(Foundation.Services.ConnectionManager) do pid when pid != manager_pid and is_pid(pid) -> pid _ -> nil end end, 5000) assert is_pid(new_manager_pid)
- Before:
Waiting for a state change in a GenServer:
- Before:
CircuitBreaker.call(service_id, fn -> raise "fail" end) Process.sleep(50) // Hope the state is updated {:ok, status} = CircuitBreaker.get_status(service_id) assert status == :open
- After (using
wait_for
):import Foundation.AsyncTestHelpers CircuitBreaker.call(service_id, fn -> raise "fail" end) // Deterministically wait for the state to become :open wait_for(fn -> case CircuitBreaker.get_status(service_id) do {:ok, :open} -> true _ -> nil end end) // Now we can safely assert {:ok, :open} = CircuitBreaker.get_status(service_id)
- Before:
Waiting for a Telemetry Event:
- Before:
// Trigger action... Process.sleep(50) // Hope event was sent // Assert some side-effect
- After (using the existing
assert_telemetry_event
helper):import Foundation.AsyncTestHelpers assert_telemetry_event [:foundation, :circuit_breaker, :call_success], %{count: 1} do CircuitBreaker.call(service_id, fn -> {:ok, :success} end) end
- Before:
Phase 2: Systematic Migration to UnifiedTestFoundation
Every test file should be using the unified foundation for proper isolation.
Action: For each test file NOT using use Foundation.UnifiedTestFoundation
, perform the following migration.
Change the
use
directive:- Before:
use ExUnit.Case, async: true
- After:
use Foundation.UnifiedTestFoundation, :registry
(or another appropriate mode). TheTESTING_GUIDE.md
provides guidance on which mode to use.
- Before:
Remove manual setup: Delete any manual
MABEAM.AgentRegistry.start_link
or other setup code that is now handled automatically by the foundation.Use the provided context: Refactor tests to accept and use the context map provided by the setup.
- Before:
test "my test" do ... end
- After:
test "my test", %{registry: registry, test_context: ctx} do ... end
- Before:
Enable
async: true
: Once a test file is properly isolated with the foundation, you can almost always enableasync: true
, drastically improving parallelism.
Phase 3: Enforce the Standard with CI and Tooling
To prevent this from happening again, you need to enforce the new standards.
- Add a Credo Check: Implement a custom Credo check that completely bans
Process.sleep/1
from all files insidefoundation/
,jido_foundation/
, andjido_system/
. This will fail the CI build if anyone tries to add it back. - Update
TESTING_GUIDE.md
: Explicitly state thatProcess.sleep
is forbidden in tests and that all new tests must useFoundation.UnifiedTestFoundation
. - Code Reviews: Make “No
Process.sleep
” and “UsesUnifiedTestFoundation
” mandatory checklist items for any PR that touches tests.
By following this plan, you will transform your test suite from a slow, flaky liability into a fast, reliable, and deterministic asset that gives you true confidence in your concurrent system. The foundation for this is already built; it just needs to be used everywhere.