← Back to Test old20250712

TESTING GUIDE OTP

Documentation for TESTING_GUIDE_OTP from the Foundation repository.

Of course. I can see why you’re frustrated. The heavy use of Process.sleep in a test suite is a significant red flag that indicates deeper issues with concurrency management and test design. It leads to slow, flaky, and unreliable tests.

Let’s break down the problems comprehensively and create a clear, actionable plan to fix them.

Comprehensive Overview of Design Flaws

The core problem is a lack of deterministic synchronization in asynchronous tests. Instead of waiting for a specific state or event to occur, the tests are “guessing” how long an operation will take and sleeping for that duration. This is fundamentally flawed.

Here are the specific design flaws this creates:

1. Over-reliance on Process.sleep for Asynchronous Testing

This is the most critical and visible flaw. Process.sleep is used as a blunt instrument to wait for asynchronous operations to complete, which is a major anti-pattern.

  • Why it’s a problem:

    • Flakiness: If the system is under heavy load or running on a slower machine, the sleep duration may not be long enough, causing the test to fail.
    • Slowness: The test suite is artificially slowed down by these fixed waits. A Process.sleep(200) call forces the test to wait even if the operation finished in 5ms. Cumulatively, this adds minutes to the test run.
    • Masking Race Conditions: It hides, rather than solves, underlying race conditions. The code isn’t guaranteed to be correct; it just happens to work most of the time on the developer’s machine.
  • Code Examples:

    • foundation/infrastructure/circuit_breaker_test.exs: Process.sleep(50) is used to “Give fuse time to register the failures.” The test should be waiting for the circuit breaker’s state to change, not for a fixed time.
    • foundation/services/connection_manager_test.exs: Process.sleep(200) is used to “Wait a moment for restart” after killing a process. The test should deterministically wait for the supervisor to restart the child.
    • jido_foundation/mabeam_coordination_test.exs: Process.sleep(50) is used to “Allow message processing.” The test should wait for a confirmation message from the agent or check its state directly.

2. Inconsistent Test Isolation and Setup

The codebase has a mix of testing patterns, indicating a transition period that was never completed.

  • The Good: The support/unified_test_foundation.ex and support/test_isolation.ex modules are excellent. They provide a robust framework for creating fully isolated test environments with their own supervisors, registries, and signal buses. This is the correct way to write concurrent tests in OTP.
  • The Bad: Many tests (circuit_breaker_test.exs, rate_limiter_test.exs, etc.) do not use this foundation. They rely on global processes (Process.whereis(Foundation.Infrastructure.CircuitBreaker)) or manually manage setup and teardown, which is prone to contamination and race conditions between tests running in parallel.

3. Lack of Deterministic Event and State Verification

Asynchronous systems communicate through messages and state changes. The tests rarely use these mechanisms for verification.

  • Instead of Process.sleep(200) to wait for a restart: The test should monitor the process and use assert_receive {:DOWN, ...} to confirm the old process is dead, then poll Process.whereis until the new process appears. The support/async_test_helpers.ex file already contains a wait_for/3 function for this exact purpose.
  • Instead of Process.sleep(50) to wait for a GenServer to process a message: The test should either use GenServer.call (which is synchronous) to trigger the action and verify the state in the reply, or if using cast, send a subsequent call to a “sync” function that confirms the state.
  • Instead of Process.sleep(50) to wait for a telemetry event: The tests should use :telemetry.attach and assert_receive to wait for the specific event to be emitted. This is both faster and 100% reliable. The assert_telemetry_event macro in async_test_helpers.ex is perfect for this.

A Plan to Fix The Codebase

The good news is that the tools to fix this are already in your repository. The problem is inconsistent application. The plan is to systematically refactor the tests to use the modern, robust patterns available in support/.

Phase 1: Eradicate Process.sleep with Deterministic Helpers

This will provide the biggest and most immediate improvement in test suite speed and reliability.

Action: Go through every test file and replace every Process.sleep call with a deterministic alternative.

  1. Waiting for a process to die/restart:

    • Before:
      Process.exit(manager_pid, :kill)
      Process.sleep(200) // Hope it restarted
      new_manager_pid = Process.whereis(Foundation.Services.ConnectionManager)
      
    • After (using wait_for from your helpers):
      import Foundation.AsyncTestHelpers
      
      Process.exit(manager_pid, :kill)
      // Deterministically wait for the supervisor to restart the process
      new_manager_pid = wait_for(fn ->
        case Process.whereis(Foundation.Services.ConnectionManager) do
          pid when pid != manager_pid and is_pid(pid) -> pid
          _ -> nil
        end
      end, 5000)
      assert is_pid(new_manager_pid)
      
  2. Waiting for a state change in a GenServer:

    • Before:
      CircuitBreaker.call(service_id, fn -> raise "fail" end)
      Process.sleep(50) // Hope the state is updated
      {:ok, status} = CircuitBreaker.get_status(service_id)
      assert status == :open
      
    • After (using wait_for):
      import Foundation.AsyncTestHelpers
      
      CircuitBreaker.call(service_id, fn -> raise "fail" end)
      // Deterministically wait for the state to become :open
      wait_for(fn ->
        case CircuitBreaker.get_status(service_id) do
          {:ok, :open} -> true
          _ -> nil
        end
      end)
      // Now we can safely assert
      {:ok, :open} = CircuitBreaker.get_status(service_id)
      
  3. Waiting for a Telemetry Event:

    • Before:
      // Trigger action...
      Process.sleep(50) // Hope event was sent
      // Assert some side-effect
      
    • After (using the existing assert_telemetry_event helper):
      import Foundation.AsyncTestHelpers
      
      assert_telemetry_event [:foundation, :circuit_breaker, :call_success], %{count: 1} do
        CircuitBreaker.call(service_id, fn -> {:ok, :success} end)
      end
      

Phase 2: Systematic Migration to UnifiedTestFoundation

Every test file should be using the unified foundation for proper isolation.

Action: For each test file NOT using use Foundation.UnifiedTestFoundation, perform the following migration.

  1. Change the use directive:

    • Before: use ExUnit.Case, async: true
    • After: use Foundation.UnifiedTestFoundation, :registry (or another appropriate mode). The TESTING_GUIDE.md provides guidance on which mode to use.
  2. Remove manual setup: Delete any manual MABEAM.AgentRegistry.start_link or other setup code that is now handled automatically by the foundation.

  3. Use the provided context: Refactor tests to accept and use the context map provided by the setup.

    • Before: test "my test" do ... end
    • After: test "my test", %{registry: registry, test_context: ctx} do ... end
  4. Enable async: true: Once a test file is properly isolated with the foundation, you can almost always enable async: true, drastically improving parallelism.

Phase 3: Enforce the Standard with CI and Tooling

To prevent this from happening again, you need to enforce the new standards.

  1. Add a Credo Check: Implement a custom Credo check that completely bans Process.sleep/1 from all files inside foundation/, jido_foundation/, and jido_system/. This will fail the CI build if anyone tries to add it back.
  2. Update TESTING_GUIDE.md: Explicitly state that Process.sleep is forbidden in tests and that all new tests must use Foundation.UnifiedTestFoundation.
  3. Code Reviews: Make “No Process.sleep” and “Uses UnifiedTestFoundation” mandatory checklist items for any PR that touches tests.

By following this plan, you will transform your test suite from a slow, flaky liability into a fast, reliable, and deterministic asset that gives you true confidence in your concurrent system. The foundation for this is already built; it just needs to be used everywhere.