AUDIT 02 planSteps 02

Documentation for AUDIT_02_planSteps_02 from the Foundation repository.

OTP Implementation Plan - Stage 2: Test Suite Remediation

Generated: July 2, 2025
Duration: Weeks 2-3 (10 days)
Status: Ready for Implementation

Overview

This document details Stage 2 of the OTP remediation plan, focusing on eliminating test anti-patterns that create flaky, slow, and unreliable tests. This stage transforms the test suite to use deterministic, OTP-compliant patterns.

Context Documents

Parent Plan: AUDIT_02_plan.md - Full remediation strategy
Stage 1: AUDIT_02_planSteps_01.md - Enforcement infrastructure (must be completed first)
Original Audit: JULY_1_2025_PRE_PHASE_2_OTP_report_01_AUDIT_01.md - Initial findings
Test Guide: test/TESTING_GUIDE_OTP.md - Acknowledgment of test issues
Test Helpers: test/support/async_test_helpers.ex - Existing deterministic helpers
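
The migration patterns in this document lean heavily on a polling helper named wait_for. As orientation, here is a minimal sketch of how such a helper could work. The module name WaitForSketch, the default timeout, and the polling interval are assumptions for illustration; the real Foundation.AsyncTestHelpers implementation may differ.

```elixir
defmodule WaitForSketch do
  # Hypothetical stand-in for a wait_for helper; not the Foundation implementation.
  # Polls check_fun until it returns a truthy value or the timeout elapses.
  def wait_for(check_fun, timeout \\ 1_000, interval \\ 10) do
    deadline = System.monotonic_time(:millisecond) + timeout
    poll(check_fun, deadline, interval)
  end

  defp poll(check_fun, deadline, interval) do
    result = check_fun.()

    cond do
      # Truthy result: return it so callers can use the value (for example a pid)
      result ->
        result

      # Deadline passed without a truthy result: fail loudly
      System.monotonic_time(:millisecond) >= deadline ->
        raise "wait_for timed out"

      # Otherwise pause briefly and check again
      true ->
        Process.sleep(interval)
        poll(check_fun, deadline, interval)
    end
  end
end
```

The short sleep is confined to the helper and bounded by the interval, so a test finishes as soon as the condition holds instead of always paying a fixed delay.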

Current State

Process.sleep Usage (26 occurrences in 6 files)

test/foundation/race_condition_test.exs - 9 occurrences
test/foundation/monitor_leak_test.exs - 7 occurrences
test/foundation/telemetry/load_test_test.exs - 5 occurrences
test/foundation/telemetry/sampler_test.exs - 3 occurrences
test/mabeam/agent_registry_test.exs - 1 occurrence
test/telemetry_performance_comparison.exs - 1 occurrence

Other Anti-patterns

Raw spawning: 13+ files using spawn/1 without supervision
GenServer.call without timeouts: 17+ files
Direct state access: :sys.get_state/1 usage
Missing isolation: Tests using global processes
Resource leaks: ETS tables and processes not cleaned up

Stage 2 Deliverables

2.1 Process.sleep Elimination

Priority: CRITICAL
Time Estimate: 5 days

Step 1: Create Migration Guide

Location: test/SLEEP_MIGRATION_GUIDE.md

# Process.sleep Migration Guide

## Why This Matters
Process.sleep creates flaky tests that:
- Fail randomly under load
- Waste time on fast systems  
- Hide real race conditions
- Make CI unreliable

## Migration Patterns

### Pattern 1: Waiting for Process Restart
**Symptom**: Sleep after killing a process to wait for supervisor restart

#### Before:
```elixir
Process.exit(manager_pid, :kill)
Process.sleep(200)  # Hope it restarted
new_pid = Process.whereis(MyServer)
```

After:

import Foundation.AsyncTestHelpers

old_pid = manager_pid
Process.exit(old_pid, :kill)

# Wait for supervisor to start new process
new_pid = wait_for(fn ->
  case Process.whereis(MyServer) do
    pid when is_pid(pid) and pid != old_pid -> pid
    _ -> nil
  end
end, 5000)  # 5 second timeout

assert new_pid != old_pid

Pattern 2: Waiting for State Change

Symptom: Sleep after triggering action, then check state

Before:

CircuitBreaker.record_failure(service)
Process.sleep(50)  # Wait for state update
{:ok, :open} = CircuitBreaker.get_status(service)

After:

import Foundation.AsyncTestHelpers

CircuitBreaker.record_failure(service)

# Wait for specific state
wait_for(fn ->
  case CircuitBreaker.get_status(service) do
    {:ok, :open} -> true
    _ -> false
  end
end, 1000)

{:ok, :open} = CircuitBreaker.get_status(service)

Pattern 3: Rate Limit Window Expiry

Symptom: Sleep to wait for time window to pass

Before:

# Use up rate limit
for _ <- 1..5, do: RateLimiter.check(key)
Process.sleep(60)  # Wait for window
assert :ok = RateLimiter.check(key)

After:

import Foundation.AsyncTestHelpers

# Use up rate limit
for _ <- 1..5, do: RateLimiter.check(key)

# Wait for window to actually expire
wait_for(fn ->
  case RateLimiter.check(key) do
    :ok -> true
    {:error, :rate_limited} -> false
  end
end, 100)  # Should be quick

assert :ok = RateLimiter.check(key)

Pattern 4: Telemetry Events

Symptom: Sleep hoping telemetry event was emitted

Before:

MyModule.do_work()
Process.sleep(50)
# Manually check telemetry was called

After:

import Foundation.AsyncTestHelpers

assert_telemetry_event [:my_module, :work_done], %{result: :ok} do
  MyModule.do_work()
end

Pattern 5: Message Processing

Symptom: Sleep to allow GenServer to process messages

Before:

GenServer.cast(server, :do_something)
Process.sleep(50)  # Let it process
assert GenServer.call(server, :get_state) == :expected

After:

# Option 1: Use call instead of cast
result = GenServer.call(server, :do_something_sync)
assert result == :expected

# Option 2: Add sync function
GenServer.cast(server, :do_something)
:ok = GenServer.call(server, :sync)  # Waits for cast to process
assert GenServer.call(server, :get_state) == :expected

# Option 3: Use wait_for
GenServer.cast(server, :do_something)
wait_for(fn ->
  GenServer.call(server, :get_state) == :expected
end)
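
Option 2 relies on a :sync call that the server has to implement. The sketch below shows one way the server side could look; the module SyncSketch.Counter and its messages are hypothetical and only illustrate the idea. It works because messages sent from one process to another are delivered in order, so the call is handled only after every cast the same caller sent before it.

```elixir
defmodule SyncSketch.Counter do
  # Hypothetical GenServer used only to illustrate the :sync pattern.
  use GenServer

  def start_link(initial \\ 0), do: GenServer.start_link(__MODULE__, initial)
  def increment(server), do: GenServer.cast(server, :increment)
  def sync(server), do: GenServer.call(server, :sync)
  def value(server), do: GenServer.call(server, :get_state)

  @impl true
  def init(initial), do: {:ok, initial}

  @impl true
  def handle_cast(:increment, count), do: {:noreply, count + 1}

  @impl true
  # Replying to :sync proves that all earlier messages from this caller were handled
  def handle_call(:sync, _from, count), do: {:reply, :ok, count}
  def handle_call(:get_state, _from, count), do: {:reply, count, count}
end

{:ok, server} = SyncSketch.Counter.start_link()
for _ <- 1..100, do: SyncSketch.Counter.increment(server)
:ok = SyncSketch.Counter.sync(server)
100 = SyncSketch.Counter.value(server)
```

The guarantee covers only casts sent by the calling process. If other processes also cast to the server, fall back to wait_for on an observable condition.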

Pattern 6: Clearing State

Symptom: Sleep to “clear” state between tests

Before:

Process.sleep(10)  # Clear any existing state

After:

# Option 1: Explicit cleanup
:ok = RateLimiter.reset_all()

# Option 2: Wait for specific clean state
wait_for(fn ->
  RateLimiter.get_metrics() == %{requests: 0}
end, 100)

# Option 3: Use isolated test setup
use Foundation.UnifiedTestFoundation, :full_isolation

Special Cases

Load Testing

Load tests may need controlled timing. Use :erlang.yield() instead (Elixir's Process module has no yield function):

# Before
Process.sleep(1)  # Simulate fast operation

# After
:erlang.yield()  # Let the scheduler run other processes

Benchmarking

For benchmarks that need consistent timing:

# Use :timer.tc/1 for measurements instead of sleep
{time, result} = :timer.tc(fn -> do_work() end)
assert time < 1000  # microseconds

Verification

After migrating a test file:

Run it 100 times: for i in {1..100}; do mix test path/to/test.exs || break; done
Run under load: mix test --max-cases 32 path/to/test.exs
Check it’s faster: Compare before/after run times


#### Step 2: Systematic File Migration

For each file, follow this process:

##### File 1: `test/foundation/race_condition_test.exs` (9 sleeps)
**Time: 4 hours**

```elixir
# Add at top of test module
import Foundation.AsyncTestHelpers

# Migration for each sleep pattern:

# BEFORE: Clear state sleep
test "handles concurrent requests correctly" do
  Process.sleep(10)  # Clear any existing state
  
# AFTER: Deterministic state verification  
test "handles concurrent requests correctly" do
  # Wait for clean state
  wait_for(fn ->
    RateLimiter.get_window_count(:test_concurrent, "user1") == 0
  end, 100)

# BEFORE: Window expiry sleep
Process.sleep(60)  # Wait for window to expire

# AFTER: Check actual expiry
wait_for(fn ->
  RateLimiter.window_expired?(:test_window, "user1")
end, 100)

# BEFORE: Burst completion sleep
# Fire requests
Process.sleep(10)  # Let them complete

# AFTER: Wait for actual completion
# Fire requests in tasks
tasks = for i <- 1..20 do
  Task.async(fn -> 
    RateLimiter.check_rate_limit(:burst_test, "user_#{i}")
  end)
end

# Wait for all to complete
results = Task.await_many(tasks)
```

File 2: `test/foundation/monitor_leak_test.exs` (7 sleeps)

Time: 3 hours

# BEFORE: Wait for subscription
send(router, {:subscribe, "test.*", self()})
Process.sleep(50)  # Wait for subscription

# AFTER: Synchronous subscription
:ok = SignalRouter.subscribe_sync(router, "test.*", self())

# BEFORE: Wait for cleanup  
Process.exit(pid, :normal)
Process.sleep(200)  # Wait for cleanup

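# AFTER: Verify cleanup completed
# Note: an exit signal with reason :normal is ignored by processes that do not
# trap exits, so use :shutdown (or stop the process through its own API)
ref = Process.monitor(pid)
Process.exit(pid, :shutdown)
assert_receive {:DOWN, ^ref, :process, ^pid, :shutdown}, 1000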

# Verify cleanup actually happened
wait_for(fn ->
  SignalRouter.get_subscriber_count(router, "test.*") == 99
end)

# BEFORE: Batch operation completion
Enum.each(pids, &Process.exit(&1, :kill))
Process.sleep(300)  # Wait for all cleanup

# AFTER: Monitor all and wait
refs = Enum.map(pids, &Process.monitor/1)
Enum.each(pids, &Process.exit(&1, :kill))

for ref <- refs do
  assert_receive {:DOWN, ^ref, :process, _, :killed}, 1000
end

wait_for(fn ->
  SignalRouter.get_subscriber_count(router, "test.*") == 0
end)

File 3: `test/foundation/telemetry/load_test_test.exs` (5 sleeps)

Time: 2 hours

# These sleeps simulate operation timing - different approach needed

# BEFORE: Simulate fast operation
run: fn _ctx ->
  Process.sleep(1)
  {:ok, :fast_result}
end

# AFTER: Use realistic operations or :erlang.yield/0
# Option 1: Do actual work
run: fn _ctx ->
  _ = Enum.sum(1..100)
  {:ok, :fast_result}
end

# Option 2: Just yield to the scheduler
run: fn _ctx ->
  :erlang.yield()
  {:ok, :fast_result}
end

# BEFORE: Simulate slow operation  
Process.sleep(5)

# AFTER: Do actual work that takes time
run: fn _ctx ->
  # Simulate CPU work
  _ = Enum.reduce(1..10000, 0, fn i, acc ->
    :math.sqrt(i) + acc
  end)
  {:ok, :slow_result}
end

File 4: `test/foundation/telemetry/sampler_test.exs` (3 sleeps)

Time: 2 hours

# BEFORE: Wait for window
Process.sleep(1100)  # Wait for next window

# AFTER: Use time manipulation or wait for window change
import Foundation.AsyncTestHelpers

# Get current window
window1 = Sampler.current_window(:test_adaptive)

# Wait for window to change
wait_for(fn ->
  Sampler.current_window(:test_adaptive) != window1
end, 1500)

# BEFORE: Simulate event rate
for _ <- 1..1000 do
  Sampler.should_sample?([:test, :adaptive])
  Process.sleep(1)  # ~1000 events/sec
end

# AFTER: Use Task.async for concurrency
tasks = for _ <- 1..1000 do
  Task.async(fn ->
    Sampler.should_sample?([:test, :adaptive])
  end)
end

Task.await_many(tasks, 5000)

File 5: `test/mabeam/agent_registry_test.exs` (1 sleep)

Time: 1 hour

# Find and fix the single Process.sleep occurrence
# Similar pattern to above examples

File 6: `test/telemetry_performance_comparison.exs` (1 sleep)

Time: 1 hour

# This might be legitimate for performance comparison
# Consider whether it is actually needed or can be replaced with :erlang.yield()

2.2 Test Isolation Implementation

Priority: HIGH
Time Estimate: 3 days

Step 1: Create Isolation Audit Script

Location: scripts/test_isolation_audit.exs

defmodule TestIsolationAuditor do
  @moduledoc """
  Identifies tests that need isolation improvements.
  Run with: mix run scripts/test_isolation_audit.exs
  """
  
  def run do
    IO.puts("=== Test Isolation Audit ===\n")
    
    test_files = Path.wildcard("test/**/*_test.exs")
    issues = analyze_files(test_files)
    
    generate_report(issues)
    generate_fix_script(issues)
  end
  
  defp analyze_files(files) do
    files
    |> Enum.map(&analyze_file/1)
    |> Enum.reject(fn {_, issues} -> issues == [] end)
    |> Map.new()
  end
  
  defp analyze_file(file) do
    content = File.read!(file)

    # Rebinding `issues` inside an `if` block does not escape the block in
    # Elixir, so each check yields a list and the lists are concatenated.
    global_processes =
      ~r/Process\.whereis\(([\w\.:]+)\)/
      |> Regex.scan(content)
      |> Enum.map(fn [_, process] -> process end)

    spawn_count = length(Regex.scan(~r/spawn\(/, content))

    ets_leak? =
      String.contains?(content, ":ets.new") and
        not String.contains?(content, ":ets.delete")

    issues =
      Enum.concat([
        # Check for UnifiedTestFoundation usage
        issue_if(
          not String.contains?(content, "use Foundation.UnifiedTestFoundation"),
          {:missing_foundation, nil}
        ),
        # Check for global process usage
        issue_if(global_processes != [], {:global_process, global_processes}),
        # Check for raw spawn
        issue_if(spawn_count > 0, {:raw_spawn, spawn_count}),
        # Check for :sys.get_state
        issue_if(String.contains?(content, ":sys.get_state"), {:sys_get_state, nil}),
        # Check for ETS without cleanup
        issue_if(ets_leak?, {:ets_leak, nil})
      ])

    {file, issues}
  end

  # Wraps an issue in a list when its condition holds
  defp issue_if(true, issue), do: [issue]
  defp issue_if(false, _issue), do: []
  
  defp generate_report(issues) do
    IO.puts("Found #{map_size(issues)} files with isolation issues:\n")
    
    Enum.each(issues, fn {file, file_issues} ->
      IO.puts("#{file}:")
      Enum.each(file_issues, &print_issue/1)
      IO.puts("")
    end)
    
    IO.puts("\nSummary:")
    IO.puts("- Files needing UnifiedTestFoundation: #{count_issue(issues, :missing_foundation)}")
    IO.puts("- Files using global processes: #{count_issue(issues, :global_process)}")
    IO.puts("- Files with raw spawn: #{count_issue(issues, :raw_spawn)}")
    IO.puts("- Files using :sys.get_state: #{count_issue(issues, :sys_get_state)}")
    IO.puts("- Files with potential ETS leaks: #{count_issue(issues, :ets_leak)}")
  end
  
  defp print_issue({:missing_foundation, _}) do
    IO.puts("  ❌ Not using Foundation.UnifiedTestFoundation")
  end
  
  defp print_issue({:global_process, processes}) do
    IO.puts("  ❌ Using global processes: #{Enum.join(processes, ", ")}")
  end
  
  defp print_issue({:raw_spawn, count}) do
    IO.puts("  ❌ Raw spawn usage: #{count} occurrences")
  end
  
  defp print_issue({:sys_get_state, _}) do
    IO.puts("  ❌ Using :sys.get_state (breaks encapsulation)")
  end
  
  defp print_issue({:ets_leak, _}) do
    IO.puts("  ❌ ETS table creation without cleanup")
  end
  
  defp count_issue(issues, type) do
    issues
    |> Enum.count(fn {_, file_issues} ->
      Enum.any?(file_issues, fn {issue_type, _} -> issue_type == type end)
    end)
  end
  
  defp generate_fix_script(issues) do
    File.write!("test_isolation_fixes.exs", """
    # Auto-generated test isolation fixes
    # Review each change before applying
    
    defmodule TestIsolationFixer do
      def fix_all do
        #{Enum.map_join(issues, "\n    ", &generate_fix_call/1)}
      end
      
      #{Enum.map_join(issues, "\n  ", &generate_fix_function/1)}
    end
    
    TestIsolationFixer.fix_all()
    """)
    
    IO.puts("\nGenerated test_isolation_fixes.exs - Review before running!")
  end
  
  defp generate_fix_call({file, _issues}) do
    ~s|fix_file("#{file}")|
  end
  
  defp generate_fix_function({file, issues}) do
    """
    def fix_file("#{file}") do
      content = File.read!("#{file}")
      
      #{Enum.map_join(issues, "\n    ", &generate_fix_for_issue/1)}
      
      File.write!("#{file}", content)
      IO.puts("Fixed: #{file}")
    end
    """
  end
  
  defp generate_fix_for_issue({:missing_foundation, _}) do
    """
    # Add UnifiedTestFoundation
    content = Regex.replace(
      ~r/use ExUnit\.Case(, async: true)?/,
      content,
      "use Foundation.UnifiedTestFoundation, :registry"
    )
    """
  end
  
  defp generate_fix_for_issue(_), do: "# Manual fix needed"
end

# Run the audit
TestIsolationAuditor.run()

Step 2: Migration Patterns for Test Isolation

Pattern 1: Adding UnifiedTestFoundation

Files affected: All test files not using it

# BEFORE:
defmodule MyTest do
  use ExUnit.Case, async: true
  
  setup do
    # Manual setup
    {:ok, pid} = GenServer.start_link(MyServer, [])
    {:ok, server: pid}
  end

# AFTER:
defmodule MyTest do
  use Foundation.UnifiedTestFoundation, :registry
  
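  setup %{registry: registry} do
    # Start the server and register it by name in the isolated registry
    # (Registry.register/3 would register the test process itself)
    name = {:via, Registry, {registry, MyServer}}
    {:ok, pid} = GenServer.start_link(MyServer, [], name: name)
    {:ok, server: pid}
  end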

Pattern 2: Replacing Global Process Access

Files affected: Tests using Process.whereis

# BEFORE:
test "interacts with global process" do
  pid = Process.whereis(Foundation.SomeServer)
  GenServer.call(pid, :action)
end

# AFTER:
test "interacts with isolated process", %{test_supervisor: supervisor} do
  # Start isolated instance
  {:ok, pid} = TestSupervisor.start_child(
    supervisor,
    {Foundation.SomeServer, name: unique_name()}
  )
  
  GenServer.call(pid, :action)
end

# Helper function
defp unique_name do
  :"#{__MODULE__}_#{System.unique_integer()}"
end

Pattern 3: Supervised Test Processes

Files affected: Tests using raw spawn

Create helper module test/support/supervised_test_process.ex:

defmodule Foundation.SupervisedTestProcess do
  @moduledoc """
  Replaces raw spawn with supervised processes in tests.
  """
  
  use GenServer
  
  def spawn_supervised(fun, opts \\ []) do
    supervisor = Keyword.get(opts, :supervisor, Foundation.TestSupervisor)
    
    child_spec = %{
      id: make_ref(),
      start: {__MODULE__, :start_link, [[fun: fun]]},
      restart: :temporary
    }
    
    # Returns {:ok, pid} on success or the supervisor's error tuple
    DynamicSupervisor.start_child(supervisor, child_spec)
  end
  
  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts)
  end
  
  @impl true
  def init(opts) do
    fun = Keyword.fetch!(opts, :fun)
    
    # Trap exits so the linked task's termination arrives as an {:EXIT, ...} message
    Process.flag(:trap_exit, true)
    
    # Run the function in a separate linked process to isolate crashes
    {:ok, task} = Task.start_link(fun)
    
    {:ok, %{task: task}}
  end
  
  @impl true
  def handle_info({:EXIT, task, reason}, %{task: task} = state) do
    # Task completed or crashed; stop with the same reason
    {:stop, reason, state}
  end
  
  def handle_info(message, %{task: task} = state) do
    # Forward any other message to the process running the function
    send(task, message)
    {:noreply, state}
  end
end

Usage in tests:

# BEFORE:
test "spawns process" do
  pid = spawn(fn ->
    receive do
      :stop -> :ok
    end
  end)
  
  send(pid, :stop)
end

# AFTER:
test "spawns supervised process", %{test_supervisor: supervisor} do
  {:ok, pid} = Foundation.SupervisedTestProcess.spawn_supervised(
    fn ->
      receive do
        :stop -> :ok
      end
    end,
    supervisor: supervisor
  )
  
  send(pid, :stop)
  
  # Process is automatically cleaned up
end

Pattern 4: Replacing :sys.get_state

Files affected: monitor_leak_test.exs and others

# BEFORE:
test "checks internal state" do
  state = :sys.get_state(server)
  assert map_size(state.connections) == 5
end

# AFTER:
# Option 1: Add test-specific API
defmodule MyServer do
  # In the actual server module
  def get_connection_count(server) do
    GenServer.call(server, :get_connection_count)
  end
  
  def handle_call(:get_connection_count, _from, state) do
    {:reply, map_size(state.connections), state}
  end
end

test "checks connection count" do
  assert MyServer.get_connection_count(server) == 5
end

# Option 2: Use debug mode in tests
test "checks state", %{server: server} do
  # Enable debug mode for this test only
  :sys.replace_state(server, fn state ->
    put_in(state.test_mode, true)
  end)
  
  {:ok, test_state} = GenServer.call(server, :get_test_state)
  assert map_size(test_state.connections) == 5
end

2.3 Deterministic Test Patterns

Priority: HIGH
Time Estimate: 2 days

Create Comprehensive Test Pattern Library

Location: test/support/deterministic_patterns.ex

defmodule Foundation.DeterministicPatterns do
  @moduledoc """
  Common patterns for deterministic testing without timing dependencies.
  Import this module in tests that need deterministic coordination.
  """
  
  import ExUnit.Assertions
  import Foundation.AsyncTestHelpers
  
  @doc """
  Waits for a GenServer to process all pending messages.
  The server must implement a handle_call clause for {:sync, ref}.
  """
  def sync_genserver(server, timeout \\ 5000) do
    ref = make_ref()
    GenServer.call(server, {:sync, ref}, timeout)
  end
  
  @doc """
  Starts multiple processes and waits for all to be ready.
  Each process should send {:ready, self()} when initialized.
  """
  def start_and_sync_processes(specs, timeout \\ 5000) do
    parent = self()
    
    pids = Enum.map(specs, fn spec ->
      {:ok, pid} = start_process(spec, parent)
      pid
    end)
    
    # Wait for all ready signals
    Enum.each(pids, fn pid ->
      assert_receive {:ready, ^pid}, timeout
    end)
    
    pids
  end
  
  @doc """
  Coordinates multiple concurrent operations with deterministic ordering.
  """
  def coordinate_concurrent(operations, opts \\ []) do
    timeout = Keyword.get(opts, :timeout, 5000)
    ordered = Keyword.get(opts, :ordered, false)
    
    if ordered do
      # Sequential execution
      Enum.map(operations, fn op -> op.() end)
    else
      # Parallel with synchronization
      tasks = Enum.map(operations, fn op ->
        Task.async(op)
      end)
      
      Task.await_many(tasks, timeout)
    end
  end
  
  @doc """
  Tests rate limiting deterministically without time dependencies.
  """
  def test_rate_limit_window(rate_limiter, key, limit, window_ms) do
    # Clear any existing state
    :ok = RateLimiter.reset(rate_limiter, key)
    
    # Test 1: Exactly at limit
    results = for _ <- 1..limit do
      RateLimiter.check_rate_limit(rate_limiter, key)
    end
    assert Enum.all?(results, &(&1 == :ok))
    
    # Test 2: Over limit
    assert {:error, :rate_limited} = 
      RateLimiter.check_rate_limit(rate_limiter, key)
    
    # Test 3: Wait for window expiry using wait_for
    wait_for(
      fn ->
        case RateLimiter.check_rate_limit(rate_limiter, key) do
          :ok -> true
          _ -> false
        end
      end,
      window_ms + 100  # Small buffer
    )
  end
  
  @doc """
  Verifies message routing without timing assumptions.
  """
  def assert_routed_message(router, pattern, message, timeout \\ 1000) do
    # Subscribe first
    :ok = Router.subscribe(router, pattern, self())
    
    # Send message
    :ok = Router.route(router, message)
    
    # Assert receipt
    assert_receive ^message, timeout
  end
  
  @doc """
  Tests process monitoring and cleanup deterministically.
  """
  def test_monitor_cleanup(monitoring_process, monitored_pids) do
    # Monitor all processes
    refs = Enum.map(monitored_pids, fn pid ->
      GenServer.call(monitoring_process, {:monitor, pid})
    end)
    
    # Kill all monitored processes
    Enum.each(monitored_pids, fn pid ->
      Process.exit(pid, :kill)
    end)
    
    # Wait for all DOWN messages to be processed
    wait_for(fn ->
      GenServer.call(monitoring_process, :get_monitor_count) == 0
    end)
    
    # Verify cleanup
    assert GenServer.call(monitoring_process, :get_monitor_count) == 0
  end
  
  @doc """
  Barrier synchronization for multiple processes.
  All processes must reach the barrier before any can continue.
  """
  def barrier_sync(processes, barrier_name \\ :test_barrier) do
    parent = self()
    count = length(processes)
    
    # Send barrier instruction to all
    Enum.each(processes, fn pid ->
      send(pid, {:barrier, barrier_name, parent, count})
    end)
    
    # Collect ready signals
    ready_pids = for _ <- 1..count do
      assert_receive {:barrier_ready, ^barrier_name, pid}, 5000
      pid
    end
    
    # Release all processes
    Enum.each(ready_pids, fn pid ->
      send(pid, {:barrier_release, barrier_name})
    end)
    
    :ok
  end
  
  @doc """
  Test helper for verifying telemetry events with specific data.
  """
  def capture_telemetry_events(event_names, fun) do
    test_pid = self()
    ref = make_ref()
    
    handler_ids = Enum.map(event_names, fn event_name ->
      handler_id = {__MODULE__, ref, event_name}
      
      :telemetry.attach(
        handler_id,
        event_name,
        fn name, measurements, metadata, _ ->
          send(test_pid, {:telemetry_event, ref, name, measurements, metadata})
        end,
        nil
      )
      
      handler_id
    end)
    
    try do
      fun.()
      
      # Collect all events
      collect_telemetry_events(ref, length(event_names))
    after
      # Clean up handlers
      Enum.each(handler_ids, &:telemetry.detach/1)
    end
  end
  
  defp collect_telemetry_events(ref, count, timeout \\ 1000) do
    for _ <- 1..count do
      receive do
        {:telemetry_event, ^ref, name, measurements, metadata} ->
          {name, measurements, metadata}
      after
        timeout -> nil
      end
    end
    |> Enum.reject(&is_nil/1)
  end
  
  @doc """
  Helper for testing ETS-based operations deterministically.
  """
  def with_test_ets(fun, opts \\ []) do
    table_name = Keyword.get(opts, :name, :test_ets)
    table_opts = Keyword.get(opts, :table_opts, [:set, :public])
    
    table = :ets.new(table_name, table_opts)
    
    try do
      fun.(table)
    after
      :ets.delete(table)
    end
  end
  
  @doc """
  Deterministic testing of supervisor restart behavior.
  """
  def test_supervisor_restart(supervisor, child_spec, crash_fun) do
    # Start child
    {:ok, pid1} = DynamicSupervisor.start_child(supervisor, child_spec)
    
    # Monitor it
    ref = Process.monitor(pid1)
    
    # Cause crash
    crash_fun.(pid1)
    
    # Wait for death
    assert_receive {:DOWN, ^ref, :process, ^pid1, _reason}, 5000
    
    # Wait for restart
    wait_for(fn ->
      case DynamicSupervisor.which_children(supervisor) do
        [{_, pid, _, _}] when is_pid(pid) and pid != pid1 -> pid
        _ -> nil
      end
    end)
  end
  
  # Private helpers
  
  defp start_process({module, args}, parent) do
    # Start process with parent notification
    {:ok, _pid} = module.start_link(args ++ [notify: parent])
  end
end
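
The barrier_sync/2 helper above defines only the coordinator side of the protocol. The following sketch shows how a participating process could respond. Only the message shapes ({:barrier, name, parent, count}, {:barrier_ready, name, pid}, {:barrier_release, name}) come from the helper; the module BarrierWorkerSketch and the {:passed_barrier, ...} notification are hypothetical.

```elixir
defmodule BarrierWorkerSketch do
  # Hypothetical worker that cooperates with a barrier coordinator.
  def start_link(test_pid) do
    Task.start_link(fn -> loop(test_pid) end)
  end

  defp loop(test_pid) do
    receive do
      {:barrier, name, coordinator, _count} ->
        # Announce arrival at the barrier
        send(coordinator, {:barrier_ready, name, self()})

        # Block until the coordinator releases this specific barrier
        receive do
          {:barrier_release, ^name} ->
            send(test_pid, {:passed_barrier, name, self()})
        end

        loop(test_pid)

      :stop ->
        :ok
    end
  end
end
```

A test can start several such workers, run the barrier, and then assert that each worker reports passing it only after the release was sent.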

Usage Examples for Teams

Create test/examples/deterministic_test_example.exs:

defmodule DeterministicTestExample do
  use Foundation.UnifiedTestFoundation, :full_isolation
  import Foundation.DeterministicPatterns
  
  describe "rate limiter without Process.sleep" do
    test "handles burst traffic deterministically", %{test_context: ctx} do
      {:ok, limiter} = start_rate_limiter(ctx)
      
      # Test rate limit deterministically
      test_rate_limit_window(limiter, "user1", 10, 100)
    end
  end
  
  describe "concurrent operations" do
    test "coordinates multiple agents", %{test_context: ctx} do
      # Start multiple agents
      [agent1, agent2, agent3] = start_and_sync_processes([
        {Agent1, [context: ctx]},
        {Agent2, [context: ctx]},
        {Agent3, [context: ctx]}
      ])
      
      # Coordinate operations
      results = coordinate_concurrent([
        fn -> Agent1.process(agent1) end,
        fn -> Agent2.process(agent2) end,
        fn -> Agent3.process(agent3) end
      ])
      
      assert length(results) == 3
    end
  end
  
  describe "telemetry events" do
    test "captures all events in order" do
      events = capture_telemetry_events(
        [[:my_app, :start], [:my_app, :process], [:my_app, :complete]],
        fn ->
          MyApp.do_complex_operation()
        end
      )
      
      assert length(events) == 3
      assert {[:my_app, :start], _, _} = hd(events)
    end
  end
end

2.4 Create Test Migration Script

Priority: MEDIUM
Time Estimate: 1 day

Location: scripts/migrate_tests.exs

defmodule TestMigrator do
  @moduledoc """
  Automated test migration tool.
  Handles common patterns, flags complex cases for manual review.
  """
  
  def run do
    files = find_test_files_with_issues()
    
    Enum.each(files, fn file ->
      IO.puts("Migrating: #{file}")
      migrate_file(file)
    end)
    
    generate_report()
  end
  
  defp find_test_files_with_issues do
    Path.wildcard("test/**/*_test.exs")
    |> Enum.filter(&has_issues?/1)
  end
  
  defp has_issues?(file) do
    content = File.read!(file)
    
    String.contains?(content, "Process.sleep") or
    String.contains?(content, "spawn(") or
    String.contains?(content, ":sys.get_state") or
    not String.contains?(content, "Foundation.UnifiedTestFoundation")
  end
  
  defp migrate_file(file) do
    content = File.read!(file)
    original = content
    
    # Apply migrations
    content = content
    |> add_imports()
    |> migrate_sleeps()
    |> migrate_spawns()
    |> migrate_sys_get_state()
    |> add_unified_foundation()
    
    if content != original do
      # Backup original
      File.write!("#{file}.backup", original)
      
      # Write migrated version
      File.write!(file, content)
      
      # Try to format
      System.cmd("mix", ["format", file])
      
      log_migration(file, original, content)
    end
  end
  
  defp add_imports(content) do
    if String.contains?(content, "Process.sleep") and 
       not String.contains?(content, "import Foundation.AsyncTestHelpers") do
      # Add import after the first module declaration
      # (module names may contain dots, e.g. Foundation.RaceConditionTest)
      Regex.replace(
        ~r/(defmodule [\w.]+ do\n)/,
        content,
        "\\1  import Foundation.AsyncTestHelpers\n",
        global: false
      )
    else
      content
    end
  end
  
  defp migrate_sleeps(content) do
    content
    |> migrate_simple_sleeps()
    |> migrate_window_sleeps()
    |> migrate_state_sleeps()
  end
  
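  defp migrate_simple_sleeps(content) do
    # Process.sleep(n) followed by assertion.
    # The `assert` keyword is dropped: wait_for polls for a truthy value, and a
    # failing assert would raise on the first attempt instead of retrying.
    # Pattern-match assertions (assert {:ok, _} = ...) still need manual review.
    Regex.replace(
      ~r/Process\.sleep\(\d+\)\s*\n\s*assert (.+)/,
      content,
      "wait_for(fn -> \\1 end)"
    )
  end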
  
  defp migrate_window_sleeps(content) do
    # Common rate limit pattern
    Regex.replace(
      ~r/Process\.sleep\((\d+)\)\s*#\s*[Ww]ait for window/,
      content,
      "wait_for(fn -> RateLimiter.window_expired?(key) end, \\1 + 100)"
    )
  end
  
  defp migrate_state_sleeps(content) do
    # Mark complex cases for manual review
    Regex.replace(
      ~r/Process\.sleep\((\d+)\)/,
      content,
      "# TODO: Review Process.sleep(\\1) migration\n    Process.sleep(\\1)"
    )
  end
  
  defp migrate_spawns(content) do
    Regex.replace(
      ~r/spawn\(fn ->/,
      content,
      "# TODO: Use SupervisedTestProcess\n    spawn(fn ->"
    )
  end
  
  defp migrate_sys_get_state(content) do
    Regex.replace(
      ~r/:sys\.get_state\(/,
      content,
      "# TODO: Replace with test API\n    :sys.get_state("
    )
  end
  
  defp add_unified_foundation(content) do
    if not String.contains?(content, "Foundation.UnifiedTestFoundation") do
      Regex.replace(
        ~r/use ExUnit\.Case(, async: \w+)?/,
        content,
        "use Foundation.UnifiedTestFoundation, :registry"
      )
    else
      content
    end
  end
  
  defp log_migration(file, original, migrated) do
    changes = diff_summary(original, migrated)
    
    File.write!(
      "test_migration_log.md",
      """
      ## #{file}
      
      Changes:
      #{changes}
      
      ---
      
      """,
      [:append]
    )
  end
  
  defp diff_summary(original, migrated) do
    sleep_before = count_pattern(original, "Process.sleep")
    sleep_after = count_pattern(migrated, "Process.sleep")
    
    """
    - Process.sleep: #{sleep_before} -> #{sleep_after}
    - Added imports: #{String.contains?(migrated, "import Foundation")}
    - Added UnifiedTestFoundation: #{String.contains?(migrated, "UnifiedTestFoundation")}
    - TODOs added: #{count_pattern(migrated, "TODO:")}
    """
  end
  
  defp count_pattern(content, pattern) do
    content
    |> String.split(pattern)
    |> length()
    |> Kernel.-(1)
  end
  
  defp generate_report do
    IO.puts("""
    
    Migration complete!
    
    Next steps:
    1. Review test_migration_log.md for all changes
    2. Search for "TODO:" comments and fix manually
    3. Run tests to ensure they still pass
    4. Check for flaky tests by running multiple times
    5. Remove .backup files after verification
    """)
  end
end

# Run migration
TestMigrator.run()
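
Because the migrator rewrites files in place, it can help to preview a rewrite on a string before touching the disk. The sketch below applies one sleep-then-assert rewrite to an in-memory source and reports whether anything changed. The module MigrationDryRunSketch and its function names are hypothetical, and the regex is an assumption about how the simple-sleep rule could be expressed; it is not part of the script above.

```elixir
defmodule MigrationDryRunSketch do
  # Hypothetical dry-run helper: rewrites a source string without writing any file.
  def rewrite_sleep_then_assert(source) do
    # A sleep directly followed by an assertion becomes a polling wait on the
    # asserted expression (the `assert` keyword itself is dropped).
    Regex.replace(
      ~r/Process\.sleep\(\d+\)\s*\n\s*assert (.+)/,
      source,
      "wait_for(fn -> \\1 end)"
    )
  end

  def preview(source) do
    migrated = rewrite_sleep_then_assert(source)
    %{changed?: migrated != source, migrated: migrated}
  end
end

source = "  Process.sleep(50)\n  assert Counter.value(pid) == 1\n"
%{changed?: true, migrated: migrated} = MigrationDryRunSketch.preview(source)
IO.puts(migrated)
```

Reviewing the preview output for a few representative files before running the full migration keeps surprises out of the backup-and-overwrite step.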

Verification Process

After Each File Migration

Run single test multiple times:

# Run 100 times to check for flakes
for i in {1..100}; do
  mix test path/to/test.exs || break
done

Run under load:

# Stress test with parallelism
mix test path/to/test.exs --max-cases 32 --seed 0

Measure improvement:

# Before (with sleeps)
time mix test path/to/test.exs

# After (deterministic)
time mix test path/to/test.exs

# Should see significant speedup

Full Test Suite Verification

Run Credo checks:

mix credo --strict
# Process.sleep violations should decrease

Run all tests:

# Full suite
mix test

# With coverage
mix test --cover

CI verification:

# Push to branch and check CI passes
git add .
git commit -m "Stage 2: Remove Process.sleep from tests"
git push origin test-cleanup-stage-2

Common Issues & Solutions

Issue: wait_for/1 timeouts

Solution: Increase the timeout, or check that the condition is actually achievable

# Increase timeout for slow operations
wait_for(fn -> condition end, 10_000)  # 10 seconds

Issue: Test is genuinely time-dependent

Solution: Use deterministic time control

# For rate limiters, provide time injection
defmodule RateLimiter do
  def check_rate_limit(key, opts \\ []) do
    now = Keyword.get(opts, :now, System.system_time(:millisecond))
    # ... use injected time
  end
end
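
To make the time-injection idea concrete, here is a small, purely functional fixed-window limiter whose clock can be supplied by the caller. The module InjectedClockLimiter, its struct fields, and its return shape are hypothetical and are not the Foundation RateLimiter API; the point is only that a test can advance time by passing a different :now value instead of sleeping.

```elixir
defmodule InjectedClockLimiter do
  # Hypothetical fixed-window limiter with an injectable clock.
  defstruct limit: 5, window_ms: 60, window_start: nil, count: 0

  def new(limit, window_ms), do: %__MODULE__{limit: limit, window_ms: window_ms}

  # Returns {result, updated_limiter}; :now defaults to the monotonic clock
  def check(%__MODULE__{} = limiter, opts \\ []) do
    now = Keyword.get_lazy(opts, :now, fn -> System.monotonic_time(:millisecond) end)
    limiter = roll_window(limiter, now)

    if limiter.count < limiter.limit do
      {:ok, %{limiter | count: limiter.count + 1}}
    else
      {{:error, :rate_limited}, limiter}
    end
  end

  # First request opens the window
  defp roll_window(%{window_start: nil} = limiter, now),
    do: %{limiter | window_start: now, count: 0}

  # Window has elapsed: start a new one and reset the counter
  defp roll_window(%{window_start: start, window_ms: window_ms} = limiter, now)
       when now - start >= window_ms,
       do: %{limiter | window_start: now, count: 0}

  defp roll_window(limiter, _now), do: limiter
end

limiter = InjectedClockLimiter.new(2, 60)
{:ok, limiter} = InjectedClockLimiter.check(limiter, now: 0)
{:ok, limiter} = InjectedClockLimiter.check(limiter, now: 10)
{{:error, :rate_limited}, limiter} = InjectedClockLimiter.check(limiter, now: 20)
{:ok, _limiter} = InjectedClockLimiter.check(limiter, now: 60)
```

With the clock injected, window expiry is tested by arithmetic on the :now value, so the test neither sleeps nor polls.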

Issue: Complex async coordination

Solution: Use the DeterministicPatterns helpers

import Foundation.DeterministicPatterns

# Use barriers, coordinate_concurrent, etc.

Success Criteria

Stage 2 is complete when:

✅ 0 Process.sleep calls remaining (down from 26)
✅ All tests use UnifiedTestFoundation where appropriate
✅ No :sys.get_state usage
✅ No raw spawn in tests
✅ Test suite runs 30-50% faster
✅ No flaky tests in CI (run 10 times successfully)
✅ All tests pass consistently

Next Steps

After completing Stage 2:

Verify all tests pass reliably
Check test execution time improvement
Update CI to enforce no Process.sleep
Proceed to Stage 3 for production code fixes
Consider adding property-based tests
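
For the CI enforcement step, a dependency-free check could complement the Credo rules. The sketch below scans test scripts for Process.sleep calls and reports their locations. The module SleepGuardSketch, the default glob, and the output format are assumptions, not existing project tooling.

```elixir
defmodule SleepGuardSketch do
  # Hypothetical CI guard: lists {file, line_number} for each Process.sleep call.
  def violations(glob \\ "test/**/*.exs") do
    for file <- Path.wildcard(glob),
        {line, number} <- file |> File.read!() |> String.split("\n") |> Enum.with_index(1),
        String.contains?(line, "Process.sleep("),
        # Skip lines that are entirely comments
        not String.starts_with?(String.trim_leading(line), "#"),
        do: {file, number}
  end

  def run(glob \\ "test/**/*.exs") do
    case violations(glob) do
      [] ->
        :ok

      found ->
        Enum.each(found, fn {file, line} ->
          IO.puts("#{file}:#{line}: Process.sleep found")
        end)

        {:error, length(found)}
    end
  end
end
```

A CI script could call SleepGuardSketch.run() and exit with a non-zero status when it returns an error tuple. The .exs glob leaves helper modules under test/support (which are .ex files) out of the scan.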

Completion Checklist:

Sleep migration guide created
All 26 Process.sleep calls removed
Test isolation audit complete
UnifiedTestFoundation added where needed
Deterministic patterns library created
Migration scripts run
All tests passing
CI verification complete
Performance improvement measured
