Documentation for UNIFIED_TEST_FOUNDATION_buildDoc from the Foundation repository.

Unified Test Foundation Migration Guide

Complete Test Isolation for Supervision Crash Recovery Tests

Document Version: 1.0
Date: 2025-07-02
Context: Fix supervision crash recovery test contamination with proper OTP test isolation
Target: test/jido_foundation/supervision_crash_recovery_test.exs

Problem Statement

Current Issues

The supervision crash recovery tests exhibit test contamination when run together:

✅ All individual tests pass when run alone
❌ Tests fail with (EXIT from #PID<0.95.0>) shutdown when run in batch
Root Cause: Shared global JidoFoundation processes across tests
OTP Violation: Tests hitting production supervision tree instead of isolated test supervisors

Architecture Problems

# CURRENT: Shared global state (BAD)
defmodule JidoFoundation.SupervisionCrashRecoveryTest do
  use ExUnit.Case, async: false  # Still shares global processes!
  
  test "crash recovery" do
    # Kills GLOBAL JidoFoundation.TaskPoolManager
    task_pid = Process.whereis(JidoFoundation.TaskPoolManager) 
    Process.exit(task_pid, :kill)
    # Affects ALL subsequent tests!
  end
end

Problem: Every test kills processes in the same shared supervision tree, leaving it in an unstable state that contaminates subsequent tests.

Solution: Complete Test Isolation

Architecture Overview

# TARGET: Isolated supervision per test (GOOD)
defmodule JidoFoundation.SupervisionCrashRecoveryTest do
  use Foundation.UnifiedTestFoundation, :supervision_testing
  
  test "crash recovery", %{supervision_tree: sup_tree} do
    # Kills ISOLATED TaskPoolManager for this test only
    {:ok, task_pid} = sup_tree |> get_service(:task_pool_manager)
    Process.exit(task_pid, :kill)
    # No effect on other tests!
  end
end

Solution: Each test gets its own supervision tree with isolated JidoFoundation services.
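
A minimal, self-contained sketch of this idea follows. It uses a toy `IsolationSketch.Worker` GenServer as a stand-in for the real JidoFoundation services; every name in the block is hypothetical and not part of the Foundation codebase. Each call to `start_tree/0` creates its own Registry and Supervisor, so killing a worker in one tree should leave the other tree untouched.

```elixir
defmodule IsolationSketch.Worker do
  # Toy stand-in for a JidoFoundation service (hypothetical).
  use GenServer

  def start_link(opts) do
    GenServer.start_link(__MODULE__, Keyword.fetch!(opts, :registry))
  end

  @impl true
  def init(registry) do
    # Register in the per-tree registry so lookups never cross trees.
    {:ok, _owner} = Registry.register(registry, :worker, nil)
    {:ok, registry}
  end
end

defmodule IsolationSketch do
  # Starts one private Registry + Supervisor pair, as a per-test setup would.
  def start_tree do
    id = System.unique_integer([:positive])
    registry = :"isolation_sketch_registry_#{id}"
    {:ok, _registry_pid} = Registry.start_link(keys: :unique, name: registry)

    {:ok, supervisor} =
      Supervisor.start_link(
        [{IsolationSketch.Worker, registry: registry}],
        strategy: :one_for_one
      )

    %{registry: registry, supervisor: supervisor}
  end

  # Returns the worker pid registered in this tree, or nil if none is registered.
  def worker_pid(%{registry: registry}) do
    case Registry.lookup(registry, :worker) do
      [{pid, _value}] -> pid
      [] -> nil
    end
  end
end
```

Two trees started this way share no registered names, which is the property the `:supervision_testing` mode relies on.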

Implementation Plan

Phase 1: Foundation Infrastructure Enhancement

1.1 Create Supervision Testing Mode

File: test/support/unified_test_foundation.ex

# Add new testing mode for supervision crash recovery
defmodule Foundation.UnifiedTestFoundation do
  @doc """
  Enhanced test foundation with supervision testing support.
  
  ## New Mode: :supervision_testing
  
  Provides isolated JidoFoundation supervision trees for crash recovery testing.
  Each test gets its own complete JidoSystem.Supervisor instance.
  """
  
  defmacro __using__(:supervision_testing) do
    quote do
      use ExUnit.Case, async: false  # Supervision tests need serialization
      
      import Foundation.AsyncTestHelpers
      import Foundation.SupervisionTestHelpers  # New module
      
      setup do
        Foundation.SupervisionTestSetup.create_isolated_supervision()
      end
    end
  end
  
  # ... existing modes remain unchanged
end
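
How `use Foundation.UnifiedTestFoundation, :supervision_testing` reaches a mode-specific `__using__/1` clause can be seen in a stripped-down sketch. The module names and injected functions below are hypothetical and exist only to illustrate the dispatch; the real macro injects `use ExUnit.Case`, imports, and a `setup` block instead.

```elixir
defmodule UsingSketch do
  # Hypothetical miniature of a multi-mode test foundation.
  # `use UsingSketch, mode` expands to a call to `__using__(mode)`,
  # so each mode can be handled by its own macro clause.
  defmacro __using__(:supervision_testing) do
    quote do
      # Injected into the calling module at compile time.
      def foundation_mode, do: :supervision_testing
      def async?, do: false
    end
  end

  defmacro __using__(:basic) do
    quote do
      def foundation_mode, do: :basic
      def async?, do: true
    end
  end
end

defmodule UsingSketch.CrashRecoveryCase do
  use UsingSketch, :supervision_testing
end

defmodule UsingSketch.PlainCase do
  use UsingSketch, :basic
end
```

Passing a mode that has no matching clause fails at compile time, which makes a typo in the mode name easy to spot.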

1.2 Create Supervision Test Helpers

File: test/support/supervision_test_helpers.ex

defmodule Foundation.SupervisionTestHelpers do
  @moduledoc """
  Helper functions for testing supervision crash recovery with isolated supervision trees.
  
  Provides utilities for:
  - Creating isolated JidoFoundation supervision trees
  - Accessing services within test supervision context
  - Monitoring process lifecycle in test environment
  - Verifying supervision behavior without global contamination
  """
  
  import ExUnit.Assertions
  import Foundation.AsyncTestHelpers
  
  @doc """
  Get a service PID from the test supervision tree.
  
  ## Examples
  
      test "crash recovery", %{supervision_tree: sup_tree} do
        {:ok, task_pid} = get_service(sup_tree, :task_pool_manager)
        assert is_pid(task_pid)
      end
  """
  def get_service(supervision_context, service_name) do
    service_module = service_name_to_module(service_name)
    
    case Registry.lookup(supervision_context.registry, {:service, service_module}) do
      [{pid, _}] when is_pid(pid) -> {:ok, pid}
      [] -> {:error, :service_not_found}
    end
  end
  
  @doc """
  Wait for a service to restart after crash in isolated supervision tree.
  
  ## Examples
  
      test "restart behavior", %{supervision_tree: sup_tree} do
        {:ok, old_pid} = get_service(sup_tree, :task_pool_manager)
        Process.exit(old_pid, :kill)
        
        {:ok, new_pid} = wait_for_service_restart(sup_tree, :task_pool_manager, old_pid)
        assert new_pid != old_pid
      end
  """
  def wait_for_service_restart(supervision_context, service_name, old_pid, timeout \\ 5000) do
    wait_for(
      fn ->
        case get_service(supervision_context, service_name) do
          {:ok, new_pid} when new_pid != old_pid and is_pid(new_pid) -> 
            {:ok, new_pid}
          _ -> 
            nil
        end
      end,
      timeout
    )
  end
  
  @doc """
  Monitor all processes in supervision tree for proper shutdown cascade testing.
  
  Returns a map of service_name => {pid, monitor_ref} for easy assertion.
  """
  def monitor_all_services(supervision_context) do
    services = [:task_pool_manager, :system_command_manager, :coordination_manager, :scheduler_manager]
    
    for service <- services, into: %{} do
      {:ok, pid} = get_service(supervision_context, service)
      ref = Process.monitor(pid)
      {service, {pid, ref}}
    end
  end
  
  @doc """
  Verify rest_for_one supervision behavior in isolated environment.
  
  ## Examples
  
      test "rest_for_one cascade", %{supervision_tree: sup_tree} do
        monitors = monitor_all_services(sup_tree)
        
        # Kill TaskPoolManager 
        {task_pid, _} = monitors[:task_pool_manager]
        Process.exit(task_pid, :kill)
        
        # Verify cascade: SystemCommandManager + CoordinationManager restart
        # SchedulerManager should NOT restart (starts before TaskPoolManager)
        verify_rest_for_one_cascade(monitors, :task_pool_manager)
      end
  """
  def verify_rest_for_one_cascade(monitors, crashed_service) do
    supervision_order = [:scheduler_manager, :task_pool_manager, :system_command_manager, :coordination_manager]
    crashed_index = Enum.find_index(supervision_order, &(&1 == crashed_service))
    
    # Services started before crashed service should remain alive
    services_before = Enum.take(supervision_order, crashed_index)
    # Services started after crashed service should restart
    services_after = Enum.drop(supervision_order, crashed_index + 1)
    
    # Wait for crashed service DOWN message
    {crashed_pid, crashed_ref} = monitors[crashed_service]
    assert_receive {:DOWN, ^crashed_ref, :process, ^crashed_pid, :killed}, 2000
    
    # Wait for dependent services DOWN messages (supervisor shutdown)
    for service <- services_after do
      {pid, ref} = monitors[service]
      assert_receive {:DOWN, ^ref, :process, ^pid, reason}, 2000
      assert reason in [:shutdown, :killed]
    end
    
    # Verify services before remain alive
    for service <- services_before do
      {pid, _ref} = monitors[service]
      assert Process.alive?(pid), "#{service} should remain alive (started before crashed service)"
    end
    
    :ok
  end
  
  # Private helpers
  
  defp service_name_to_module(:task_pool_manager), do: JidoFoundation.TaskPoolManager
  defp service_name_to_module(:system_command_manager), do: JidoFoundation.SystemCommandManager  
  defp service_name_to_module(:coordination_manager), do: JidoFoundation.CoordinationManager
  defp service_name_to_module(:scheduler_manager), do: JidoFoundation.SchedulerManager
end
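
The helpers above lean on `wait_for/2` from `Foundation.AsyncTestHelpers`, which this guide does not show. The match `{:ok, new_pid} = wait_for_service_restart(...)` only works if `wait_for/2` returns the first truthy result of the polled function. The following is a sketch of a polling helper with that contract; the module name `WaitForSketch`, the third `interval` argument, and raising on timeout are assumptions, not the actual Foundation implementation.

```elixir
defmodule WaitForSketch do
  # Hypothetical stand-in for Foundation.AsyncTestHelpers.wait_for/2.
  # Polls `fun` until it returns a truthy value and returns that value.
  # Raises if `timeout` milliseconds pass without a truthy result.
  def wait_for(fun, timeout \\ 1_000, interval \\ 10) when is_function(fun, 0) do
    deadline = System.monotonic_time(:millisecond) + timeout
    poll(fun, deadline, interval)
  end

  defp poll(fun, deadline, interval) do
    case fun.() do
      result when result in [nil, false] ->
        if System.monotonic_time(:millisecond) >= deadline do
          raise "wait_for timed out"
        else
          Process.sleep(interval)
          poll(fun, deadline, interval)
        end

      result ->
        result
    end
  end
end
```

Polling on a condition rather than sleeping for a fixed time keeps restart checks fast when the supervisor recovers quickly and tolerant when it does not.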

1.3 Create Supervision Test Setup

File: test/support/supervision_test_setup.ex

defmodule Foundation.SupervisionTestSetup do
  @moduledoc """
  Setup infrastructure for isolated supervision testing.
  
  Creates complete JidoFoundation supervision trees in isolation,
  allowing crash recovery tests without global state contamination.
  """
  
  require Logger
  import ExUnit.Callbacks, only: [on_exit: 1]
  import Foundation.AsyncTestHelpers
  
  @doc """
  Create an isolated supervision tree for testing JidoFoundation crash recovery.
  
  Returns supervision context with:
  - Isolated JidoSystem.Supervisor instance
  - Test-specific Registry for service discovery
  - Service PIDs accessible via helper functions
  - Proper cleanup on test exit
  """
  def create_isolated_supervision do
    # Create unique test identifier
    test_id = :erlang.unique_integer([:positive])
    test_registry = :"test_jido_registry_#{test_id}"
    test_supervisor = :"test_jido_supervisor_#{test_id}"
    
    # Create test-specific registry
    {:ok, registry_pid} = Registry.start_link(
      keys: :unique, 
      name: test_registry,
      partitions: 1
    )
    
    # Start isolated JidoFoundation supervision tree
    {:ok, supervisor_pid} = start_isolated_jido_supervisor(test_supervisor, test_registry)
    
    # Wait for all services to be registered and stable
    services = [
      JidoFoundation.TaskPoolManager,
      JidoFoundation.SystemCommandManager,
      JidoFoundation.CoordinationManager,
      JidoFoundation.SchedulerManager
    ]
    
    wait_for_services_ready(test_registry, services)
    
    supervision_context = %{
      test_id: test_id,
      registry: test_registry,
      registry_pid: registry_pid,
      supervisor: test_supervisor,
      supervisor_pid: supervisor_pid,
      services: services
    }
    
    # Setup cleanup
    on_exit(fn ->
      cleanup_isolated_supervision(supervision_context)
    end)
    
    %{supervision_tree: supervision_context}
  end
  
  defp start_isolated_jido_supervisor(supervisor_name, registry_name) do
    # Create isolated version of JidoSystem.Application children
    # but with test-specific registry registration
    children = [
      # 1. State persistence supervisor
      {JidoSystem.Agents.StateSupervisor, name: :"#{supervisor_name}_state_supervisor"},
      
      # 2. Test-specific registries
      {Registry, keys: :unique, name: :"#{supervisor_name}_monitor_registry"},
      {Registry, keys: :unique, name: :"#{supervisor_name}_workflow_registry"},
      
      # 3. Core infrastructure services (isolated instances)
      {JidoSystem.ErrorStore, name: :"#{supervisor_name}_error_store"},
      {JidoSystem.HealthMonitor, name: :"#{supervisor_name}_health_monitor"},
      
      # 4. Manager services with test-specific registration
      {JidoFoundation.SchedulerManager, 
       name: :"#{supervisor_name}_scheduler_manager",
       registry: registry_name},
      {JidoFoundation.TaskPoolManager, 
       name: :"#{supervisor_name}_task_pool_manager", 
       registry: registry_name},
      {JidoFoundation.SystemCommandManager, 
       name: :"#{supervisor_name}_system_command_manager",
       registry: registry_name},
      {JidoFoundation.CoordinationManager, 
       name: :"#{supervisor_name}_coordination_manager",
       registry: registry_name},
      
      # 5. Additional supervisors
      {JidoFoundation.MonitorSupervisor, name: :"#{supervisor_name}_monitor_supervisor"},
      {JidoSystem.Supervisors.WorkflowSupervisor, name: :"#{supervisor_name}_workflow_supervisor"},
      
      # 6. Dynamic supervisor for agents
      {DynamicSupervisor, 
       name: :"#{supervisor_name}_agent_supervisor", 
       strategy: :one_for_one}
    ]
    
    opts = [
      strategy: :rest_for_one,  # Same as production for accurate testing
      name: supervisor_name,
      max_restarts: 3,
      max_seconds: 5
    ]
    
    Supervisor.start_link(children, opts)
  end
  
  defp wait_for_services_ready(registry_name, services) do
    for service_module <- services do
      wait_for(
        fn ->
          case Registry.lookup(registry_name, {:service, service_module}) do
            # Process.alive?/1 is not allowed in guards, so check it in the body
            [{pid, _}] when is_pid(pid) -> if Process.alive?(pid), do: pid
            _ -> nil
          end
        end,
        10_000
      )
    end
  end
  
  defp cleanup_isolated_supervision(context) do
    # Both processes are linked to the test process and may already be
    # terminating when this runs, so tolerate exits from the stop calls.
    safe_stop(fn -> Supervisor.stop(context.supervisor_pid, :normal, 5000) end)
    safe_stop(fn -> GenServer.stop(context.registry_pid, :normal, 5000) end)
    
    # Wait for cleanup to complete
    wait_for(
      fn ->
        not Process.alive?(context.supervisor_pid) and 
        not Process.alive?(context.registry_pid)
      end,
      5000
    )
    
    Logger.debug("Cleaned up isolated supervision tree: #{context.test_id}")
  end
  
  # Runs a stop call and ignores the exit raised if the target is already gone
  defp safe_stop(stop_fun) do
    stop_fun.()
  catch
    :exit, _reason -> :ok
  end
end

Phase 2: Service Registration Enhancement

2.1 Modify JidoFoundation Services for Test Registration

File: lib/jido_foundation/task_pool_manager.ex

defmodule JidoFoundation.TaskPoolManager do
  use GenServer
  
  # Add support for test-specific registry registration
  def start_link(opts \\ []) do
    name = Keyword.get(opts, :name, __MODULE__)
    registry = Keyword.get(opts, :registry, nil)
    
    GenServer.start_link(__MODULE__, {opts, registry}, name: name)
  end
  
  def init({opts, registry}) do
    # Register with test registry if provided
    if registry do
      Registry.register(registry, {:service, __MODULE__}, %{test_instance: true})
    end
    
    # ... existing init logic
  end
  
  # ... rest of module unchanged
end

Apply same pattern to:

lib/jido_foundation/system_command_manager.ex
lib/jido_foundation/coordination_manager.ex
lib/jido_foundation/scheduler_manager.ex
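
The sketch below shows the same optional-registry pattern on a toy service, together with the lookup-after-restart behavior the test helpers depend on. `RegistrationSketch` and all names inside it are hypothetical. The point to observe is that the replacement process started by the supervisor registers itself under the same key, so a lookup eventually returns a new pid.

```elixir
defmodule RegistrationSketch.Service do
  # Hypothetical service following the optional-registry pattern above.
  use GenServer

  def start_link(opts \\ []) do
    name = Keyword.get(opts, :name, __MODULE__)
    registry = Keyword.get(opts, :registry)
    GenServer.start_link(__MODULE__, registry, name: name)
  end

  @impl true
  def init(registry) do
    # Production callers pass no :registry, so nothing changes for them.
    if registry do
      {:ok, _owner} =
        Registry.register(registry, {:service, __MODULE__}, %{test_instance: true})
    end

    {:ok, %{registry: registry}}
  end
end

defmodule RegistrationSketch do
  def lookup(registry) do
    case Registry.lookup(registry, {:service, RegistrationSketch.Service}) do
      [{pid, _meta}] -> {:ok, pid}
      [] -> {:error, :service_not_found}
    end
  end

  # Kills the registered instance and polls until the supervisor's
  # replacement has registered itself under the same key.
  def kill_and_await_restart(registry, attempts \\ 200) do
    {:ok, old_pid} = lookup(registry)
    Process.exit(old_pid, :kill)
    await_new_pid(registry, old_pid, attempts)
  end

  defp await_new_pid(_registry, _old_pid, 0), do: {:error, :timeout}

  defp await_new_pid(registry, old_pid, attempts) do
    case lookup(registry) do
      {:ok, pid} when pid != old_pid ->
        {:ok, pid}

      _ ->
        Process.sleep(10)
        await_new_pid(registry, old_pid, attempts - 1)
    end
  end
end
```

Because registration happens in `init/1`, no extra bookkeeping is needed on restart: the Registry drops the dead process's entry and the new process adds its own.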

Phase 3: Test Migration

3.1 Migrate Supervision Crash Recovery Test

File: test/jido_foundation/supervision_crash_recovery_test.exs

defmodule JidoFoundation.SupervisionCrashRecoveryTest do
  @moduledoc """
  Comprehensive supervision crash recovery tests with complete test isolation.
  
  Uses Foundation.UnifiedTestFoundation :supervision_testing mode to create
  isolated supervision trees per test, eliminating test contamination.
  
  Each test gets its own JidoFoundation services, enabling proper crash
  recovery testing without affecting other tests.
  """
  
  use Foundation.UnifiedTestFoundation, :supervision_testing
  require Logger
  
  # Foundation.SupervisionTestHelpers is already imported by the :supervision_testing mode
  
  @moduletag :supervision_testing
  @moduletag timeout: 30_000
  
  describe "Isolated TaskPoolManager crash recovery" do
    test "TaskPoolManager restarts after crash and maintains functionality", 
         %{supervision_tree: sup_tree} do
      # Get TaskPoolManager from isolated supervision tree
      {:ok, initial_pid} = get_service(sup_tree, :task_pool_manager)
      assert is_pid(initial_pid)
      
      # Test functionality before crash (using isolated instance)
      # Note: Functions now work with isolated services via Registry lookup
      stats = call_isolated_service(sup_tree, :task_pool_manager, :get_all_stats)
      assert is_map(stats)
      
      # Kill the isolated TaskPoolManager process
      Process.exit(initial_pid, :kill)
      
      # Wait for supervisor to restart it with new pid
      {:ok, new_pid} = wait_for_service_restart(sup_tree, :task_pool_manager, initial_pid)
      
      # Verify it restarted with new pid
      assert is_pid(new_pid)
      assert new_pid != initial_pid
      
      # Verify functionality is restored (isolated instance)
      new_stats = call_isolated_service(sup_tree, :task_pool_manager, :get_all_stats)
      assert is_map(new_stats)
      
      # Test that pools can be created and used in isolated environment
      case call_isolated_service(sup_tree, :task_pool_manager, 
             {:execute_batch, [:general, [1, 2, 3], fn x -> x * 2 end, [timeout: 1000]]}) do
        {:ok, stream} ->
          results = Enum.to_list(stream)
          assert length(results) == 3
          assert {:ok, 2} in results
          assert {:ok, 4} in results
          assert {:ok, 6} in results
          
        {:error, :pool_not_found} ->
          # Pool may not be ready yet after restart in isolated env
          :ok
      end
    end
    
    test "TaskPoolManager survives pool supervisor crashes", 
         %{supervision_tree: sup_tree} do
      {:ok, manager_pid} = get_service(sup_tree, :task_pool_manager)
      assert is_pid(manager_pid)
      
      # Create a test pool in isolated environment
      :ok = call_isolated_service(sup_tree, :task_pool_manager, 
        {:create_pool, [:test_crash_pool, %{max_concurrency: 2, timeout: 5000}]})
      
      # Get pool stats to verify it's working (isolated)
      {:ok, stats} = call_isolated_service(sup_tree, :task_pool_manager, 
        {:get_pool_stats, [:test_crash_pool]})
      assert stats.max_concurrency == 2
      
      # Test the pool with batch operation (isolated)
      {:ok, stream} = call_isolated_service(sup_tree, :task_pool_manager,
        {:execute_batch, [:test_crash_pool, [1, 2, 3], fn i -> i * 10 end, [timeout: 2000]]})
      
      results = Enum.to_list(stream)
      success_results = Enum.filter(results, fn
        {:ok, _} -> true
        _ -> false
      end)
      
      assert length(success_results) == 3
      
      # Verify TaskPoolManager is still alive and functional (isolated)
      assert Process.alive?(manager_pid)
      {:ok, final_stats} = call_isolated_service(sup_tree, :task_pool_manager,
        {:get_pool_stats, [:test_crash_pool]})
      assert is_map(final_stats)
    end
  end
  
  describe "rest_for_one supervision strategy testing" do
    test "Service failures cause proper dependent restarts with :rest_for_one",
         %{supervision_tree: sup_tree} do
      # Monitor all services in isolated supervision tree
      monitors = monitor_all_services(sup_tree)
      
      # Kill TaskPoolManager in isolated environment
      {task_pid, _} = monitors[:task_pool_manager]
      Process.exit(task_pid, :kill)
      
      # Verify rest_for_one cascade behavior in isolation
      verify_rest_for_one_cascade(monitors, :task_pool_manager)
      
      # Verify services are functioning after restart in isolated environment
      {:ok, _} = get_service(sup_tree, :task_pool_manager)
      {:ok, _} = get_service(sup_tree, :system_command_manager)  
      {:ok, _} = get_service(sup_tree, :coordination_manager)
      
      # SchedulerManager should have same PID (not restarted)
      {original_sched_pid, _} = monitors[:scheduler_manager]
      {:ok, current_sched_pid} = get_service(sup_tree, :scheduler_manager)
      assert original_sched_pid == current_sched_pid
    end
    
    test "Multiple simultaneous crashes don't bring down the system",
         %{supervision_tree: sup_tree} do
      # Get initial service PIDs from isolated tree
      {:ok, task_pid} = get_service(sup_tree, :task_pool_manager)
      {:ok, sys_pid} = get_service(sup_tree, :system_command_manager)
      {:ok, sched_pid} = get_service(sup_tree, :scheduler_manager)
      
      # Monitor for proper shutdown detection
      task_ref = Process.monitor(task_pid)
      sys_ref = Process.monitor(sys_pid)
      sched_ref = Process.monitor(sched_pid)
      
      # Kill multiple services simultaneously in isolated environment
      Process.exit(task_pid, :kill)
      Process.exit(sys_pid, :kill)
      Process.exit(sched_pid, :kill)
      
      # Wait for DOWN messages
      assert_receive {:DOWN, ^task_ref, :process, ^task_pid, :killed}, 2000
      assert_receive {:DOWN, ^sys_ref, :process, ^sys_pid, :killed}, 2000
      assert_receive {:DOWN, ^sched_ref, :process, ^sched_pid, :killed}, 2000
      
      # Wait for all services to restart in isolated environment
      {:ok, new_task_pid} = wait_for_service_restart(sup_tree, :task_pool_manager, task_pid, 8000)
      {:ok, new_sys_pid} = wait_for_service_restart(sup_tree, :system_command_manager, sys_pid, 8000)
      {:ok, new_sched_pid} = wait_for_service_restart(sup_tree, :scheduler_manager, sched_pid, 8000)
      
      # Verify all services restarted with new PIDs
      assert new_task_pid != task_pid
      assert new_sys_pid != sys_pid  
      assert new_sched_pid != sched_pid
      
      # Verify functionality is restored in isolated environment
      stats = call_isolated_service(sup_tree, :task_pool_manager, :get_all_stats)
      case stats do
        {:ok, _stats} -> :ok
        stats when is_map(stats) -> :ok
        _other -> flunk("Could not get TaskPoolManager stats in isolated environment")
      end
    end
  end
  
  describe "Process leak validation in isolated environment" do
    test "No process leaks after service crashes and restarts", 
         %{supervision_tree: sup_tree} do
      initial_count = :erlang.system_info(:process_count)
      
      # Test crash/restart cycle in isolated environment
      {:ok, task_pid} = get_service(sup_tree, :task_pool_manager)
      
      # Monitor the process before killing it
      ref = Process.monitor(task_pid)
      Process.exit(task_pid, :kill)
      
      # Wait for the DOWN message
      assert_receive {:DOWN, ^ref, :process, ^task_pid, :killed}, 1000
      
      # Wait for restart in isolated supervision tree
      {:ok, new_pid} = wait_for_service_restart(sup_tree, :task_pool_manager, task_pid, 5000)
      
      assert is_pid(new_pid)
      assert new_pid != task_pid
      
      # Allow stabilization
      Process.sleep(1000)
      
      final_count = :erlang.system_info(:process_count)
      
      # Process count should be stable (isolated environment)
      assert final_count - initial_count < 20,
             "Process count increased significantly: #{initial_count} -> #{final_count}"
    end
  end
  
  # Helper function to call services in isolated supervision tree.
  # `message` is either a bare atom (e.g. :get_all_stats) or a
  # {function, args} tuple; both are passed to GenServer.call/2 unchanged.
  defp call_isolated_service(sup_tree, service_name, message) do
    {:ok, pid} = get_service(sup_tree, service_name)
    GenServer.call(pid, message)
  end
end
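
The `:rest_for_one` expectations exercised by these tests can also be checked on a toy tree, independent of the JidoFoundation services. In this sketch the child names mirror the start order used in this guide (scheduler first, then task pool, system command, coordination), but `RestForOneSketch` and its children are hypothetical stand-ins.

```elixir
defmodule RestForOneSketch.Child do
  # Hypothetical stand-in for a manager service, registered under `name`.
  use GenServer

  def start_link(name), do: GenServer.start_link(__MODULE__, :ok, name: name)

  # Use the name as the child id so one module can appear several times.
  def child_spec(name) do
    %{id: name, start: {__MODULE__, :start_link, [name]}}
  end

  @impl true
  def init(:ok), do: {:ok, %{}}
end

defmodule RestForOneSketch do
  @order [:sketch_scheduler, :sketch_task_pool, :sketch_system_command, :sketch_coordination]

  def start do
    children = Enum.map(@order, &{RestForOneSketch.Child, &1})
    Supervisor.start_link(children, strategy: :rest_for_one)
  end

  def pids, do: Map.new(@order, &{&1, Process.whereis(&1)})

  # Kills `name` and blocks until every child at or after it in start
  # order is running again under a new pid. Returns the current pids.
  def crash_and_settle(name, initial) do
    Process.exit(initial[name], :kill)
    index = Enum.find_index(@order, &(&1 == name))
    restarted = Enum.drop(@order, index)
    wait_until(fn -> Enum.all?(restarted, &new_pid?(&1, initial)) end, 200)
    pids()
  end

  defp new_pid?(name, initial) do
    pid = Process.whereis(name)
    is_pid(pid) and pid != initial[name]
  end

  defp wait_until(_fun, 0), do: raise("children did not restart in time")

  defp wait_until(fun, attempts) do
    if fun.() do
      :ok
    else
      Process.sleep(10)
      wait_until(fun, attempts - 1)
    end
  end
end
```

After `crash_and_settle(:sketch_task_pool, initial)`, the scheduler should keep its original pid while the three later children come back with new ones.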

Phase 4: Registry Integration

4.1 Service Discovery Helper

File: test/support/isolated_service_discovery.ex

defmodule Foundation.IsolatedServiceDiscovery do
  @moduledoc """
  Service discovery utilities for isolated supervision testing.
  
  Provides transparent access to JidoFoundation services running
  in isolated supervision trees during testing.
  """
  
  @doc """
  Call a JidoFoundation service function in isolated test environment.
  
  Automatically routes calls to the correct isolated service instance
  based on the current test's supervision context.
  
  ## Examples
  
      # Instead of:
      JidoFoundation.TaskPoolManager.get_all_stats()
      
      # Use in isolated tests (pass the full module name used as the registry key):
      call_service(sup_tree, JidoFoundation.TaskPoolManager, :get_all_stats)
      
      # Or with arguments:
      call_service(sup_tree, JidoFoundation.TaskPoolManager, :create_pool, [:test_pool, %{max_concurrency: 4}])
  """
  def call_service(supervision_context, service_module, function, args \\ []) do
    case Registry.lookup(supervision_context.registry, {:service, service_module}) do
      [{pid, _}] when is_pid(pid) ->
        case args do
          [] -> GenServer.call(pid, function)
          _ -> GenServer.call(pid, {function, args})
        end
        
      [] ->
        {:error, {:service_not_found, service_module}}
    end
  end
  
  @doc """
  Cast to a JidoFoundation service in isolated test environment.
  """
  def cast_service(supervision_context, service_module, message) do
    case Registry.lookup(supervision_context.registry, {:service, service_module}) do
      [{pid, _}] when is_pid(pid) ->
        GenServer.cast(pid, message)
        
      [] ->
        {:error, {:service_not_found, service_module}}
    end
  end
end
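
`call_service/4` sends a bare atom when no arguments are given and a `{function, args}` tuple otherwise, so the target GenServer needs `handle_call/3` clauses that match those shapes. This guide does not show the services' callbacks; the sketch below is a hypothetical server that follows the convention, not the actual TaskPoolManager implementation.

```elixir
defmodule CallConventionSketch.PoolServer do
  # Hypothetical service whose handle_call/3 clauses follow the
  # `function` / `{function, args}` message shapes used by call_service/4.
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  @impl true
  def init(:ok), do: {:ok, %{}}

  # No arguments: the message is the bare function atom.
  @impl true
  def handle_call(:get_all_stats, _from, pools) do
    {:reply, pools, pools}
  end

  # With arguments: the message is {function, [arg1, arg2, ...]}.
  def handle_call({:create_pool, [name, config]}, _from, pools) do
    {:reply, :ok, Map.put(pools, name, config)}
  end

  def handle_call({:get_pool_stats, [name]}, _from, pools) do
    reply =
      case Map.fetch(pools, name) do
        {:ok, config} -> {:ok, config}
        :error -> {:error, :pool_not_found}
      end

    {:reply, reply, pools}
  end
end
```

One consequence of this convention is that a call with an explicitly empty argument list is indistinguishable from a call with no arguments, since both are sent as the bare atom.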

Phase 5: Migration Checklist

5.1 Pre-Migration Verification

# 1. Verify current test behavior
mix test test/jido_foundation/supervision_crash_recovery_test.exs --trace

# 2. Run individual tests to confirm they pass
for test_line in 80 132 174 216 234 264 293 355 421 482 530 582 650 747 806; do
  echo "Testing line $test_line"
  mix test test/jido_foundation/supervision_crash_recovery_test.exs:$test_line
done

# 3. Confirm batch failure
mix test test/jido_foundation/supervision_crash_recovery_test.exs --seed 123456

5.2 Migration Steps

Create Foundation Infrastructure

# Create new files
touch test/support/supervision_test_helpers.ex
touch test/support/supervision_test_setup.ex
touch test/support/isolated_service_discovery.ex

Enhance UnifiedTestFoundation

# Modify existing file
$EDITOR test/support/unified_test_foundation.ex

Update JidoFoundation Services

# Add registry support to services
$EDITOR lib/jido_foundation/task_pool_manager.ex
$EDITOR lib/jido_foundation/system_command_manager.ex  
$EDITOR lib/jido_foundation/coordination_manager.ex
$EDITOR lib/jido_foundation/scheduler_manager.ex

Migrate Test File

# Backup original
cp test/jido_foundation/supervision_crash_recovery_test.exs \
   test/jido_foundation/supervision_crash_recovery_test.exs.backup

# Apply new implementation
$EDITOR test/jido_foundation/supervision_crash_recovery_test.exs

5.3 Post-Migration Verification

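# 1. Verify individual tests still pass with new implementation
#    (the line numbers below come from the pre-migration file; update them
#    to match the migrated file before running)
for test_line in 80 132 174 216 234 264 293 355 421 482 530 582 650 747 806; do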
  echo "Testing isolated line $test_line"
  mix test test/jido_foundation/supervision_crash_recovery_test.exs:$test_line
done

# 2. Verify batch tests now pass (KEY SUCCESS METRIC)
mix test test/jido_foundation/supervision_crash_recovery_test.exs --seed 123456

# 3. Run multiple times to ensure stability
for i in {1..5}; do
  echo "Batch test run $i"
  mix test test/jido_foundation/supervision_crash_recovery_test.exs
done

# 4. Performance verification (should be similar or better)
time mix test test/jido_foundation/supervision_crash_recovery_test.exs

Benefits of Complete Test Isolation

1. Eliminated Test Contamination

✅ Each test gets fresh supervision tree
✅ No shared global state between tests
✅ Tests can run in any order safely
✅ Parallel execution becomes possible in principle (the :supervision_testing mode itself still uses async: false)

2. Improved OTP Compliance

✅ Tests follow proper OTP supervision testing patterns
✅ No pollution of production supervision trees
✅ Accurate simulation of real supervision behavior
✅ Clean separation of test and production concerns

3. Enhanced Debugging

✅ Test failures are isolated and reproducible
✅ No mysterious cascade failures from other tests
✅ Clear process lifecycle visibility per test
✅ Deterministic behavior regardless of test order

4. Better Performance

✅ No waiting for global state cleanup between tests
✅ Potential for parallel execution if the mode is later switched to async: true
✅ No contention between tests for shared global processes
✅ Consistent execution times

5. Maintainability

✅ Clear test isolation boundaries
✅ Easy to add new supervision tests
✅ No complex inter-test dependencies
✅ Self-documenting test behavior

Success Metrics

Pre-Migration (Current State)

❌ Batch Tests: Fail with (EXIT from #PID<0.95.0>) shutdown
✅ Individual Tests: All 15 tests pass when run alone
❌ Test Order Dependency: Tests fail when run in certain orders
❌ Reproducibility: Batch failures are intermittent

Post-Migration (Target State)

✅ Batch Tests: All 15 tests pass when run together
✅ Individual Tests: All 15 tests continue to pass alone
✅ Test Order Independence: Tests pass in any order
✅ Reproducibility: Consistent behavior across all runs
✅ Performance: Similar or improved execution time
✅ OTP Compliance: Proper supervision testing patterns

Implementation Timeline

Week 1: Foundation Infrastructure

Day 1-2: Create SupervisionTestHelpers and SupervisionTestSetup
Day 3-4: Enhance UnifiedTestFoundation with :supervision_testing mode
Day 5: Create IsolatedServiceDiscovery utilities

Week 2: Service Enhancement

Day 1-3: Add registry support to all JidoFoundation services
Day 4-5: Test service isolation functionality

Week 3: Test Migration

Day 1-3: Migrate supervision crash recovery test
Day 4-5: Verification and performance testing

Week 4: Documentation and Optimization

Day 1-2: Complete documentation and examples
Day 3-5: Performance optimization and final verification

Risk Mitigation

1. Backward Compatibility

Keep original test file as backup
Ensure production services work unchanged
Gradual migration with fallback options

2. Performance Impact

Monitor test execution times
Optimize supervision tree startup
Consider lazy service initialization

3. Complexity Management

Comprehensive documentation
Clear helper function interfaces
Step-by-step migration guide

4. Integration Issues

Test with existing CI/CD pipelines
Verify with different Elixir/OTP versions
Validate with production-like loads

Result: Complete elimination of test contamination through proper OTP supervision testing patterns, enabling reliable and maintainable crash recovery testing.