Unified Test Foundation Migration Guide
Complete Test Isolation for Supervision Crash Recovery Tests
Document Version: 1.0
Date: 2025-07-02
Context: Fix supervision crash recovery test contamination with proper OTP test isolation
Target: test/jido_foundation/supervision_crash_recovery_test.exs
Problem Statement
Current Issues
The supervision crash recovery tests exhibit test contamination when run together:
- ✅ All individual tests pass when run alone
- ❌ Tests fail with
(EXIT from #PID<0.95.0>) shutdown
when run in batch - Root Cause: Shared global JidoFoundation processes across tests
- OTP Violation: Tests hitting production supervision tree instead of isolated test supervisors
Architecture Problems
# CURRENT: Shared global state (BAD)
defmodule JidoFoundation.SupervisionCrashRecoveryTest do
use ExUnit.Case, async: false # Still shares global processes!
test "crash recovery" do
# Kills GLOBAL JidoFoundation.TaskPoolManager
task_pid = Process.whereis(JidoFoundation.TaskPoolManager)
Process.exit(task_pid, :kill)
# Affects ALL subsequent tests!
end
end
Problem: Every test crashes the same shared supervision tree, leaving it in unstable states that contaminate subsequent tests.
Solution: Complete Test Isolation
Architecture Overview
# TARGET: Isolated supervision per test (GOOD)
defmodule JidoFoundation.SupervisionCrashRecoveryTest do
use Foundation.UnifiedTestFoundation, :supervision_testing
test "crash recovery", %{supervision_tree: sup_tree} do
# Kills ISOLATED TaskPoolManager for this test only
{:ok, task_pid} = sup_tree |> get_service(:task_pool_manager)
Process.exit(task_pid, :kill)
# No effect on other tests!
end
end
Solution: Each test gets its own supervision tree with isolated JidoFoundation services.
Implementation Plan
Phase 1: Foundation Infrastructure Enhancement
1.1 Create Supervision Testing Mode
File: test/support/unified_test_foundation.ex
# Add new testing mode for supervision crash recovery
defmodule Foundation.UnifiedTestFoundation do
@doc """
Enhanced test foundation with supervision testing support.
## New Mode: :supervision_testing
Provides isolated JidoFoundation supervision trees for crash recovery testing.
Each test gets its own complete JidoSystem.Supervisor instance.
"""
defmacro __using__(mode) when mode == :supervision_testing do
quote do
use ExUnit.Case, async: false # Supervision tests need serialization
import Foundation.AsyncTestHelpers
import Foundation.SupervisionTestHelpers # New module
setup do
Foundation.SupervisionTestSetup.create_isolated_supervision()
end
end
end
# ... existing modes remain unchanged
end
1.2 Create Supervision Test Helpers
File: test/support/supervision_test_helpers.ex
defmodule Foundation.SupervisionTestHelpers do
@moduledoc """
Helper functions for testing supervision crash recovery with isolated supervision trees.
Provides utilities for:
- Creating isolated JidoFoundation supervision trees
- Accessing services within test supervision context
- Monitoring process lifecycle in test environment
- Verifying supervision behavior without global contamination
"""
import Foundation.AsyncTestHelpers
@doc """
Get a service PID from the test supervision tree.
## Examples
test "crash recovery", %{supervision_tree: sup_tree} do
{:ok, task_pid} = get_service(sup_tree, :task_pool_manager)
assert is_pid(task_pid)
end
"""
def get_service(supervision_context, service_name) do
service_module = service_name_to_module(service_name)
case supervision_context.registry
|> Registry.lookup({:service, service_module}) do
[{pid, _}] when is_pid(pid) -> {:ok, pid}
[] -> {:error, :service_not_found}
end
end
@doc """
Wait for a service to restart after crash in isolated supervision tree.
## Examples
test "restart behavior", %{supervision_tree: sup_tree} do
{:ok, old_pid} = get_service(sup_tree, :task_pool_manager)
Process.exit(old_pid, :kill)
{:ok, new_pid} = wait_for_service_restart(sup_tree, :task_pool_manager, old_pid)
assert new_pid != old_pid
end
"""
def wait_for_service_restart(supervision_context, service_name, old_pid, timeout \\ 5000) do
wait_for(
fn ->
case get_service(supervision_context, service_name) do
{:ok, new_pid} when new_pid != old_pid and is_pid(new_pid) ->
{:ok, new_pid}
_ ->
nil
end
end,
timeout
)
end
@doc """
Monitor all processes in supervision tree for proper shutdown cascade testing.
Returns a map of service_name => monitor_ref for easy assertion.
"""
def monitor_all_services(supervision_context) do
services = [:task_pool_manager, :system_command_manager, :coordination_manager, :scheduler_manager]
for service <- services, into: %{} do
{:ok, pid} = get_service(supervision_context, service)
ref = Process.monitor(pid)
{service, {pid, ref}}
end
end
@doc """
Verify rest_for_one supervision behavior in isolated environment.
## Examples
test "rest_for_one cascade", %{supervision_tree: sup_tree} do
monitors = monitor_all_services(sup_tree)
# Kill TaskPoolManager
{task_pid, _} = monitors[:task_pool_manager]
Process.exit(task_pid, :kill)
# Verify cascade: SystemCommandManager + CoordinationManager restart
# SchedulerManager should NOT restart (starts before TaskPoolManager)
verify_rest_for_one_cascade(monitors, :task_pool_manager)
end
"""
def verify_rest_for_one_cascade(monitors, crashed_service) do
supervision_order = [:scheduler_manager, :task_pool_manager, :system_command_manager, :coordination_manager]
crashed_index = Enum.find_index(supervision_order, &(&1 == crashed_service))
# Services started before crashed service should remain alive
services_before = Enum.take(supervision_order, crashed_index)
# Services started after crashed service should restart
services_after = Enum.drop(supervision_order, crashed_index + 1)
# Wait for crashed service DOWN message
{crashed_pid, crashed_ref} = monitors[crashed_service]
assert_receive {:DOWN, ^crashed_ref, :process, ^crashed_pid, :killed}, 2000
# Wait for dependent services DOWN messages (supervisor shutdown)
for service <- services_after do
{pid, ref} = monitors[service]
assert_receive {:DOWN, ^ref, :process, ^pid, reason}, 2000
assert reason in [:shutdown, :killed]
end
# Verify services before remain alive
for service <- services_before do
{pid, _ref} = monitors[service]
assert Process.alive?(pid), "#{service} should remain alive (started before crashed service)"
end
:ok
end
# Private helpers
defp service_name_to_module(:task_pool_manager), do: JidoFoundation.TaskPoolManager
defp service_name_to_module(:system_command_manager), do: JidoFoundation.SystemCommandManager
defp service_name_to_module(:coordination_manager), do: JidoFoundation.CoordinationManager
defp service_name_to_module(:scheduler_manager), do: JidoFoundation.SchedulerManager
end
1.3 Create Supervision Test Setup
File: test/support/supervision_test_setup.ex
defmodule Foundation.SupervisionTestSetup do
@moduledoc """
Setup infrastructure for isolated supervision testing.
Creates complete JidoFoundation supervision trees in isolation,
allowing crash recovery tests without global state contamination.
"""
require Logger
import Foundation.AsyncTestHelpers
@doc """
Create an isolated supervision tree for testing JidoFoundation crash recovery.
Returns supervision context with:
- Isolated JidoSystem.Supervisor instance
- Test-specific Registry for service discovery
- Service PIDs accessible via helper functions
- Proper cleanup on test exit
"""
def create_isolated_supervision do
# Create unique test identifier
test_id = :erlang.unique_integer([:positive])
test_registry = :"test_jido_registry_#{test_id}"
test_supervisor = :"test_jido_supervisor_#{test_id}"
# Create test-specific registry
{:ok, registry_pid} = Registry.start_link(
keys: :unique,
name: test_registry,
partitions: 1
)
# Start isolated JidoFoundation supervision tree
{:ok, supervisor_pid} = start_isolated_jido_supervisor(test_supervisor, test_registry)
# Wait for all services to be registered and stable
services = [
JidoFoundation.TaskPoolManager,
JidoFoundation.SystemCommandManager,
JidoFoundation.CoordinationManager,
JidoFoundation.SchedulerManager
]
wait_for_services_ready(test_registry, services)
supervision_context = %{
test_id: test_id,
registry: test_registry,
registry_pid: registry_pid,
supervisor: test_supervisor,
supervisor_pid: supervisor_pid,
services: services
}
# Setup cleanup
on_exit(fn ->
cleanup_isolated_supervision(supervision_context)
end)
%{supervision_tree: supervision_context}
end
defp start_isolated_jido_supervisor(supervisor_name, registry_name) do
# Create isolated version of JidoSystem.Application children
# but with test-specific registry registration
children = [
# 1. State persistence supervisor
{JidoSystem.Agents.StateSupervisor, name: :"#{supervisor_name}_state_supervisor"},
# 2. Test-specific registries
{Registry, keys: :unique, name: :"#{supervisor_name}_monitor_registry"},
{Registry, keys: :unique, name: :"#{supervisor_name}_workflow_registry"},
# 3. Core infrastructure services (isolated instances)
{JidoSystem.ErrorStore, name: :"#{supervisor_name}_error_store"},
{JidoSystem.HealthMonitor, name: :"#{supervisor_name}_health_monitor"},
# 4. Manager services with test-specific registration
{JidoFoundation.SchedulerManager,
name: :"#{supervisor_name}_scheduler_manager",
registry: registry_name},
{JidoFoundation.TaskPoolManager,
name: :"#{supervisor_name}_task_pool_manager",
registry: registry_name},
{JidoFoundation.SystemCommandManager,
name: :"#{supervisor_name}_system_command_manager",
registry: registry_name},
{JidoFoundation.CoordinationManager,
name: :"#{supervisor_name}_coordination_manager",
registry: registry_name},
# 5. Additional supervisors
{JidoFoundation.MonitorSupervisor, name: :"#{supervisor_name}_monitor_supervisor"},
{JidoSystem.Supervisors.WorkflowSupervisor, name: :"#{supervisor_name}_workflow_supervisor"},
# 6. Dynamic supervisor for agents
{DynamicSupervisor,
name: :"#{supervisor_name}_agent_supervisor",
strategy: :one_for_one}
]
opts = [
strategy: :rest_for_one, # Same as production for accurate testing
name: supervisor_name,
max_restarts: 3,
max_seconds: 5
]
Supervisor.start_link(children, opts)
end
defp wait_for_services_ready(registry_name, services) do
for service_module <- services do
wait_for(
fn ->
case Registry.lookup(registry_name, {:service, service_module}) do
[{pid, _}] when is_pid(pid) and Process.alive?(pid) -> pid
_ -> nil
end
end,
10_000
)
end
end
defp cleanup_isolated_supervision(context) do
# Terminate supervisor tree gracefully
if Process.alive?(context.supervisor_pid) do
Supervisor.stop(context.supervisor_pid, :normal, 5000)
end
# Terminate registry
if Process.alive?(context.registry_pid) do
GenServer.stop(context.registry_pid, :normal, 5000)
end
# Wait for cleanup to complete
wait_for(
fn ->
not Process.alive?(context.supervisor_pid) and
not Process.alive?(context.registry_pid)
end,
5000
)
Logger.debug("Cleaned up isolated supervision tree: #{context.test_id}")
end
end
Phase 2: Service Registration Enhancement
2.1 Modify JidoFoundation Services for Test Registration
File: lib/jido_foundation/task_pool_manager.ex
defmodule JidoFoundation.TaskPoolManager do
use GenServer
# Add support for test-specific registry registration
def start_link(opts \\ []) do
name = Keyword.get(opts, :name, __MODULE__)
registry = Keyword.get(opts, :registry, nil)
GenServer.start_link(__MODULE__, {opts, registry}, name: name)
end
def init({opts, registry}) do
# Register with test registry if provided
if registry do
Registry.register(registry, {:service, __MODULE__}, %{test_instance: true})
end
# ... existing init logic
end
# ... rest of module unchanged
end
Apply same pattern to:
lib/jido_foundation/system_command_manager.ex
lib/jido_foundation/coordination_manager.ex
lib/jido_foundation/scheduler_manager.ex
Phase 3: Test Migration
3.1 Migrate Supervision Crash Recovery Test
File: test/jido_foundation/supervision_crash_recovery_test.exs
defmodule JidoFoundation.SupervisionCrashRecoveryTest do
@moduledoc """
Comprehensive supervision crash recovery tests with complete test isolation.
Uses Foundation.UnifiedTestFoundation :supervision_testing mode to create
isolated supervision trees per test, eliminating test contamination.
Each test gets its own JidoFoundation services, enabling proper crash
recovery testing without affecting other tests.
"""
use Foundation.UnifiedTestFoundation, :supervision_testing
require Logger
import Foundation.SupervisionTestHelpers
@moduletag :supervision_testing
@moduletag timeout: 30_000
describe "Isolated TaskPoolManager crash recovery" do
test "TaskPoolManager restarts after crash and maintains functionality",
%{supervision_tree: sup_tree} do
# Get TaskPoolManager from isolated supervision tree
{:ok, initial_pid} = get_service(sup_tree, :task_pool_manager)
assert is_pid(initial_pid)
# Test functionality before crash (using isolated instance)
# Note: Functions now work with isolated services via Registry lookup
stats = call_isolated_service(sup_tree, :task_pool_manager, :get_all_stats)
assert is_map(stats)
# Kill the isolated TaskPoolManager process
Process.exit(initial_pid, :kill)
# Wait for supervisor to restart it with new pid
{:ok, new_pid} = wait_for_service_restart(sup_tree, :task_pool_manager, initial_pid)
# Verify it restarted with new pid
assert is_pid(new_pid)
assert new_pid != initial_pid
# Verify functionality is restored (isolated instance)
new_stats = call_isolated_service(sup_tree, :task_pool_manager, :get_all_stats)
assert is_map(new_stats)
# Test that pools can be created and used in isolated environment
case call_isolated_service(sup_tree, :task_pool_manager,
{:execute_batch, [:general, [1, 2, 3], fn x -> x * 2 end, [timeout: 1000]]}) do
{:ok, stream} ->
results = Enum.to_list(stream)
assert length(results) == 3
assert {:ok, 2} in results
assert {:ok, 4} in results
assert {:ok, 6} in results
{:error, :pool_not_found} ->
# Pool may not be ready yet after restart in isolated env
:ok
end
end
test "TaskPoolManager survives pool supervisor crashes",
%{supervision_tree: sup_tree} do
{:ok, manager_pid} = get_service(sup_tree, :task_pool_manager)
assert is_pid(manager_pid)
# Create a test pool in isolated environment
:ok = call_isolated_service(sup_tree, :task_pool_manager,
{:create_pool, [:test_crash_pool, %{max_concurrency: 2, timeout: 5000}]})
# Get pool stats to verify it's working (isolated)
{:ok, stats} = call_isolated_service(sup_tree, :task_pool_manager,
{:get_pool_stats, [:test_crash_pool]})
assert stats.max_concurrency == 2
# Test the pool with batch operation (isolated)
{:ok, stream} = call_isolated_service(sup_tree, :task_pool_manager,
{:execute_batch, [:test_crash_pool, [1, 2, 3], fn i -> i * 10 end, [timeout: 2000]]})
results = Enum.to_list(stream)
success_results = Enum.filter(results, fn
{:ok, _} -> true
_ -> false
end)
assert length(success_results) == 3
# Verify TaskPoolManager is still alive and functional (isolated)
assert Process.alive?(manager_pid)
{:ok, final_stats} = call_isolated_service(sup_tree, :task_pool_manager,
{:get_pool_stats, [:test_crash_pool]})
assert is_map(final_stats)
end
end
describe "rest_for_one supervision strategy testing" do
test "Service failures cause proper dependent restarts with :rest_for_one",
%{supervision_tree: sup_tree} do
# Monitor all services in isolated supervision tree
monitors = monitor_all_services(sup_tree)
# Kill TaskPoolManager in isolated environment
{task_pid, _} = monitors[:task_pool_manager]
Process.exit(task_pid, :kill)
# Verify rest_for_one cascade behavior in isolation
verify_rest_for_one_cascade(monitors, :task_pool_manager)
# Verify services are functioning after restart in isolated environment
{:ok, _} = get_service(sup_tree, :task_pool_manager)
{:ok, _} = get_service(sup_tree, :system_command_manager)
{:ok, _} = get_service(sup_tree, :coordination_manager)
# SchedulerManager should have same PID (not restarted)
{original_sched_pid, _} = monitors[:scheduler_manager]
{:ok, current_sched_pid} = get_service(sup_tree, :scheduler_manager)
assert original_sched_pid == current_sched_pid
end
test "Multiple simultaneous crashes don't bring down the system",
%{supervision_tree: sup_tree} do
# Get initial service PIDs from isolated tree
{:ok, task_pid} = get_service(sup_tree, :task_pool_manager)
{:ok, sys_pid} = get_service(sup_tree, :system_command_manager)
{:ok, sched_pid} = get_service(sup_tree, :scheduler_manager)
# Monitor for proper shutdown detection
task_ref = Process.monitor(task_pid)
sys_ref = Process.monitor(sys_pid)
sched_ref = Process.monitor(sched_pid)
# Kill multiple services simultaneously in isolated environment
Process.exit(task_pid, :kill)
Process.exit(sys_pid, :kill)
Process.exit(sched_pid, :kill)
# Wait for DOWN messages
assert_receive {:DOWN, ^task_ref, :process, ^task_pid, :killed}, 2000
assert_receive {:DOWN, ^sys_ref, :process, ^sys_pid, :killed}, 2000
assert_receive {:DOWN, ^sched_ref, :process, ^sched_pid, :killed}, 2000
# Wait for all services to restart in isolated environment
{:ok, new_task_pid} = wait_for_service_restart(sup_tree, :task_pool_manager, task_pid, 8000)
{:ok, new_sys_pid} = wait_for_service_restart(sup_tree, :system_command_manager, sys_pid, 8000)
{:ok, new_sched_pid} = wait_for_service_restart(sup_tree, :scheduler_manager, sched_pid, 8000)
# Verify all services restarted with new PIDs
assert new_task_pid != task_pid
assert new_sys_pid != sys_pid
assert new_sched_pid != sched_pid
# Verify functionality is restored in isolated environment
stats = call_isolated_service(sup_tree, :task_pool_manager, :get_all_stats)
case stats do
{:ok, _stats} -> :ok
stats when is_map(stats) -> :ok
_other -> flunk("Could not get TaskPoolManager stats in isolated environment")
end
end
end
describe "Process leak validation in isolated environment" do
test "No process leaks after service crashes and restarts",
%{supervision_tree: sup_tree} do
initial_count = :erlang.system_info(:process_count)
# Test crash/restart cycle in isolated environment
{:ok, task_pid} = get_service(sup_tree, :task_pool_manager)
# Monitor the process before killing it
ref = Process.monitor(task_pid)
Process.exit(task_pid, :kill)
# Wait for the DOWN message
assert_receive {:DOWN, ^ref, :process, ^task_pid, :killed}, 1000
# Wait for restart in isolated supervision tree
{:ok, new_pid} = wait_for_service_restart(sup_tree, :task_pool_manager, task_pid, 5000)
assert is_pid(new_pid)
assert new_pid != task_pid
# Allow stabilization
Process.sleep(1000)
final_count = :erlang.system_info(:process_count)
# Process count should be stable (isolated environment)
assert final_count - initial_count < 20,
"Process count increased significantly: #{initial_count} -> #{final_count}"
end
end
# Helper function to call services in isolated supervision tree
defp call_isolated_service(sup_tree, service_name, function_or_call) do
{:ok, pid} = get_service(sup_tree, service_name)
case function_or_call do
atom when is_atom(atom) ->
GenServer.call(pid, atom)
{function, args} when is_atom(function) and is_list(args) ->
apply(GenServer, :call, [pid, {function, args}])
_ ->
GenServer.call(pid, function_or_call)
end
end
end
Phase 4: Registry Integration
4.1 Service Discovery Helper
File: test/support/isolated_service_discovery.ex
defmodule Foundation.IsolatedServiceDiscovery do
@moduledoc """
Service discovery utilities for isolated supervision testing.
Provides transparent access to JidoFoundation services running
in isolated supervision trees during testing.
"""
@doc """
Call a JidoFoundation service function in isolated test environment.
Automatically routes calls to the correct isolated service instance
based on the current test's supervision context.
## Examples
# Instead of:
TaskPoolManager.get_all_stats()
# Use in isolated tests:
call_service(sup_tree, TaskPoolManager, :get_all_stats)
# Or with arguments:
call_service(sup_tree, TaskPoolManager, :create_pool, [:test_pool, %{max_concurrency: 4}])
"""
def call_service(supervision_context, service_module, function, args \\ []) do
case Registry.lookup(supervision_context.registry, {:service, service_module}) do
[{pid, _}] when is_pid(pid) ->
case args do
[] -> GenServer.call(pid, function)
_ -> GenServer.call(pid, {function, args})
end
[] ->
{:error, {:service_not_found, service_module}}
end
end
@doc """
Cast to a JidoFoundation service in isolated test environment.
"""
def cast_service(supervision_context, service_module, message) do
case Registry.lookup(supervision_context.registry, {:service, service_module}) do
[{pid, _}] when is_pid(pid) ->
GenServer.cast(pid, message)
[] ->
{:error, {:service_not_found, service_module}}
end
end
end
Phase 5: Migration Checklist
5.1 Pre-Migration Verification
# 1. Verify current test behavior
mix test test/jido_foundation/supervision_crash_recovery_test.exs --trace
# 2. Run individual tests to confirm they pass
for test_line in 80 132 174 216 234 264 293 355 421 482 530 582 650 747 806; do
echo "Testing line $test_line"
mix test test/jido_foundation/supervision_crash_recovery_test.exs:$test_line
done
# 3. Confirm batch failure
mix test test/jido_foundation/supervision_crash_recovery_test.exs --seed 123456
5.2 Migration Steps
Create Foundation Infrastructure
# Create new files touch test/support/supervision_test_helpers.ex touch test/support/supervision_test_setup.ex touch test/support/isolated_service_discovery.ex
Enhance UnifiedTestFoundation
# Modify existing file $EDITOR test/support/unified_test_foundation.ex
Update JidoFoundation Services
# Add registry support to services $EDITOR lib/jido_foundation/task_pool_manager.ex $EDITOR lib/jido_foundation/system_command_manager.ex $EDITOR lib/jido_foundation/coordination_manager.ex $EDITOR lib/jido_foundation/scheduler_manager.ex
Migrate Test File
# Backup original cp test/jido_foundation/supervision_crash_recovery_test.exs \ test/jido_foundation/supervision_crash_recovery_test.exs.backup # Apply new implementation $EDITOR test/jido_foundation/supervision_crash_recovery_test.exs
5.3 Post-Migration Verification
# 1. Verify individual tests still pass with new implementation
for test_line in 80 132 174 216 234 264 293 355 421 482 530 582 650 747 806; do
echo "Testing isolated line $test_line"
mix test test/jido_foundation/supervision_crash_recovery_test.exs:$test_line
done
# 2. Verify batch tests now pass (KEY SUCCESS METRIC)
mix test test/jido_foundation/supervision_crash_recovery_test.exs --seed 123456
# 3. Run multiple times to ensure stability
for i in {1..5}; do
echo "Batch test run $i"
mix test test/jido_foundation/supervision_crash_recovery_test.exs
done
# 4. Performance verification (should be similar or better)
time mix test test/jido_foundation/supervision_crash_recovery_test.exs
Benefits of Complete Test Isolation
1. Eliminated Test Contamination
- ✅ Each test gets fresh supervision tree
- ✅ No shared global state between tests
- ✅ Tests can run in any order safely
- ✅ Parallel execution becomes possible (async: true)
2. Improved OTP Compliance
- ✅ Tests follow proper OTP supervision testing patterns
- ✅ No pollution of production supervision trees
- ✅ Accurate simulation of real supervision behavior
- ✅ Clean separation of test and production concerns
3. Enhanced Debugging
- ✅ Test failures are isolated and reproducible
- ✅ No mysterious cascade failures from other tests
- ✅ Clear process lifecycle visibility per test
- ✅ Deterministic behavior regardless of test order
4. Better Performance
- ✅ No waiting for global state cleanup between tests
- ✅ Potential for parallel execution with
async: true
- ✅ Faster test startup (no shared resource contention)
- ✅ Consistent execution times
5. Maintainability
- ✅ Clear test isolation boundaries
- ✅ Easy to add new supervision tests
- ✅ No complex inter-test dependencies
- ✅ Self-documenting test behavior
Success Metrics
Pre-Migration (Current State)
- ❌ Batch Tests: Fail with
(EXIT from #PID<0.95.0>) shutdown
- ✅ Individual Tests: All 15 tests pass when run alone
- ❌ Test Order Dependency: Tests fail when run in certain orders
- ❌ Reproducibility: Batch failures are intermittent
Post-Migration (Target State)
- ✅ Batch Tests: All 15 tests pass when run together
- ✅ Individual Tests: All 15 tests continue to pass alone
- ✅ Test Order Independence: Tests pass in any order
- ✅ Reproducibility: Consistent behavior across all runs
- ✅ Performance: Similar or improved execution time
- ✅ OTP Compliance: Proper supervision testing patterns
Implementation Timeline
Week 1: Foundation Infrastructure
- Day 1-2: Create
SupervisionTestHelpers
andSupervisionTestSetup
- Day 3-4: Enhance
UnifiedTestFoundation
with:supervision_testing
mode - Day 5: Create
IsolatedServiceDiscovery
utilities
Week 2: Service Enhancement
- Day 1-3: Add registry support to all JidoFoundation services
- Day 4-5: Test service isolation functionality
Week 3: Test Migration
- Day 1-3: Migrate supervision crash recovery test
- Day 4-5: Verification and performance testing
Week 4: Documentation and Optimization
- Day 1-2: Complete documentation and examples
- Day 3-5: Performance optimization and final verification
Risk Mitigation
1. Backward Compatibility
- Keep original test file as backup
- Ensure production services work unchanged
- Gradual migration with fallback options
2. Performance Impact
- Monitor test execution times
- Optimize supervision tree startup
- Consider lazy service initialization
3. Complexity Management
- Comprehensive documentation
- Clear helper function interfaces
- Step-by-step migration guide
4. Integration Issues
- Test with existing CI/CD pipelines
- Verify with different Elixir/OTP versions
- Validate with production-like loads
Result: Complete elimination of test contamination through proper OTP supervision testing patterns, enabling reliable and maintainable crash recovery testing.