Sleep Test Fixes Analysis #2 - Root Cause Investigation
Executive Summary
After replacing 18 sleep instances with deterministic patterns, we’ve uncovered both existing concurrency issues and potentially introduced new ones. This document provides a deep investigation into the root causes of test failures following sleep removal.
Critical Failures Observed
1. Integration Test - Task Process Structure Error
** (ArgumentError) errors were found at the given arguments:
* 2nd argument: not a tuple
(erts 15.2.6) :erlang.element(2, %Task{...})
Location: integration_validation_test.exs:298
Pattern: Attempting to use elem(&1, 1)
on Task structs
2. ResourceManager Test - Already Started Errors
** (MatchError) no match of right hand side value: {:error, {:already_started, #PID<...>}}
Location: Multiple ResourceManager tests Pattern: Process already running when test expects to start it
3. GenServer Termination Issues
[error] GenServer :TaskPool_load_test_3_Supervisor terminating
** (stop) killed
Pattern: Supervisors being killed during test execution
Root Cause Analysis
Issue 1: Task Structure Misunderstanding
The Problem:
# INCORRECT CODE:
if Enum.all?(background_tasks, &Process.alive?(elem(&1, 1))) do
Analysis:
Task.async/1
returns a%Task{}
struct, not a tuple- The PID is accessed via
task.pid
, notelem(task, 1)
- This is a code error introduced by our fix, not a concurrency issue
Evidence:
- Error happens consistently at the same line
- ArgumentError clearly states “not a tuple”
- Task struct shape:
%Task{pid: pid, ref: ref, owner: owner}
Issue 2: ResourceManager Process Lifecycle
The Problem:
# In setup:
if pid = Process.whereis(Foundation.ResourceManager) do
GenServer.stop(pid)
wait_for(fn ->
case Process.whereis(Foundation.ResourceManager) do
nil -> true
_ -> nil
end
end, 1000)
end
# Later:
{:ok, _pid} = ResourceManager.start_link() # FAILS: already_started
Analysis:
- ResourceManager is supervised and auto-restarts after GenServer.stop
- The
wait_for
checking fornil
succeeds briefly, but supervisor restarts it - By the time we call
start_link
, it’s already restarted - This is an existing concurrency issue exposed by removing sleep
Evidence:
- Multiple tests fail with same “already_started” error
- Error PIDs are different each time (supervisor creating new processes)
- Pattern suggests race between test teardown and supervisor restart
Issue 3: Supervisor Cascade Failures
Analysis:
- Task pool supervisors are being killed during test runs
- Likely caused by test isolation issues when multiple tests run concurrently
- Sleep was masking race conditions in supervisor tree management
Hypothesis Testing Plan
Test 1: Verify Task Structure Issue
# Add instrumentation to confirm Task structure:
IO.inspect(background_tasks, label: "background_tasks structure")
IO.inspect(Enum.map(background_tasks, &{&1.__struct__, Map.keys(&1)}), label: "task fields")
Test 2: ResourceManager Lifecycle
# Add telemetry to track ResourceManager lifecycle:
:telemetry.attach(
"rm-lifecycle-test",
[:foundation, :resource_manager, :lifecycle],
fn event, measurements, metadata, _config ->
IO.puts("ResourceManager #{event}: #{inspect(metadata)}")
end,
nil
)
Test 3: Supervisor Restart Behavior
# Monitor supervisor restarts:
ref = Process.monitor(Process.whereis(Foundation.ResourceManager))
GenServer.stop(pid)
receive do
{:DOWN, ^ref, :process, _, reason} ->
IO.puts("ResourceManager DOWN: #{inspect(reason)}")
after
100 -> IO.puts("No DOWN message")
end
# Then check for restart
wait_for(fn ->
new_pid = Process.whereis(Foundation.ResourceManager)
IO.puts("ResourceManager check: #{inspect(new_pid)}")
new_pid != nil && new_pid != pid
end, 1000)
Proper OTP Patterns to Apply
1. Task Handling
# CORRECT: Access Task struct fields properly
if Enum.all?(background_tasks, fn task ->
Process.alive?(task.pid)
end) do
2. Supervised Process Management
# CORRECT: Don't fight the supervisor
case Process.whereis(Foundation.ResourceManager) do
nil ->
# Start it
{:ok, _pid} = ResourceManager.start_link()
pid ->
# It's already running, just ensure clean state
:ok = ResourceManager.reset_state() # If such function exists
end
3. Test Isolation
# Use unique names for test processes:
test_name = :"resource_manager_#{System.unique_integer()}"
{:ok, _pid} = ResourceManager.start_link(name: test_name)
Telemetry Integration Opportunities
1. Process Lifecycle Events
:telemetry.execute(
[:foundation, :process, :started],
%{timestamp: System.monotonic_time()},
%{module: __MODULE__, pid: self()}
)
2. Supervisor Events
:telemetry.execute(
[:foundation, :supervisor, :child_restarted],
%{restart_count: restart_count},
%{child_id: child_id, reason: reason}
)
3. Test Synchronization Points
:telemetry.execute(
[:foundation, :test, :checkpoint],
%{},
%{test: test_name, checkpoint: :processes_ready}
)
Recommended Fix Strategy
Phase 1: Fix Introduced Bugs (IMMEDIATE)
- Fix Task struct access in integration_validation_test.exs
- Verify no other
elem(task, N)
patterns exist
Phase 2: Address Process Lifecycle (HIGH PRIORITY)
- Implement proper test isolation for ResourceManager
- Either:
- Use unique process names per test
- Or work with existing supervised process
- Add lifecycle telemetry for debugging
Phase 3: Supervisor Coordination (MEDIUM PRIORITY)
- Investigate supervisor restart strategies
- Ensure test cleanup doesn’t fight supervisors
- Add supervisor telemetry events
Phase 4: Systematic Verification
- Run each fixed test in isolation
- Run tests concurrently to detect race conditions
- Add temporary instrumentation to confirm assumptions
Key Insights
Sleep was masking real issues: The sleep delays were hiding fundamental problems with process lifecycle management and supervisor interactions
Test isolation is critical: Tests are interfering with each other through shared supervised processes
OTP fights back: When you try to stop a supervised process, OTP restarts it - we must work with this, not against it
Deterministic != Correct: Our wait_for patterns are deterministic but revealed incorrect assumptions about process management
Conclusion
The test failures reveal both:
- Introduced bugs: Task struct access error (easy fix)
- Existing issues: Process lifecycle races, supervisor conflicts (harder fix)
The path forward requires:
- Immediate fixes for introduced bugs
- Proper OTP-compliant process management
- Better test isolation strategies
- Telemetry-based verification of assumptions
These failures are actually valuable - they’ve exposed fundamental issues that sleep was hiding. Fixing them properly will result in a much more robust test suite.