← Back to Foundation

JULY 1 2025 SLEEP TEST FIXES 02

Documentation for JULY_1_2025_SLEEP_TEST_FIXES_02 from the Foundation repository.

Sleep Test Fixes Analysis #2 - Root Cause Investigation

Executive Summary

After replacing 18 sleep instances with deterministic patterns, we’ve uncovered both existing concurrency issues and potentially introduced new ones. This document provides a deep investigation into the root causes of test failures following sleep removal.

Critical Failures Observed

1. Integration Test - Task Process Structure Error

** (ArgumentError) errors were found at the given arguments:
  * 2nd argument: not a tuple
(erts 15.2.6) :erlang.element(2, %Task{...})

Location: integration_validation_test.exs:298 Pattern: Attempting to use elem(&1, 1) on Task structs

2. ResourceManager Test - Already Started Errors

** (MatchError) no match of right hand side value: {:error, {:already_started, #PID<...>}}

Location: Multiple ResourceManager tests Pattern: Process already running when test expects to start it

3. GenServer Termination Issues

[error] GenServer :TaskPool_load_test_3_Supervisor terminating
** (stop) killed

Pattern: Supervisors being killed during test execution

Root Cause Analysis

Issue 1: Task Structure Misunderstanding

The Problem:

# INCORRECT CODE:
if Enum.all?(background_tasks, &Process.alive?(elem(&1, 1))) do

Analysis:

  • Task.async/1 returns a %Task{} struct, not a tuple
  • The PID is accessed via task.pid, not elem(task, 1)
  • This is a code error introduced by our fix, not a concurrency issue

Evidence:

  • Error happens consistently at the same line
  • ArgumentError clearly states “not a tuple”
  • Task struct shape: %Task{pid: pid, ref: ref, owner: owner}

Issue 2: ResourceManager Process Lifecycle

The Problem:

# In setup:
if pid = Process.whereis(Foundation.ResourceManager) do
  GenServer.stop(pid)
  wait_for(fn ->
    case Process.whereis(Foundation.ResourceManager) do
      nil -> true
      _ -> nil
    end
  end, 1000)
end

# Later:
{:ok, _pid} = ResourceManager.start_link()  # FAILS: already_started

Analysis:

  • ResourceManager is supervised and auto-restarts after GenServer.stop
  • The wait_for checking for nil succeeds briefly, but supervisor restarts it
  • By the time we call start_link, it’s already restarted
  • This is an existing concurrency issue exposed by removing sleep

Evidence:

  • Multiple tests fail with same “already_started” error
  • Error PIDs are different each time (supervisor creating new processes)
  • Pattern suggests race between test teardown and supervisor restart

Issue 3: Supervisor Cascade Failures

Analysis:

  • Task pool supervisors are being killed during test runs
  • Likely caused by test isolation issues when multiple tests run concurrently
  • Sleep was masking race conditions in supervisor tree management

Hypothesis Testing Plan

Test 1: Verify Task Structure Issue

# Add instrumentation to confirm Task structure:
IO.inspect(background_tasks, label: "background_tasks structure")
IO.inspect(Enum.map(background_tasks, &{&1.__struct__, Map.keys(&1)}), label: "task fields")

Test 2: ResourceManager Lifecycle

# Add telemetry to track ResourceManager lifecycle:
:telemetry.attach(
  "rm-lifecycle-test",
  [:foundation, :resource_manager, :lifecycle],
  fn event, measurements, metadata, _config ->
    IO.puts("ResourceManager #{event}: #{inspect(metadata)}")
  end,
  nil
)

Test 3: Supervisor Restart Behavior

# Monitor supervisor restarts:
ref = Process.monitor(Process.whereis(Foundation.ResourceManager))
GenServer.stop(pid)
receive do
  {:DOWN, ^ref, :process, _, reason} ->
    IO.puts("ResourceManager DOWN: #{inspect(reason)}")
after
  100 -> IO.puts("No DOWN message")
end

# Then check for restart
wait_for(fn ->
  new_pid = Process.whereis(Foundation.ResourceManager)
  IO.puts("ResourceManager check: #{inspect(new_pid)}")
  new_pid != nil && new_pid != pid
end, 1000)

Proper OTP Patterns to Apply

1. Task Handling

# CORRECT: Access Task struct fields properly
if Enum.all?(background_tasks, fn task -> 
  Process.alive?(task.pid) 
end) do

2. Supervised Process Management

# CORRECT: Don't fight the supervisor
case Process.whereis(Foundation.ResourceManager) do
  nil -> 
    # Start it
    {:ok, _pid} = ResourceManager.start_link()
  pid ->
    # It's already running, just ensure clean state
    :ok = ResourceManager.reset_state()  # If such function exists
end

3. Test Isolation

# Use unique names for test processes:
test_name = :"resource_manager_#{System.unique_integer()}"
{:ok, _pid} = ResourceManager.start_link(name: test_name)

Telemetry Integration Opportunities

1. Process Lifecycle Events

:telemetry.execute(
  [:foundation, :process, :started],
  %{timestamp: System.monotonic_time()},
  %{module: __MODULE__, pid: self()}
)

2. Supervisor Events

:telemetry.execute(
  [:foundation, :supervisor, :child_restarted],
  %{restart_count: restart_count},
  %{child_id: child_id, reason: reason}
)

3. Test Synchronization Points

:telemetry.execute(
  [:foundation, :test, :checkpoint],
  %{},
  %{test: test_name, checkpoint: :processes_ready}
)

Phase 1: Fix Introduced Bugs (IMMEDIATE)

  1. Fix Task struct access in integration_validation_test.exs
  2. Verify no other elem(task, N) patterns exist

Phase 2: Address Process Lifecycle (HIGH PRIORITY)

  1. Implement proper test isolation for ResourceManager
  2. Either:
    • Use unique process names per test
    • Or work with existing supervised process
  3. Add lifecycle telemetry for debugging

Phase 3: Supervisor Coordination (MEDIUM PRIORITY)

  1. Investigate supervisor restart strategies
  2. Ensure test cleanup doesn’t fight supervisors
  3. Add supervisor telemetry events

Phase 4: Systematic Verification

  1. Run each fixed test in isolation
  2. Run tests concurrently to detect race conditions
  3. Add temporary instrumentation to confirm assumptions

Key Insights

  1. Sleep was masking real issues: The sleep delays were hiding fundamental problems with process lifecycle management and supervisor interactions

  2. Test isolation is critical: Tests are interfering with each other through shared supervised processes

  3. OTP fights back: When you try to stop a supervised process, OTP restarts it - we must work with this, not against it

  4. Deterministic != Correct: Our wait_for patterns are deterministic but revealed incorrect assumptions about process management

Conclusion

The test failures reveal both:

  • Introduced bugs: Task struct access error (easy fix)
  • Existing issues: Process lifecycle races, supervisor conflicts (harder fix)

The path forward requires:

  1. Immediate fixes for introduced bugs
  2. Proper OTP-compliant process management
  3. Better test isolation strategies
  4. Telemetry-based verification of assumptions

These failures are actually valuable - they’ve exposed fundamental issues that sleep was hiding. Fixing them properly will result in a much more robust test suite.