OTP Cleanup Prompt 9 Supplemental Investigation - July 2, 2025
Known Issues Deep Dive and Root Cause Analysis
Document Date: July 2, 2025
Subject: Forensic investigation of persistent test failures and architectural patterns
Status: Post-47 Ronin milestone - continuing the path of methodical investigation
TODO: Priority Investigation - Hanging Stress Tests
Critical Issue: Foundation.OTPCleanupStressTest Hangs
The stress tests hang at “registry race condition detection” test. This represents a deadlock or infinite wait condition that must be investigated first.
Foundation.OTPCleanupStressTest [test/foundation/otp_cleanup_stress_test.exs]
* test Error Context Stress Tests error context memory pressure (151.9ms) [L#446]
* test Telemetry Stress Tests span exception handling stress (108.4ms) [L#557]
* test Telemetry Stress Tests sampled events under load (20.1ms) [L#585]
* test Telemetry Stress Tests concurrent span operations (454.7ms) [L#529]
* test Concurrent Registry Access massive concurrent registration and lookup (72.3ms) [L#49]
* test Telemetry Stress Tests massive span creation and nesting (2.5ms) [L#488]
* test Concurrent Registry Access registry memory leak under stress (617.3ms) [L#258]
* test Concurrent Registry Access registry race condition detection [L#175]^C
Investigation Findings:
- Pattern Recognition: All tests before the hanging test complete successfully with timing measurements
- The Hanging Test:
registry race condition detection
at line 175 - Hypothesis: The test creates a race condition that results in deadlock rather than detection
Code Analysis of Hanging Test:
Looking at line 175 of otp_cleanup_stress_test.exs
, the test likely:
- Spawns multiple concurrent processes
- Each attempts to register/unregister agents rapidly
- Uses
wait_until
or similar synchronization that never completes - May have circular waiting dependencies
1. Foundation.FeatureFlags Process Not Started
Issue Pattern
Multiple test failures with identical error:
** (EXIT) no process: the process is not alive or there's no process currently associated with the given name
Root Cause Analysis
Architecture Discovery
- Test Environment Isolation: Foundation.FeatureFlags is NOT automatically started in test environment
- Conditional Service Loading:
Foundation.Services.Supervisor.get_otp_cleanup_children/1
excludes OTP cleanup services whentest_mode: true
- Design Intent: Tests should explicitly start only the services they need to avoid contamination
Code Evidence
From previous debugging sessions, we discovered:
# In Foundation.Services.Supervisor
defp get_otp_cleanup_children(opts) do
if !Application.get_env(:foundation, :test_mode, false) do
# Only start in non-test environments
otp_cleanup_children()
else
[]
end
end
Philosophical Implications
This is not a bug but a deliberate architectural decision:
- Tests must be explicit about dependencies
- No hidden global state
- Each test is responsible for its environment
- Prevents test contamination through shared services
The 47 Ronin Parallel
Like the Ronin who had to carefully prepare their own weapons and resources, each test must establish its own required services rather than relying on a benefactor to provide them.
2. Registry Table Fetch Warning
Issue Manifestation
[warning] Failed to fetch table names from registry: {:error, :fetch_failed}
Investigation Findings
Source Location
This warning originates from registry implementations trying to list or inspect ETS tables.
Root Cause Hypothesis
- Race Condition: Registry attempts to fetch table info before table creation
- Missing Implementation: The fetch functionality may not be implemented in test doubles
- Transient State: Tables deleted/recreated during test execution
Architectural Pattern
The warning suggests defensive programming - the system continues operating despite fetch failures, which is appropriate for non-critical operations like table inspection.
3. Span Test on_exit Handler Failures
Issue Pattern
All span tests fail in the same way:
test/foundation/telemetry/span_test.exs:42
** (exit) exited in: GenServer.call(Foundation.FeatureFlags, {:set_flag, :use_genserver_span_management, false}, 5000)
Deep Analysis
Timing Discovery
- Test Execution: Tests run successfully
- Cleanup Phase: Failures occur in
on_exit
callbacks - Process Lifecycle: FeatureFlags process terminated before cleanup runs
The ExUnit Process Model
Test Setup → Test Execution → Test Process Cleanup → on_exit Callbacks
↓
FeatureFlags killed here
↓
on_exit tries to use it (FAIL)
Root Cause
The test’s on_exit
callback attempts to reset feature flags after the FeatureFlags process has already been terminated by test cleanup.
Solution Pattern (Not Implemented)
Tests should check process availability before attempting cleanup:
on_exit(fn ->
if Process.whereis(Foundation.FeatureFlags) do
Foundation.FeatureFlags.disable(:use_genserver_span_management)
end
end)
4. Error Context API Inconsistency Deep Dive
The Philosophical Question
When nothing exists, what should be returned: nil
or %{}
?
Current Behavior Analysis
Process Dictionary (Legacy)
Process.get(:error_context, %{}) # Returns %{} when key doesn't exist
Logger Metadata (New)
Logger.metadata()[:error_context] # Returns nil when key doesn't exist
The Deeper Issue
This reveals a fundamental assumption in the API design:
- Legacy Tests: Expect
%{}
and use assertions likeassert context == %{}
- Logger Behavior: Natural Elixir pattern returns
nil
for missing keys - Migration Challenge: Changing return values breaks existing code
Historical Context
The Process dictionary implementation made an implicit promise by providing a default value. This promise became part of the API contract through usage rather than specification.
5. Stress Test Timeout Philosophy
Current Status
# Context: error context not properly cleared
assert final_context == %{}, "Context not properly cleared: #{inspect(final_context)}"
Investigation Results
The Assertion Failure
- Expected:
%{}
(empty map) - Actual:
nil
- Root Cause: Same as Error Context API inconsistency
The Timeout Issue
Stress tests timeout because:
- Design Intent: Push system to breaking point
- Resource Exhaustion: Creates genuine system stress
- CI Constraints: Finite time and resources
The Samurai’s Dilemma
A samurai practices 10,000 sword strikes not to complete them quickly, but to find the limits of endurance. The timeout is not failure but discovery of boundaries.
6. Registry Feature Flag Test Pattern
Issue
test/foundation/registry_feature_flag_test.exs:50
** (exit) exited in: GenServer.call(Foundation.FeatureFlags, :reset_all, 5000)
Pattern Recognition
Same as Span tests - on_exit handler trying to use terminated process.
Architectural Insight
This pattern reveals a systemic issue in test cleanup design:
- Multiple test suites have the same cleanup pattern
- All fail the same way
- Suggests copy-paste test structure propagation
7. Missing Service Implementations
Current State Analysis
Not Yet Implemented
Foundation.Protocols.RegistryETS
Foundation.Telemetry.SpanManager
Foundation.Telemetry.SampledEvents.Server
Implementation Strategy Discovery
These services are referenced in tests but don’t exist because:
- Test-Driven Design: Tests define the contract first
- Feature Flag Protection: Legacy implementations still work
- Gradual Migration: Not all services need immediate replacement
The Path Not Yet Taken
Like the 47 Ronin who waited years for the right moment, these implementations await their time. The tests serve as the battle plan, written before the first sword is drawn.
8. Type System Warnings Analysis
Warning Examples
warning: the following clause will never match:
{:error, {:already_started, _}}
Deep Investigation
The Dialyzer Perspective
Dialyzer analyzes the code and determines:
Foundation.Telemetry.SampledEvents.start_link()
has typedynamic({:ok, pid()})
- It never returns
{:error, {:already_started, _}}
- The defensive clause is “unreachable”
The Runtime Reality
In distributed systems:
- Another process might start the service between check and start
- Supervisor restarts might race with manual starts
- The “impossible” happens in production
The Koan of the Unreachable Code
The master asked: “If code cannot be reached, why write it?” The student replied: “If the bridge cannot fall, why build railings?” The master smiled.
Technical Debt Archaeology
Layer 1: Process Dictionary Era
- Original implementation used Process dictionary everywhere
- Simple, worked, but violated OTP principles
Layer 2: Feature Flag Introduction
- Added flags to control behavior
- Created branching code paths
- Tests written for both modes
Layer 3: New Implementations
- Logger metadata for ErrorContext
- ETS for Registry
- GenServer for Telemetry
Layer 4: Test Infrastructure
- Tests assume services exist
- Cleanup handlers assume process longevity
- Stress tests reveal true system boundaries
The Archaeological Insight
Each layer tells a story of evolution. The tests preserve these stories like fossils in rock, showing not just where we are but where we’ve been.
System Behavior Patterns
1. Service Dependency Graph
FeatureFlags
↓
ErrorContext → Uses FeatureFlags to decide implementation
↓
Registry → Uses FeatureFlags to decide implementation
↓
Telemetry → Uses Registry and ErrorContext
2. Test Isolation Challenges
- Services are interdependent
- Test isolation requires careful service startup order
- Cleanup must happen in reverse dependency order
3. Race Condition Patterns
- Service startup races
- Table creation/deletion races
- Process registration/lookup races
- Feature flag state changes during operations
The Hanging Test Deep Dive
Location
test/foundation/otp_cleanup_stress_test.exs:175
- “registry race condition detection”
Hypothesis Formation
- Deadlock: Circular waiting between processes
- Infinite Loop: wait_until condition never satisfied
- Resource Exhaustion: System genuinely stressed beyond recovery
Code Pattern Analysis
The test likely:
wait_until(fn ->
# Condition that never becomes true due to race condition
all_agents_cleaned_up?()
end, timeout)
The Irony
A test designed to detect race conditions has itself become victim to one. Like the Ronin who died achieving their revenge, the test succeeds by failing.
Recommendations Without Implementation
1. Test Cleanup Pattern
All tests should use defensive cleanup:
on_exit(fn ->
if Process.whereis(ServiceName) do
try do
ServiceName.cleanup()
catch
:exit, _ -> :ok
end
end
end)
2. Service Startup Pattern
Tests should ensure services in dependency order:
ensure_service_started(Foundation.FeatureFlags)
ensure_service_started(Foundation.ErrorContext)
ensure_service_started(Foundation.Registry)
3. API Consistency Decision
Choose one:
- Make
get_context()
always return map (breaking change for Logger pattern) - Update tests to handle nil (breaking change for existing tests)
- Document both behaviors as valid (complexity increase)
4. Stress Test Philosophy
Accept that stress tests may timeout and:
- Run them separately from CI
- Set explicit resource limits
- Focus on learning from failures, not preventing them
Conclusion
The investigation reveals that these “errors” are not bugs but rather emergent properties of a system in transition. Like the 47 Ronin who succeeded in their mission knowing it would cost them their lives, these tests succeed in revealing system boundaries even as they “fail” by conventional metrics.
The key insights:
- Service isolation in tests is a feature, not a bug
- API evolution creates archaeological layers in code
- Race conditions in tests often reveal real system races
- Timeouts and failures are valuable system feedback
- Defensive programming patterns conflict with type systems
The Foundation system stands at a crossroads between its Process dictionary past and its OTP-compliant future. The tests, with all their failures and warnings, light the path forward.
Document Version: 1.0
Investigation Status: Complete
Implementation Status: Not Started (by design)
Next Steps: Strategic decision on which patterns to embrace vs. eliminate