OTP Cleanup Prompt 9 Supplemental Analysis 03 - Remaining Test Failures Investigation
Generated: July 2, 2025
Executive Summary
Following the substantial progress made in debugging the OTP cleanup integration tests, this document investigates the remaining test failures across the broader Foundation codebase. While the core OTP cleanup tests achieved an 84.7% success rate, several related test suites show failures that need to be addressed for complete system stability.
Test Failure Categories
1. SpanManager Service Availability Issues
Foundation.Telemetry.SpanTest Failures
Pattern: Tests fail because SpanManager GenServer is not running
** (exit) exited in: GenServer.call(Foundation.Telemetry.SpanManager, {:clear_stack, #PID<0.1362.0>}, 5000)
** (EXIT) no process: the process is not alive or there's no process currently associated with the given name
Root Cause Analysis:
- SpanManager is part of the conditional supervision tree in
Foundation.Services.Supervisor
- It only starts when:
- Not in test mode (unless explicitly included)
- Feature flag
:use_genserver_span_management
is enabled - Module is loaded
Affected Tests:
test add_attributes/1 adds attributes to current span
- Expects telemetry events that aren’t emitted without SpanManagertest propagate_context/0 and with_propagated_context/2
- Context propagation returns nil when SpanManager isn’t runningtest add_attributes/1 logs warning when no active span
- Setup fails trying to clear span stack
Investigation Findings:
- The span implementation has dual mode: Process dictionary (legacy) vs SpanManager (new)
- When SpanManager isn’t running, the code falls back to Process dictionary
- Tests assume SpanManager is always available
2. FeatureFlags Service Lifecycle Issues
Pattern: Multiple test suites fail with FeatureFlags not running
** (exit) exited in: GenServer.call(Foundation.FeatureFlags, :reset_all, 5000)
** (EXIT) no process: the process is not alive or there's no process currently associated with the given name
Affected Test Files:
Foundation.RegistryFeatureFlagTest
Foundation.MigrationControlTest
Root Cause Analysis:
- FeatureFlags is a singleton service managed by Foundation.Services.Supervisor
- In test mode, it may not be started automatically
- Some test suites don’t properly ensure the service is started in setup
- When multiple test files run concurrently, they can interfere with each other’s FeatureFlags state
3. ETS Table Management Issues
Registry ETS Table Not Found
** (ArgumentError) errors were found at the given arguments:
* 1st argument: the table identifier does not refer to an existing ETS table
(stdlib 6.2.2) :ets.select(:foundation_agent_registry_ets, ...)
Investigation Findings:
- RegistryETS creates its table on first use, not on service start
- Tests that call
list_agents/0
before any registration fail - The table name mismatch:
:foundation_agent_registry
vs:foundation_agent_registry_ets
4. Error Context Test Failures
ErrorContext.with_context Exception Propagation
test Logger metadata implementation with_context/1 enhances errors with context on exception
** (RuntimeError) Test error
Investigation Findings:
- The test expects
with_context/2
to catch and enhance exceptions - Current implementation doesn’t wrap exceptions, it re-raises them
- This is actually correct behavior - the test expectation is wrong
5. Agent Termination Warnings
Pattern: Multiple agent termination logs with :normal
reason
[error] Elixir.JidoSystem.Agents.FinalPersistenceDemoTest.IncrementalPersistAgent server terminating
Reason: ** (ErlangError) Erlang error: :normal
Investigation Findings:
- These are not actual errors - agents are terminating normally
- The error logging is overly aggressive for normal terminations
- This is noise in the test output but not actual failures
6. Registry Transaction Warnings
Pattern: Transaction rollback warnings in tests
[warning] Serial operations tx_130_889834 failed: {:register_failed, "existing", :already_exists}.
ETS changes NOT rolled back.
Investigation Findings:
- These are expected failures in negative test cases
- The warning about “NOT rolled back” is concerning - indicates incomplete transaction support
- Registry operations are not properly atomic
Architectural Issues Discovered
1. Service Dependency Management
Problem: Services have implicit dependencies not enforced by supervision tree
- SpanManager depends on FeatureFlags but doesn’t declare it
- RegistryETS depends on FeatureFlags but can partially function without it
- No clear service startup order enforcement
Impact: Race conditions and service availability issues in tests
2. Dual Implementation Complexity
Problem: Having both legacy (Process dictionary) and new (GenServer/ETS) implementations creates:
- Complex conditional logic throughout the codebase
- Tests that work with one implementation but not the other
- Difficulty ensuring feature parity between implementations
Impact: Test fragility and maintenance burden
3. Test Isolation Issues
Problem: Tests share global state through:
- Singleton GenServers (FeatureFlags, SpanManager)
- Named ETS tables
- Logger metadata
Impact: Test interference and non-deterministic failures
4. Incomplete Migration Path
Problem: The OTP cleanup migration is partially complete:
- Some modules fully migrated (RegistryETS)
- Some modules dual-mode (Span, ErrorContext)
- Some modules still using Process dictionary
- Transaction support incomplete in new implementations
Impact: Inconsistent behavior across the system
Recommendations for Resolution
1. Immediate Fixes Needed
SpanTest Setup:
- Add
ensure_service_started(Foundation.Telemetry.SpanManager)
to setup - Handle both SpanManager and Process dictionary modes in tests
- Add
FeatureFlags Availability:
- Create
Foundation.TestHelpers.ensure_foundation_services/0
to start all services - Call from
test_helper.exs
or each test module’s setup
- Create
Registry ETS Table Creation:
- Initialize ETS table in
start_link/1
not on first use - Use consistent table names across implementations
- Initialize ETS table in
Error Context Test Fix:
- Update test to expect exception propagation
- Add separate test for error enhancement without exceptions
2. Architectural Improvements
Service Dependencies:
- Implement service dependency declarations
- Create startup ordering in Services.Supervisor
- Add health checks for dependent services
Test Mode Services:
- Create
Foundation.TestMode.Supervisor
with all services - Ensure consistent service availability in tests
- Add service reset capabilities for test isolation
- Create
Migration Completion:
- Set deadline for removing Process dictionary implementations
- Add migration status dashboard
- Complete transaction support in RegistryETS
3. Testing Strategy Improvements
Service Availability Tests:
- Add tests that verify behavior when services are unavailable
- Test graceful degradation paths
- Test service restart scenarios
Implementation Parity Tests:
- Create property-based tests that run against both implementations
- Verify identical behavior for all operations
- Test implementation switching during operation
Integration Test Isolation:
- Use unique ETS table names per test
- Create test-specific FeatureFlags instances
- Clear all global state between tests
Summary of Remaining Work
High Priority
- Fix SpanManager availability in span tests
- Ensure FeatureFlags service in all test setups
- Fix Registry ETS table initialization
- Update ErrorContext test expectations
Medium Priority
- Implement service dependency management
- Complete transaction support in RegistryETS
- Create comprehensive test helpers for service management
- Fix agent termination logging levels
Low Priority
- Remove Process dictionary implementations after migration
- Create migration status tracking
- Implement service health checks
- Add performance regression tests
Test Success Metrics
- Current OTP Cleanup Tests: 84.7% (61/72 passing)
- Affected Related Tests: ~15 failures across 7 test files
- Total System Impact: ~95% of tests still passing
The remaining issues are primarily service availability and test setup problems rather than fundamental architectural flaws. With proper service management and test isolation, the OTP cleanup migration can achieve 100% test success.
Document Version: 1.0
Investigation Date: July 2, 2025
Next Steps: Implement high-priority fixes in order listed above