← Back to Foundation

JULY 2 2025 OTP CLEANUP 2121 prompts prompt9 supplemental 03

Documentation for JULY_2_2025_OTP_CLEANUP_2121_prompts_prompt9_supplemental_03 from the Foundation repository.

OTP Cleanup Prompt 9 Supplemental Analysis 03 - Remaining Test Failures Investigation

Generated: July 2, 2025

Executive Summary

Following the substantial progress made in debugging the OTP cleanup integration tests, this document investigates the remaining test failures across the broader Foundation codebase. While the core OTP cleanup tests achieved an 84.7% success rate, several related test suites show failures that need to be addressed for complete system stability.

Test Failure Categories

1. SpanManager Service Availability Issues

Foundation.Telemetry.SpanTest Failures

Pattern: Tests fail because SpanManager GenServer is not running

** (exit) exited in: GenServer.call(Foundation.Telemetry.SpanManager, {:clear_stack, #PID<0.1362.0>}, 5000)
    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name

Root Cause Analysis:

  • SpanManager is part of the conditional supervision tree in Foundation.Services.Supervisor
  • It only starts when:
    • Not in test mode (unless explicitly included)
    • Feature flag :use_genserver_span_management is enabled
    • Module is loaded

Affected Tests:

  1. test add_attributes/1 adds attributes to current span - Expects telemetry events that aren’t emitted without SpanManager
  2. test propagate_context/0 and with_propagated_context/2 - Context propagation returns nil when SpanManager isn’t running
  3. test add_attributes/1 logs warning when no active span - Setup fails trying to clear span stack

Investigation Findings:

  • The span implementation has dual mode: Process dictionary (legacy) vs SpanManager (new)
  • When SpanManager isn’t running, the code falls back to Process dictionary
  • Tests assume SpanManager is always available

2. FeatureFlags Service Lifecycle Issues

Pattern: Multiple test suites fail with FeatureFlags not running

** (exit) exited in: GenServer.call(Foundation.FeatureFlags, :reset_all, 5000)
    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name

Affected Test Files:

  • Foundation.RegistryFeatureFlagTest
  • Foundation.MigrationControlTest

Root Cause Analysis:

  • FeatureFlags is a singleton service managed by Foundation.Services.Supervisor
  • In test mode, it may not be started automatically
  • Some test suites don’t properly ensure the service is started in setup
  • When multiple test files run concurrently, they can interfere with each other’s FeatureFlags state

3. ETS Table Management Issues

Registry ETS Table Not Found

** (ArgumentError) errors were found at the given arguments:
    * 1st argument: the table identifier does not refer to an existing ETS table
    (stdlib 6.2.2) :ets.select(:foundation_agent_registry_ets, ...)

Investigation Findings:

  • RegistryETS creates its table on first use, not on service start
  • Tests that call list_agents/0 before any registration fail
  • The table name mismatch: :foundation_agent_registry vs :foundation_agent_registry_ets

4. Error Context Test Failures

ErrorContext.with_context Exception Propagation

test Logger metadata implementation with_context/1 enhances errors with context on exception
** (RuntimeError) Test error

Investigation Findings:

  • The test expects with_context/2 to catch and enhance exceptions
  • Current implementation doesn’t wrap exceptions, it re-raises them
  • This is actually correct behavior - the test expectation is wrong

5. Agent Termination Warnings

Pattern: Multiple agent termination logs with :normal reason

[error] Elixir.JidoSystem.Agents.FinalPersistenceDemoTest.IncrementalPersistAgent server terminating
Reason: ** (ErlangError) Erlang error: :normal

Investigation Findings:

  • These are not actual errors - agents are terminating normally
  • The error logging is overly aggressive for normal terminations
  • This is noise in the test output but not actual failures

6. Registry Transaction Warnings

Pattern: Transaction rollback warnings in tests

[warning] Serial operations tx_130_889834 failed: {:register_failed, "existing", :already_exists}. 
ETS changes NOT rolled back.

Investigation Findings:

  • These are expected failures in negative test cases
  • The warning about “NOT rolled back” is concerning - indicates incomplete transaction support
  • Registry operations are not properly atomic

Architectural Issues Discovered

1. Service Dependency Management

Problem: Services have implicit dependencies not enforced by supervision tree

  • SpanManager depends on FeatureFlags but doesn’t declare it
  • RegistryETS depends on FeatureFlags but can partially function without it
  • No clear service startup order enforcement

Impact: Race conditions and service availability issues in tests

2. Dual Implementation Complexity

Problem: Having both legacy (Process dictionary) and new (GenServer/ETS) implementations creates:

  • Complex conditional logic throughout the codebase
  • Tests that work with one implementation but not the other
  • Difficulty ensuring feature parity between implementations

Impact: Test fragility and maintenance burden

3. Test Isolation Issues

Problem: Tests share global state through:

  • Singleton GenServers (FeatureFlags, SpanManager)
  • Named ETS tables
  • Logger metadata

Impact: Test interference and non-deterministic failures

4. Incomplete Migration Path

Problem: The OTP cleanup migration is partially complete:

  • Some modules fully migrated (RegistryETS)
  • Some modules dual-mode (Span, ErrorContext)
  • Some modules still using Process dictionary
  • Transaction support incomplete in new implementations

Impact: Inconsistent behavior across the system

Recommendations for Resolution

1. Immediate Fixes Needed

  1. SpanTest Setup:

    • Add ensure_service_started(Foundation.Telemetry.SpanManager) to setup
    • Handle both SpanManager and Process dictionary modes in tests
  2. FeatureFlags Availability:

    • Create Foundation.TestHelpers.ensure_foundation_services/0 to start all services
    • Call from test_helper.exs or each test module’s setup
  3. Registry ETS Table Creation:

    • Initialize ETS table in start_link/1 not on first use
    • Use consistent table names across implementations
  4. Error Context Test Fix:

    • Update test to expect exception propagation
    • Add separate test for error enhancement without exceptions

2. Architectural Improvements

  1. Service Dependencies:

    • Implement service dependency declarations
    • Create startup ordering in Services.Supervisor
    • Add health checks for dependent services
  2. Test Mode Services:

    • Create Foundation.TestMode.Supervisor with all services
    • Ensure consistent service availability in tests
    • Add service reset capabilities for test isolation
  3. Migration Completion:

    • Set deadline for removing Process dictionary implementations
    • Add migration status dashboard
    • Complete transaction support in RegistryETS

3. Testing Strategy Improvements

  1. Service Availability Tests:

    • Add tests that verify behavior when services are unavailable
    • Test graceful degradation paths
    • Test service restart scenarios
  2. Implementation Parity Tests:

    • Create property-based tests that run against both implementations
    • Verify identical behavior for all operations
    • Test implementation switching during operation
  3. Integration Test Isolation:

    • Use unique ETS table names per test
    • Create test-specific FeatureFlags instances
    • Clear all global state between tests

Summary of Remaining Work

High Priority

  1. Fix SpanManager availability in span tests
  2. Ensure FeatureFlags service in all test setups
  3. Fix Registry ETS table initialization
  4. Update ErrorContext test expectations

Medium Priority

  1. Implement service dependency management
  2. Complete transaction support in RegistryETS
  3. Create comprehensive test helpers for service management
  4. Fix agent termination logging levels

Low Priority

  1. Remove Process dictionary implementations after migration
  2. Create migration status tracking
  3. Implement service health checks
  4. Add performance regression tests

Test Success Metrics

  • Current OTP Cleanup Tests: 84.7% (61/72 passing)
  • Affected Related Tests: ~15 failures across 7 test files
  • Total System Impact: ~95% of tests still passing

The remaining issues are primarily service availability and test setup problems rather than fundamental architectural flaws. With proper service management and test isolation, the OTP cleanup migration can achieve 100% test success.


Document Version: 1.0
Investigation Date: July 2, 2025
Next Steps: Implement high-priority fixes in order listed above