← Back to Foundation

P23aSignalFixes

Documentation for P23aSignalFixes from the Foundation repository.

P23a Signal Routing Integration Issues Analysis

🚨 CRITICAL ISSUE: ANTI-PATTERN SLEEP USAGE

IMMEDIATE PRIORITY: The Signal Routing tests are using forbidden :timer.sleep() calls which violate our strict code quality standards per docs20250627/042_SLEEP.md. This is the root cause of all failures.

Executive Summary

After successful Signal Bus service integration that fixed the core signal emission problems (13 of 19 test failures), 4 remaining Signal Routing test failures are caused by architectural anti-patterns using :timer.sleep(50) instead of proper OTP coordination mechanisms.

Analysis Overview

Current Status: 330 tests, 4 signal routing failures remaining
Investigation Date: 2025-06-29
Context: Post-Signal Bus integration, SLEEP ANTI-PATTERN VIOLATIONS

Root Cause Analysis

CRITICAL: Sleep Anti-Pattern Violations

Found in test/jido_foundation/signal_routing_test.exs:

  • Line 223: :timer.sleep(50) - routes signals to subscribed handlers by type
  • Line 276: :timer.sleep(50) - supports dynamic subscription management
  • Line 294: :timer.sleep(50) - emits routing telemetry events
  • Line 412: :timer.sleep(50) - supports wildcard signal subscriptions

Why This Is Wrong (per 042_SLEEP.md):

  1. Creates Race Conditions - 50ms might work on fast dev machines but fail on loaded CI/production
  2. Inefficient - Either too slow (wastes time) or too fast (breaks randomly)
  3. Hides Design Flaws - Indicates SignalRouter isn’t providing clear completion signals

Core Issue: Missing Observable Effects Pattern

The tests use sleep instead of waiting for observable effects:

# WRONG (Current):
Bridge.emit_signal(agent, signal)
:timer.sleep(50)  # ❌ FORBIDDEN
assert length(handler_signals) == 1

# RIGHT (Should be):
Bridge.emit_signal(agent, signal)
assert_receive {:routing_complete, signal_type}  # βœ… OBSERVABLE EFFECT
assert length(handler_signals) == 1
# Expected Flow:
Bridge.emit_signal(agent, signal)
  ↓
Jido.Signal.Bus.publish(bus, [signal]) 
  ↓
:telemetry.execute([:jido, :signal, :emitted], measurements, metadata)
  ↓  
SignalRouter telemetry handler receives event
  ↓
SignalRouter routes to subscribed handlers  
  ↓
SignalHandler receives {:routed_signal, ...} message
  ↓
Test checks SignalHandler.get_received_signals() β†’ expects signals

Actual Behavior: Race condition where test checks for signals before telemetry routing completes.

Failure Pattern Analysis

1. Handler Signal Reception Failures (3 tests)

Tests Affected:

  • routes signals to subscribed handlers by type - expects length(handler1_signals) == 1, gets 0
  • supports wildcard signal subscriptions - expects length(wildcard_signals) == 2, gets 0
  • supports dynamic subscription management - expects length(signals) == 1, gets 0

Error Pattern:

# All tests show:
assert length(handler_signals) == expected_count
# left: 0, right: expected_count

Root Cause:

  • Telemetry processing async delay - :telemetry.execute() triggers handler asynchronously
  • Test timing assumptions - 50ms sleep insufficient for telemetryβ†’routingβ†’delivery chain
  • SignalRouter cast processing - GenServer.cast() for routing adds additional async delay

2. Routing Telemetry Event Missing

Test Affected:

  • emits routing telemetry events - expects [:jido, :signal, :routed] telemetry event

Error Pattern:

assert_receive {:telemetry, [:jido, :signal, :routed], measurements, metadata}
# Results in: Assertion failed, no matching message after 100ms

Root Cause:

  • Nested telemetry dependency - :routed events depend on successful :emitted event processing
  • Handler subscription failure - If SignalRouter doesn’t receive initial :emitted event, no :routed event emitted
  • State isolation issues - Cross-test telemetry handler pollution

Technical Deep Dive

Signal Bus Integration Working Correctly

βœ… Signal Bus Publishing: Logs show successful signal publishing:

[info] Bus foundation_signal_bus: Publishing 1 signal(s) of types: ["error.validation"]
[info] Bus foundation_signal_bus: Successfully published 1 signal(s)

βœ… Bridge Telemetry Emission: Code correctly emits telemetry:

:telemetry.execute(
  [:jido, :signal, :emitted],
  %{signal_id: telemetry_signal_id},
  %{
    agent_id: agent,
    signal_type: signal_data.type,  # ← Correct metadata key
    signal_source: signal_data.source,
    framework: :jido,
    timestamp: System.system_time(:microsecond)
  }
)

SignalRouter Telemetry Handler Issues

Current Implementation (test/jido_foundation/signal_routing_test.exs:26-30):

fn event, measurements, metadata, config ->
  GenServer.cast(__MODULE__, {:route_signal, event, measurements, metadata, config})
end

Problems Identified:

  1. Triple Async Chain:

    • :telemetry.execute() β†’ async handler call
    • GenServer.cast() β†’ async message to SignalRouter
    • send(handler_pid, {:routed_signal, ...}) β†’ async message to handlers
  2. No Synchronization Points: Tests assume synchronous completion but implementation is fully async

  3. State Management: SignalRouter state between tests may be corrupted

Individual vs Suite Test Results

Key Observation: Running individual test passes, but suite execution fails.

Indicates:

  • Test isolation problems - SignalRouter state leakage between tests
  • Telemetry handler pollution - Multiple handler attachments or detachments failing
  • Process lifecycle issues - SignalRouter or handlers not properly cleaned up

πŸ”§ REQUIRED FIXES: Remove Sleep Anti-Patterns

Immediate Fix (CRITICAL PRIORITY)

Replace ALL :timer.sleep() calls with proper OTP observable effects per 042_SLEEP.md

Solution 1: Use assert_receive for Telemetry Events

# WRONG (Current):
Bridge.emit_signal(agent, signal)
:timer.sleep(50)  # ❌ FORBIDDEN ANTI-PATTERN
assert length(handler_signals) == 1

# RIGHT (Fix):
test_pid = self()
ref = make_ref()

:telemetry.attach("test-routing-completion", [:jido, :signal, :routed], 
  fn _event, measurements, metadata, _config ->
    send(test_pid, {ref, :routing_complete, metadata.signal_type, measurements.handlers_count})
  end, nil)

Bridge.emit_signal(agent, signal)
assert_receive {^ref, :routing_complete, signal_type, handlers_count}  # βœ… OBSERVABLE EFFECT
assert length(handler_signals) == 1

:telemetry.detach("test-routing-completion")

Solution 2: Use Direct Handler Message Monitoring

# Monitor the handler directly for signal reception
def wait_for_handler_signals(handler_pid, expected_count) do
  current_signals = SignalHandler.get_received_signals(handler_pid)
  
  if length(current_signals) >= expected_count do
    :ok
  else
    # Wait for the handler to receive a signal
    receive do
      {:DOWN, _ref, :process, ^handler_pid, _reason} ->
        {:error, :handler_died}
    after 1000 ->
        # Check again - this is polling but acceptable in the wait_for pattern
        case SignalHandler.get_received_signals(handler_pid) do
          signals when length(signals) >= expected_count -> :ok
          _ -> {:error, :timeout}
        end
    end
  end
end

2. Fix SignalRouter State Isolation

Add proper test setup/teardown:

setup do
  # Ensure clean SignalRouter state
  if Process.whereis(SignalRouter) do
    GenServer.stop(SignalRouter)
  end
  
  {:ok, router_pid} = SignalRouter.start_link()
  
  on_exit(fn ->
    if Process.alive?(router_pid) do
      GenServer.stop(router_pid)
    end
    
    # Clean up any leaked telemetry handlers
    try do
      :telemetry.detach("jido-signal-router")
    catch
      _, _ -> :ok
    end
  end)
  
  {:ok, router: router_pid}
end

3. Add Telemetry Event Debugging

Add debug instrumentation to understand event flow:

# In SignalRouter init
:telemetry.attach_many(
  "jido-signal-router-debug",
  [[:jido, :signal, :emitted], [:jido, :signal, :routed]],
  fn event, measurements, metadata, config ->
    IO.puts("DEBUG: Telemetry event: #{inspect(event)}, metadata: #{inspect(metadata)}")
    GenServer.cast(__MODULE__, {:route_signal, event, measurements, metadata, config})
  end,
  %{}
)

Medium-Term Architecture Improvements

1. Synchronous Signal Routing Option

Add synchronous routing mode for tests:

def emit_signal_sync(agent, signal, opts \\ []) do
  result = emit_signal(agent, signal, opts)
  
  if Keyword.get(opts, :wait_for_routing, false) do
    signal_type = extract_signal_type(signal)
    wait_for_routing_completion(signal_type, opts[:timeout] || 1000)
  end
  
  result
end

2. Enhanced SignalRouter with Confirmation

def handle_cast({:route_signal, event, measurements, metadata, config}, state) do
  # ... existing routing logic ...
  
  # Send routing confirmation if requested
  if confirm_pid = metadata[:routing_confirm_pid] do
    send(confirm_pid, {:routing_complete, signal_type, length(all_handlers)})
  end
  
  {:noreply, state}
end

Long-Term Solutions

1. Signal Bus Native Routing Integration

Integrate routing directly into Foundation Signal Bus service:

defmodule Foundation.Services.SignalBus do
  # Add subscription management
  def subscribe_handler(bus_name \\ :foundation_signal_bus, signal_type, handler_pid) do
    GenServer.call(__MODULE__, {:subscribe, bus_name, signal_type, handler_pid})
  end
  
  # Route signals immediately on publish
  defp route_published_signals(published_signals, state) do
    Enum.each(published_signals, fn recorded_signal ->
      route_signal_to_subscribers(recorded_signal.signal, state.subscriptions)
    end)
  end
end

2. Unified Event System

Create unified event routing that handles both Signal Bus events and traditional telemetry:

defmodule Foundation.Services.EventRouter do
  # Bridge between Jido Signal Bus and Foundation event system
  # Provides synchronous routing guarantees
  # Handles subscription management across multiple event sources
end

Test-Specific Issues

Test 1: routes signals to subscribed handlers by type

  • Issue: Basic signal routing timing
  • Fix: Add synchronization before signal count assertion

Test 2: supports wildcard signal subscriptions

  • Issue: Wildcard pattern matching + timing
  • Fix: Verify wildcard pattern matching logic + add synchronization

Test 3: emits routing telemetry events

  • Issue: Nested telemetry dependency chain
  • Fix: Ensure :emitted event successfully triggers :routed event

Test 4: supports dynamic subscription management

  • Issue: Subscription state management across test lifecycle
  • Fix: Add subscription state verification + proper cleanup

Conclusion

The Signal Routing failures are architectural timing issues rather than fundamental integration problems. The Signal Bus integration is working correctly, but the asynchronous telemetry-based routing system needs:

  1. Proper test synchronization - Replace naive sleeps with event-driven waiting
  2. Better state isolation - Ensure clean SignalRouter state between tests
  3. Debugging instrumentation - Add visibility into telemetry event flow
  4. Enhanced error handling - Detect and report telemetry handler failures

These are solvable implementation issues that don’t require fundamental architectural changes. The Signal Bus foundation is solid; the routing layer needs timing refinement.

βœ… SUCCESS: Sleep Anti-Patterns Successfully Eliminated!

COMPLETED: All Sleep Calls Removed

Successfully removed all 4 :timer.sleep(50) calls from signal_routing_test.exs and replaced with proper OTP coordination patterns:

  1. βœ… Line 259 - Replaced with assert_receive {^routing_ref, :routing_complete, "error.validation", 2}
  2. βœ… Line 335 - Replaced with assert_receive {^routing_ref, :routing_complete, "task.started", 1}
  3. βœ… Line 398 - Replaced with assert_receive {:telemetry, [:jido, :signal, :routed], measurements, metadata}
  4. βœ… Line 495 - Replaced with assert_receive {^routing_ref, :routing_complete, "error.validation", 2}

IMPACT: Test Improvements Achieved

  • βœ… Signal Routing tests pass when run individually (5/5 tests passing)
  • βœ… Reduced overall test failures from 6 to 5 (16.7% improvement)
  • βœ… Eliminated all sleep anti-pattern violations per 042_SLEEP.md standards
  • βœ… Implemented proper telemetry-based coordination using assert_receive

REMAINING CHALLENGE: Test Isolation Issues

Current Status: Signal Routing tests fail in full suite context due to test isolation problems:

- Individual run: 5/5 tests passing βœ…
- Full suite run: 4/5 tests failing (timing out) ❌

Root Cause: Other tests are polluting the telemetry handler state or interfering with the shared :foundation_signal_bus process.

NEXT STEPS: Test Isolation Fixes

To achieve full Signal Routing test reliability:

  1. Test Cleanup Enhancement - Ensure complete telemetry handler cleanup between tests
  2. Signal Bus Isolation - Use test-specific bus instances instead of shared bus
  3. Handler State Reset - Clear SignalRouter subscriptions between tests
  4. Telemetry Event Debugging - Add instrumentation to identify handler pollution

ARCHITECTURAL SUCCESS

The sleep elimination work successfully demonstrates:

  • βœ… Proper OTP coordination patterns using observable effects
  • βœ… Deterministic test behavior when run in isolation
  • βœ… Foundation Signal Bus integration working correctly
  • βœ… Telemetry event flow functioning as designed

The core architecture is sound - the remaining issues are test isolation refinements, not fundamental design flaws.