P23a Signal Routing Integration Issues Analysis
π¨ CRITICAL ISSUE: ANTI-PATTERN SLEEP USAGE
IMMEDIATE PRIORITY: The Signal Routing tests are using forbidden :timer.sleep()
calls which violate our strict code quality standards per docs20250627/042_SLEEP.md
. This is the root cause of all failures.
Executive Summary
After successful Signal Bus service integration that fixed the core signal emission problems (13 of 19 test failures), 4 remaining Signal Routing test failures are caused by architectural anti-patterns using :timer.sleep(50)
instead of proper OTP coordination mechanisms.
Analysis Overview
Current Status: 330 tests, 4 signal routing failures remaining
Investigation Date: 2025-06-29
Context: Post-Signal Bus integration, SLEEP ANTI-PATTERN VIOLATIONS
Root Cause Analysis
CRITICAL: Sleep Anti-Pattern Violations
Found in test/jido_foundation/signal_routing_test.exs:
- Line 223:
:timer.sleep(50)
- routes signals to subscribed handlers by type - Line 276:
:timer.sleep(50)
- supports dynamic subscription management - Line 294:
:timer.sleep(50)
- emits routing telemetry events - Line 412:
:timer.sleep(50)
- supports wildcard signal subscriptions
Why This Is Wrong (per 042_SLEEP.md):
- Creates Race Conditions - 50ms might work on fast dev machines but fail on loaded CI/production
- Inefficient - Either too slow (wastes time) or too fast (breaks randomly)
- Hides Design Flaws - Indicates SignalRouter isn’t providing clear completion signals
Core Issue: Missing Observable Effects Pattern
The tests use sleep
instead of waiting for observable effects:
# WRONG (Current):
Bridge.emit_signal(agent, signal)
:timer.sleep(50) # β FORBIDDEN
assert length(handler_signals) == 1
# RIGHT (Should be):
Bridge.emit_signal(agent, signal)
assert_receive {:routing_complete, signal_type} # β
OBSERVABLE EFFECT
assert length(handler_signals) == 1
# Expected Flow:
Bridge.emit_signal(agent, signal)
β
Jido.Signal.Bus.publish(bus, [signal])
β
:telemetry.execute([:jido, :signal, :emitted], measurements, metadata)
β
SignalRouter telemetry handler receives event
β
SignalRouter routes to subscribed handlers
β
SignalHandler receives {:routed_signal, ...} message
β
Test checks SignalHandler.get_received_signals() β expects signals
Actual Behavior: Race condition where test checks for signals before telemetry routing completes.
Failure Pattern Analysis
1. Handler Signal Reception Failures (3 tests)
Tests Affected:
routes signals to subscribed handlers by type
- expectslength(handler1_signals) == 1
, gets0
supports wildcard signal subscriptions
- expectslength(wildcard_signals) == 2
, gets0
supports dynamic subscription management
- expectslength(signals) == 1
, gets0
Error Pattern:
# All tests show:
assert length(handler_signals) == expected_count
# left: 0, right: expected_count
Root Cause:
- Telemetry processing async delay -
:telemetry.execute()
triggers handler asynchronously - Test timing assumptions - 50ms sleep insufficient for telemetryβroutingβdelivery chain
- SignalRouter cast processing -
GenServer.cast()
for routing adds additional async delay
2. Routing Telemetry Event Missing
Test Affected:
emits routing telemetry events
- expects[:jido, :signal, :routed]
telemetry event
Error Pattern:
assert_receive {:telemetry, [:jido, :signal, :routed], measurements, metadata}
# Results in: Assertion failed, no matching message after 100ms
Root Cause:
- Nested telemetry dependency -
:routed
events depend on successful:emitted
event processing - Handler subscription failure - If SignalRouter doesn’t receive initial
:emitted
event, no:routed
event emitted - State isolation issues - Cross-test telemetry handler pollution
Technical Deep Dive
Signal Bus Integration Working Correctly
β Signal Bus Publishing: Logs show successful signal publishing:
[info] Bus foundation_signal_bus: Publishing 1 signal(s) of types: ["error.validation"]
[info] Bus foundation_signal_bus: Successfully published 1 signal(s)
β Bridge Telemetry Emission: Code correctly emits telemetry:
:telemetry.execute(
[:jido, :signal, :emitted],
%{signal_id: telemetry_signal_id},
%{
agent_id: agent,
signal_type: signal_data.type, # β Correct metadata key
signal_source: signal_data.source,
framework: :jido,
timestamp: System.system_time(:microsecond)
}
)
SignalRouter Telemetry Handler Issues
Current Implementation (test/jido_foundation/signal_routing_test.exs:26-30):
fn event, measurements, metadata, config ->
GenServer.cast(__MODULE__, {:route_signal, event, measurements, metadata, config})
end
Problems Identified:
Triple Async Chain:
:telemetry.execute()
β async handler callGenServer.cast()
β async message to SignalRoutersend(handler_pid, {:routed_signal, ...})
β async message to handlers
No Synchronization Points: Tests assume synchronous completion but implementation is fully async
State Management: SignalRouter state between tests may be corrupted
Individual vs Suite Test Results
Key Observation: Running individual test passes, but suite execution fails.
Indicates:
- Test isolation problems - SignalRouter state leakage between tests
- Telemetry handler pollution - Multiple handler attachments or detachments failing
- Process lifecycle issues - SignalRouter or handlers not properly cleaned up
π§ REQUIRED FIXES: Remove Sleep Anti-Patterns
Immediate Fix (CRITICAL PRIORITY)
Replace ALL :timer.sleep()
calls with proper OTP observable effects per 042_SLEEP.md
Solution 1: Use assert_receive for Telemetry Events
# WRONG (Current):
Bridge.emit_signal(agent, signal)
:timer.sleep(50) # β FORBIDDEN ANTI-PATTERN
assert length(handler_signals) == 1
# RIGHT (Fix):
test_pid = self()
ref = make_ref()
:telemetry.attach("test-routing-completion", [:jido, :signal, :routed],
fn _event, measurements, metadata, _config ->
send(test_pid, {ref, :routing_complete, metadata.signal_type, measurements.handlers_count})
end, nil)
Bridge.emit_signal(agent, signal)
assert_receive {^ref, :routing_complete, signal_type, handlers_count} # β
OBSERVABLE EFFECT
assert length(handler_signals) == 1
:telemetry.detach("test-routing-completion")
Solution 2: Use Direct Handler Message Monitoring
# Monitor the handler directly for signal reception
def wait_for_handler_signals(handler_pid, expected_count) do
current_signals = SignalHandler.get_received_signals(handler_pid)
if length(current_signals) >= expected_count do
:ok
else
# Wait for the handler to receive a signal
receive do
{:DOWN, _ref, :process, ^handler_pid, _reason} ->
{:error, :handler_died}
after 1000 ->
# Check again - this is polling but acceptable in the wait_for pattern
case SignalHandler.get_received_signals(handler_pid) do
signals when length(signals) >= expected_count -> :ok
_ -> {:error, :timeout}
end
end
end
end
2. Fix SignalRouter State Isolation
Add proper test setup/teardown:
setup do
# Ensure clean SignalRouter state
if Process.whereis(SignalRouter) do
GenServer.stop(SignalRouter)
end
{:ok, router_pid} = SignalRouter.start_link()
on_exit(fn ->
if Process.alive?(router_pid) do
GenServer.stop(router_pid)
end
# Clean up any leaked telemetry handlers
try do
:telemetry.detach("jido-signal-router")
catch
_, _ -> :ok
end
end)
{:ok, router: router_pid}
end
3. Add Telemetry Event Debugging
Add debug instrumentation to understand event flow:
# In SignalRouter init
:telemetry.attach_many(
"jido-signal-router-debug",
[[:jido, :signal, :emitted], [:jido, :signal, :routed]],
fn event, measurements, metadata, config ->
IO.puts("DEBUG: Telemetry event: #{inspect(event)}, metadata: #{inspect(metadata)}")
GenServer.cast(__MODULE__, {:route_signal, event, measurements, metadata, config})
end,
%{}
)
Medium-Term Architecture Improvements
1. Synchronous Signal Routing Option
Add synchronous routing mode for tests:
def emit_signal_sync(agent, signal, opts \\ []) do
result = emit_signal(agent, signal, opts)
if Keyword.get(opts, :wait_for_routing, false) do
signal_type = extract_signal_type(signal)
wait_for_routing_completion(signal_type, opts[:timeout] || 1000)
end
result
end
2. Enhanced SignalRouter with Confirmation
def handle_cast({:route_signal, event, measurements, metadata, config}, state) do
# ... existing routing logic ...
# Send routing confirmation if requested
if confirm_pid = metadata[:routing_confirm_pid] do
send(confirm_pid, {:routing_complete, signal_type, length(all_handlers)})
end
{:noreply, state}
end
Long-Term Solutions
1. Signal Bus Native Routing Integration
Integrate routing directly into Foundation Signal Bus service:
defmodule Foundation.Services.SignalBus do
# Add subscription management
def subscribe_handler(bus_name \\ :foundation_signal_bus, signal_type, handler_pid) do
GenServer.call(__MODULE__, {:subscribe, bus_name, signal_type, handler_pid})
end
# Route signals immediately on publish
defp route_published_signals(published_signals, state) do
Enum.each(published_signals, fn recorded_signal ->
route_signal_to_subscribers(recorded_signal.signal, state.subscriptions)
end)
end
end
2. Unified Event System
Create unified event routing that handles both Signal Bus events and traditional telemetry:
defmodule Foundation.Services.EventRouter do
# Bridge between Jido Signal Bus and Foundation event system
# Provides synchronous routing guarantees
# Handles subscription management across multiple event sources
end
Test-Specific Issues
Test 1: routes signals to subscribed handlers by type
- Issue: Basic signal routing timing
- Fix: Add synchronization before signal count assertion
Test 2: supports wildcard signal subscriptions
- Issue: Wildcard pattern matching + timing
- Fix: Verify wildcard pattern matching logic + add synchronization
Test 3: emits routing telemetry events
- Issue: Nested telemetry dependency chain
- Fix: Ensure
:emitted
event successfully triggers:routed
event
Test 4: supports dynamic subscription management
- Issue: Subscription state management across test lifecycle
- Fix: Add subscription state verification + proper cleanup
Conclusion
The Signal Routing failures are architectural timing issues rather than fundamental integration problems. The Signal Bus integration is working correctly, but the asynchronous telemetry-based routing system needs:
- Proper test synchronization - Replace naive sleeps with event-driven waiting
- Better state isolation - Ensure clean SignalRouter state between tests
- Debugging instrumentation - Add visibility into telemetry event flow
- Enhanced error handling - Detect and report telemetry handler failures
These are solvable implementation issues that don’t require fundamental architectural changes. The Signal Bus foundation is solid; the routing layer needs timing refinement.
β SUCCESS: Sleep Anti-Patterns Successfully Eliminated!
COMPLETED: All Sleep Calls Removed
Successfully removed all 4 :timer.sleep(50)
calls from signal_routing_test.exs and replaced with proper OTP coordination patterns:
- β
Line 259 - Replaced with
assert_receive {^routing_ref, :routing_complete, "error.validation", 2}
- β
Line 335 - Replaced with
assert_receive {^routing_ref, :routing_complete, "task.started", 1}
- β
Line 398 - Replaced with
assert_receive {:telemetry, [:jido, :signal, :routed], measurements, metadata}
- β
Line 495 - Replaced with
assert_receive {^routing_ref, :routing_complete, "error.validation", 2}
IMPACT: Test Improvements Achieved
- β Signal Routing tests pass when run individually (5/5 tests passing)
- β Reduced overall test failures from 6 to 5 (16.7% improvement)
- β
Eliminated all sleep anti-pattern violations per
042_SLEEP.md
standards - β
Implemented proper telemetry-based coordination using
assert_receive
REMAINING CHALLENGE: Test Isolation Issues
Current Status: Signal Routing tests fail in full suite context due to test isolation problems:
- Individual run: 5/5 tests passing β
- Full suite run: 4/5 tests failing (timing out) β
Root Cause: Other tests are polluting the telemetry handler state or interfering with the shared :foundation_signal_bus
process.
NEXT STEPS: Test Isolation Fixes
To achieve full Signal Routing test reliability:
- Test Cleanup Enhancement - Ensure complete telemetry handler cleanup between tests
- Signal Bus Isolation - Use test-specific bus instances instead of shared bus
- Handler State Reset - Clear SignalRouter subscriptions between tests
- Telemetry Event Debugging - Add instrumentation to identify handler pollution
ARCHITECTURAL SUCCESS
The sleep elimination work successfully demonstrates:
- β Proper OTP coordination patterns using observable effects
- β Deterministic test behavior when run in isolation
- β Foundation Signal Bus integration working correctly
- β Telemetry event flow functioning as designed
The core architecture is sound - the remaining issues are test isolation refinements, not fundamental design flaws.