Foundation/ExUnit Race Condition Workaround
Overview
This document describes a race condition between Foundation telemetry handlers and ExUnit test cleanup, and the defensive workaround implemented in DSPEx.
The Problem
When using Foundation v0.1.3 with ExUnit tests, a race condition occurs during test cleanup:
- Test Completion: ExUnit tests complete successfully
- Cleanup Phase: ExUnit begins shutdown and cleanup procedures
- ETS Table Deletion: ExUnit.Server ETS table gets deleted
- Foundation Telemetry: Foundation telemetry handlers remain active and attempt to access the deleted ETS table
- Crash: a {:badarg, [{:ets, :take, [ExUnit.Server, #PID<...>]}]} error occurs
The error is raised after all tests have passed, yet it makes the overall test run appear to have failed.
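The crash itself can be reproduced in isolation. The sketch below is not Foundation or ExUnit code; it uses a throwaway table (the module and table names are made up for illustration) to show that calling :ets.take/2 on a deleted table raises an ArgumentError, which is how Elixir surfaces the :badarg reported above.

```elixir
defmodule EtsRaceSketch do
  # Hypothetical helper: creates a table, deletes it, then touches it again,
  # mimicking a handler that outlives the table it depends on.
  def take_after_delete do
    table = :ets.new(:sketch_table, [:set, :public])
    :ets.insert(table, {:key, :value})
    :ets.delete(table)

    try do
      :ets.take(table, :key)
    rescue
      # Erlang's :badarg is translated to ArgumentError by Elixir.
      ArgumentError -> :table_gone
    end
  end
end
```

In the real failure the table is owned by ExUnit.Server rather than by the calling process, but the error raised by the late access has the same shape.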
Root Causes
- Process Lifecycle Mismatch: Foundation telemetry processes outlive ExUnit cleanup
- ETS Access Violations: Telemetry handlers access deleted ETS tables
- Asynchronous Cleanup: Race between Foundation shutdown and ExUnit cleanup
Our Defensive Workaround
1. Enhanced Telemetry Handler Protection
def handle_dspex_event(event, measurements, metadata, config) do
  # Context attached to every skip log. Only pid and node are shown here;
  # the full handler may collect more fields.
  process_info = %{pid: self(), node: node()}

  # A telemetry failure must never propagate to the caller.
  try do
    if Foundation.available?() do
      do_handle_dspex_event(event, measurements, metadata, config)
    else
      log_telemetry_skip(:foundation_unavailable, event, process_info)
      :ok
    end
  rescue
    ArgumentError ->
      log_telemetry_skip(:ets_unavailable, event, process_info)
      :ok
    SystemLimitError ->
      log_telemetry_skip(:system_limit, event, process_info)
      :ok
    UndefinedFunctionError ->
      log_telemetry_skip(:undefined_function, event, process_info)
      :ok
    FunctionClauseError ->
      log_telemetry_skip(:function_clause, event, process_info)
      :ok
  catch
    :exit, {:noproc, _} ->
      log_telemetry_skip(:process_dead, event, process_info)
      :ok
    :exit, {:badarg, _} ->
      log_telemetry_skip(:ets_corruption, event, process_info)
      :ok
    :exit, {:normal, _} ->
      log_telemetry_skip(:process_shutdown, event, process_info)
      :ok
    kind, reason ->
      log_telemetry_skip({:unexpected_error, kind, reason}, event, process_info)
      :ok
  end
end
2. Graceful Shutdown Integration
defp setup_exunit_integration do
  # Monitor ExUnit completion to prepare for shutdown
  pid = self()

  spawn(fn ->
    ref = Process.monitor(ExUnit.Server)

    receive do
      {:DOWN, ^ref, :process, _pid, _reason} ->
        send(pid, :prepare_for_shutdown)
    end
  end)
end

def handle_info(:prepare_for_shutdown, state) do
  Logger.debug("DSPEx Telemetry: Preparing for graceful shutdown")
  graceful_detach_handlers()
  {:noreply, %{state | telemetry_active: false, handlers_attached: false}}
end
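The monitor-and-notify pattern can be exercised without ExUnit. The following is a minimal sketch, assuming a hypothetical ShutdownWatcherSketch module and an arbitrary target process standing in for ExUnit.Server; it only demonstrates how the :DOWN message is turned into a :prepare_for_shutdown notification.

```elixir
defmodule ShutdownWatcherSketch do
  # Spawns a watcher that monitors `target` (a pid or a registered name) and
  # notifies `notify_pid` once the target is down.
  def watch(target, notify_pid) do
    spawn(fn ->
      ref = Process.monitor(target)

      receive do
        {:DOWN, ^ref, :process, _object, _reason} ->
          send(notify_pid, :prepare_for_shutdown)
      end
    end)
  end
end
```

If the monitored name is not registered at the time of the call, the :DOWN message arrives immediately with reason :noproc, so the notification is still sent when the watcher starts too late.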
3. Foundation Availability Checking
Before executing any Foundation telemetry calls, we check:
if Foundation.available?() do
  # Proceed with telemetry
else
  # Skip gracefully
end
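How Foundation.available?() is implemented is not shown here. As an illustration of the same idea applied to a plain ETS table, the sketch below uses a hypothetical AvailabilitySketch module built on :ets.whereis/1. Because the table can disappear between the check and the access, the check is paired with a rescue rather than used on its own.

```elixir
defmodule AvailabilitySketch do
  # True when a named ETS table currently exists.
  def table_available?(name) when is_atom(name) do
    :ets.whereis(name) != :undefined
  end

  # Check first, but still rescue: the table may be deleted after the check.
  def lookup_if_available(name, key) do
    if table_available?(name) do
      try do
        {:ok, :ets.lookup(name, key)}
      rescue
        ArgumentError -> :skipped
      end
    else
      :skipped
    end
  end
end
```

This is why the availability check complements the handler-level error handling instead of replacing it.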
4. Comprehensive Error Categories Handled
- ETS Unavailability: ArgumentError when ETS tables are deleted
- Process Death: :exit, {:noproc, _} when processes are gone
- ETS Corruption: :exit, {:badarg, _} during table access
- System Stress: SystemLimitError during high load
- Function Unavailability: UndefinedFunctionError during shutdown
- Contract Violations: FunctionClauseError from Foundation changes
- Normal Shutdown: :exit, {:normal, _} during graceful cleanup
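The categories above can also be written as a lookup from a caught kind and reason to a skip reason. This is a sketch only: the SkipReasonSketch module is hypothetical, and it uses a single catch clause with Exception.normalize/3 instead of separate rescue clauses, which is a different structure from the handler shown earlier.

```elixir
defmodule SkipReasonSketch do
  # Maps a caught kind/reason pair to the skip reasons used in this document.
  def classify(:error, %ArgumentError{}), do: :ets_unavailable
  def classify(:error, %SystemLimitError{}), do: :system_limit
  def classify(:error, %UndefinedFunctionError{}), do: :undefined_function
  def classify(:error, %FunctionClauseError{}), do: :function_clause
  def classify(:exit, {:noproc, _}), do: :process_dead
  def classify(:exit, {:badarg, _}), do: :ets_corruption
  def classify(:exit, {:normal, _}), do: :process_shutdown
  def classify(kind, reason), do: {:unexpected_error, kind, reason}

  # Runs `fun`, returning :ok on success or {:skipped, reason} on any failure.
  def run(fun) when is_function(fun, 0) do
    fun.()
    :ok
  catch
    kind, reason ->
      # Turns raw Erlang errors such as :badarg into exception structs.
      normalized = Exception.normalize(kind, reason, __STACKTRACE__)
      {:skipped, classify(kind, normalized)}
  end
end
```

Keeping the mapping in one function makes it easier to see which failures are expected and which fall through to the catch-all clause.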
5. Debug Monitoring
Configurable debug logging helps track workaround effectiveness:
# config/test.exs
config :dspex,
  telemetry_debug: false # Set to true to debug telemetry issues
When enabled, it produces detailed logs such as:
DSPEx Telemetry Handler: Skipped event due to :ets_unavailable
Event: [:dspex, :predict, :start]
Process Info: %{pid: #PID<...>, node: :nonode@nohost, ...}
This is expected during test cleanup or Foundation shutdown.
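The body of log_telemetry_skip/3 is not shown in this document. One way it could be gated on the telemetry_debug setting is sketched below; the TelemetryDebugSketch module and its format_skip/3 helper are assumptions, and the message layout simply mirrors the sample output above.

```elixir
defmodule TelemetryDebugSketch do
  require Logger

  # Builds the multi-line message shown in the sample log output.
  def format_skip(reason, event, process_info) do
    """
    DSPEx Telemetry Handler: Skipped event due to #{inspect(reason)}
      Event: #{inspect(event)}
      Process Info: #{inspect(process_info)}
      This is expected during test cleanup or Foundation shutdown.
    """
  end

  # Logs only when `config :dspex, telemetry_debug: true` is set.
  def log_telemetry_skip(reason, event, process_info) do
    if Application.get_env(:dspex, :telemetry_debug, false) do
      Logger.debug(fn -> format_skip(reason, event, process_info) end)
    end

    :ok
  end
end
```

Passing a function to Logger.debug/1 defers building the message until the logger decides the entry will actually be emitted.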
Configuration
Test Environment
# config/test.exs
config :dspex,
  telemetry_debug: false,
  telemetry: %{
    enabled: true,
    defensive_mode: true,
    graceful_shutdown: true
  }
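How DSPEx reads these settings at runtime is not described here. A minimal sketch of reading the :telemetry map with fallbacks, assuming a hypothetical TelemetryConfigSketch module and defaults that mirror the test configuration above, could look like this:

```elixir
defmodule TelemetryConfigSketch do
  # Assumed defaults; they mirror the values in the test configuration.
  @defaults %{enabled: true, defensive_mode: true, graceful_shutdown: true}

  # Merges whatever is configured under `config :dspex, telemetry: ...`
  # over the defaults, accepting either a map or a keyword list.
  def telemetry_config do
    configured = Application.get_env(:dspex, :telemetry, %{})
    Map.merge(@defaults, Map.new(configured))
  end

  def defensive_mode?, do: telemetry_config().defensive_mode
end
```

Merging over defaults means a partially specified map still yields a value for every flag.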
Production Environment
The same defensive patterns work in production, protecting against:
- Application shutdown race conditions
- High-load scenarios
- Network partitions affecting Foundation
Testing the Workaround
We’ve implemented comprehensive tests in test/unit/telemetry_race_condition_test.exs:
- Shutdown Survival Test: Validates handlers survive Foundation shutdown
- ETS Unavailability Test: Simulates ETS table deletion scenarios
- Lifecycle Management Test: Tests state transitions during shutdown
- Stress Test: High-concurrency telemetry with simulated failures
- Availability Check Test: Validates Foundation availability checking
Run tests with:
mix test --only telemetry_race_condition
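The --only filter selects tests by tag, so the tests need a matching tag. The contents of the real test file are not reproduced here; the sketch below is a hypothetical, self-contained test module showing the tag together with a simulated ETS deletion. The safe_take/2 helper is a stand-in for a defensive handler, not DSPEx code.

```elixir
# Normally called from test/test_helper.exs. Autorun is disabled here so the
# sketch can be run explicitly with ExUnit.run/0.
ExUnit.start(autorun: false)

defmodule TelemetryRaceConditionSketchTest do
  use ExUnit.Case, async: false

  # Selected by `mix test --only telemetry_race_condition`.
  @moduletag :telemetry_race_condition

  # Hypothetical stand-in for a defensive handler that touches ETS.
  defp safe_take(table, key) do
    :ets.take(table, key)
  rescue
    ArgumentError -> :ok
  end

  test "handler survives ETS table deletion" do
    table = :ets.new(:sketch_race_table, [:set, :public])
    :ets.delete(table)

    assert safe_take(table, :some_key) == :ok
  end
end
```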
Status and Impact
✅ Immediate Protection
- Tests Run Clean: No more crashes after successful test completion
- Graceful Degradation: Telemetry fails safely when Foundation is unavailable
- Production Stability: Same protections work in production environments
📊 Monitoring
- Debug Logging: Optional detailed logging for troubleshooting
- State Tracking: Telemetry service state management during lifecycle transitions
- Process Monitoring: Enhanced visibility into race conditions
🔄 Foundation Integration Maintained
- Full Feature Support: All Foundation APIs (Infrastructure, Telemetry, Events) remain functional
- Performance Tracking: Comprehensive telemetry continues when Foundation is available
- Backward Compatibility: Workaround doesn’t break existing functionality
Reported Issue
This race condition has been reported to the Foundation team as GitHub Issue #4 with:
- Detailed reproduction steps
- Root cause analysis
- Recommended fixes for Foundation
- Success criteria for resolution
Future Improvements
Once the Foundation team addresses the root cause, we can:
- Reduce Defensive Code: Simplify handlers once the race condition is fixed
- Remove Workarounds: Clean up defensive patterns when no longer needed
- Enhanced Integration: Use Foundation’s native ExUnit integration when available
Usage Guidelines
For DSPEx Developers
- Keep Defensive Patterns: Don’t remove defensive code until the Foundation issue is resolved
- Monitor Debug Logs: Enable telemetry_debug: true when investigating issues
- Test Race Conditions: Run telemetry race condition tests regularly
For Foundation Users
This workaround pattern can be applied to any Elixir application using Foundation:
- Wrap Telemetry Calls: Use try/rescue/catch around Foundation telemetry calls
- Check Availability: Verify Foundation.available?() before calls
- Handle ETS Errors: Gracefully handle :badarg and :noproc exits
- Monitor Cleanup: Watch for ExUnit.Server lifecycle if using tests
The defensive programming patterns here provide a robust template for Foundation integration in any application.