LATEST_ERRORS Implementation Plan
Executive Summary
Based on comprehensive analysis of LATEST_ERRORS_cat.md, lib_old error handling implementations, and the fuse circuit breaker library, this plan provides a strategic approach to fixing the 20 test failures and establishing a robust error handling foundation.
Priority Order: Error Handler standardization → SystemHealthSensor fixes → Semantic issues → Circuit Breaker decision point
Circuit Breaker Implementation Analysis
RECOMMENDATION: USE Fuse Library Directly
After thorough analysis of the fuse Erlang library, the recommendation is to USE the fuse library directly rather than FORK or COPY/ADAPT.
Why USE Fuse?
Battle-Tested: 8+ years in production systems, extensive QuickCheck testing
Performance: 2.1M queries/second, sub-microsecond latency
Robust Design:
- Comprehensive EQC testing found and fixed subtle race conditions
- Handles timer edge cases, administrative controls, and concurrent access
- Built specifically for high-throughput Erlang/Elixir systems
Complete Feature Set:
- Standard and fault injection fuses
- Administrative disable/enable
- Multiple monitoring backends (ETS, Prometheus, Folsom)
- Proper alarm handling with hysteresis
API Simplicity:
% Install circuit breaker
fuse:install(database_fuse, {{standard, 5, 10000}, {reset, 60000}}).

% Use circuit breaker
case fuse:ask(database_fuse, sync) of
    ok -> perform_operation();
    blown -> handle_circuit_open()
end
Integration Strategy
- Phase 1: Replace Foundation.Infrastructure.CircuitBreaker with thin Elixir wrapper around fuse
- Phase 2: Integrate fuse telemetry with Foundation.Telemetry system
- Phase 3: Add Elixir-native configuration and supervision integration
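A minimal sketch of what the Phase 1 wrapper might look like, assuming the fuse dependency is available in the project. The module name CircuitBreakerSketch, the install/4 and call/2 functions, and the :circuit_open and :fuse_not_installed reasons are hypothetical; only :fuse.install/2, :fuse.ask/2 and :fuse.melt/1 come from the library itself.

```elixir
defmodule CircuitBreakerSketch do
  @moduledoc "Hypothetical thin wrapper around the Erlang :fuse library."

  # Installs a standard fuse that tolerates `max_failures` melts within
  # `window_ms` and resets `reset_ms` after it blows.
  def install(name, max_failures \\ 5, window_ms \\ 10_000, reset_ms \\ 60_000) do
    :fuse.install(name, {{:standard, max_failures, window_ms}, {:reset, reset_ms}})
  end

  # Runs `fun` only while the fuse is intact and melts the fuse whenever
  # `fun` returns an {:error, _} tuple.
  def call(name, fun) when is_function(fun, 0) do
    case :fuse.ask(name, :sync) do
      :ok ->
        result = fun.()
        if match?({:error, _}, result), do: :fuse.melt(name)
        result

      :blown ->
        {:error, :circuit_open}

      {:error, :not_found} ->
        {:error, :fuse_not_installed}
    end
  end
end
```

Callers would wrap a protected operation as `CircuitBreakerSketch.call(:database_fuse, fn -> query() end)`, where `query/0` stands in for any function returning an ok/error tuple.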
Error Handling Analysis
Current lib_old Implementation Assessment
The lib_old error handling system shows stronger design patterns than the current Foundation implementation:
lib_old Strengths:
- Hierarchical Error Codes: Structured {category, subcategory, error_type} system
- Rich Context: ErrorContext with breadcrumbs, correlation IDs, operation tracking
- Recovery Strategies: Built-in retry strategies and recovery action suggestions
- Telemetry Integration: Automatic metrics collection and duration tracking
Current Foundation Weaknesses:
- Inconsistent Wrapping: ErrorHandler sometimes wraps, sometimes propagates exceptions
- Poor Context: Limited error context compared to lib_old capabilities
- No Recovery Strategy: Missing retry and recovery guidance
RECOMMENDATION: Adopt lib_old Patterns with Modernization
Strategy: Use lib_old error handling as foundation, modernize for current architecture.
Implementation Plan
🔥 PHASE 1: Error Handler Standardization (Weeks 1-2)
Objective: Replace the inconsistent Foundation.ErrorHandler with a standardized system based on lib_old patterns.
1.1 Core Error Structure Implementation
# lib/foundation/error.ex - Based on lib_old/foundation/error.ex
defmodule Foundation.Error do
  @enforce_keys [:code, :error_type, :message, :severity]
  defstruct [
    :code, # Hierarchical error code (1101, 2401, etc.)
    :error_type, # Specific error atom (:config_not_found, :timeout)
    :message, # Human readable message
    :severity, # :low | :medium | :high | :critical
    :context, # Rich context map
    :correlation_id, # Cross-system tracing
    :timestamp, # Error occurrence time
    :stacktrace, # Formatted stacktrace
    :category, # :config | :system | :data | :external
    :subcategory, # :structure | :validation | :access | :runtime
    :retry_strategy, # :no_retry | :immediate | :fixed_delay | :exponential_backoff
    :recovery_actions # List of suggested recovery steps
  ]

  # Error definitions with hierarchical codes
  @error_definitions %{
    {:config, :structure, :invalid_config_structure} => {1101, :high, "Configuration structure is invalid"},
    {:system, :runtime, :internal_error} => {2401, :critical, "Internal system error"},
    {:external, :timeout, :timeout} => {4301, :medium, "Operation timeout"},
    # ... (port from lib_old)
  }
end
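To illustrate how the definitions table could drive error construction, here is a minimal sketch. ErrorSketch is a stand-in for Foundation.Error with a reduced field set; the new/3 constructor and the 9999 fallback for unregistered error types are assumptions, not something taken from lib_old.

```elixir
defmodule ErrorSketch do
  # Stand-in for Foundation.Error with a reduced field set.
  @enforce_keys [:code, :error_type, :message, :severity]
  defstruct [:code, :error_type, :message, :severity, :category, :subcategory, context: %{}]

  # The three definitions listed in the plan; a real port would carry the full table.
  @error_definitions %{
    {:config, :structure, :invalid_config_structure} =>
      {1101, :high, "Configuration structure is invalid"},
    {:system, :runtime, :internal_error} => {2401, :critical, "Internal system error"},
    {:external, :timeout, :timeout} => {4301, :medium, "Operation timeout"}
  }

  # Assumed fallback for unregistered error types (hypothetical).
  @unknown {{:unknown, :unknown, :unknown}, {9999, :medium, "Unknown error"}}

  # Looks up the error type in the table and fills in code, severity,
  # category and subcategory; a custom message overrides the default one.
  def new(error_type, message \\ nil, context \\ %{}) do
    {{category, subcategory, _type}, {code, severity, default_message}} =
      Enum.find(@error_definitions, @unknown, fn {{_cat, _sub, type}, _definition} ->
        type == error_type
      end)

    %__MODULE__{
      code: code,
      error_type: error_type,
      message: message || default_message,
      severity: severity,
      category: category,
      subcategory: subcategory,
      context: context
    }
  end
end
```

With this shape, `ErrorSketch.new(:timeout)` yields a struct carrying code 4301, severity :medium and category :external without the caller repeating any of them.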
1.2 Error Context Implementation
# lib/foundation/error_context.ex - Based on lib_old/foundation/error_context.ex
defmodule Foundation.ErrorContext do
  defstruct [
    :operation_id, # Unique operation identifier
    :module, # Module where operation started
    :function, # Function where operation started
    :correlation_id, # Cross-system correlation
    :start_time, # Operation start timestamp
    :metadata, # Additional context data
    :breadcrumbs, # Operation trail
    :parent_context # Nested context support
  ]

  def with_context(context, fun) do
    # Execute function with error context tracking
    # Automatic exception enhancement with context
  end

  def add_breadcrumb(context, module, function, metadata \\ %{}) do
    # Track operation flow for debugging
  end
end
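The two stubbed functions could be filled in roughly as follows. This is a sketch under assumptions: ErrorContextSketch, its new/3 constructor, the breadcrumb map shape and the error map returned on failure are all hypothetical, and the lib_old implementation may differ.

```elixir
defmodule ErrorContextSketch do
  # Reduced stand-in for Foundation.ErrorContext.
  defstruct [:module, :function, :correlation_id, :start_time, metadata: %{}, breadcrumbs: []]

  def new(module, function, metadata \\ %{}) do
    %__MODULE__{
      module: module,
      function: function,
      # A unique integer stands in for a real correlation ID here.
      correlation_id: System.unique_integer([:positive]),
      start_time: System.monotonic_time(),
      metadata: metadata
    }
  end

  # Appends one step to the operation trail and returns the updated context.
  def add_breadcrumb(%__MODULE__{} = context, module, function, metadata \\ %{}) do
    crumb = %{module: module, function: function, metadata: metadata}
    %{context | breadcrumbs: context.breadcrumbs ++ [crumb]}
  end

  # Runs `fun`; a raised exception is returned as an error tuple that
  # carries the context and stacktrace instead of crashing the caller.
  def with_context(%__MODULE__{} = context, fun) when is_function(fun, 0) do
    {:ok, fun.()}
  rescue
    exception ->
      {:error, %{exception: exception, context: context, stacktrace: __STACKTRACE__}}
  end
end
```

The context is an immutable value, so each breadcrumb call returns a new struct that the caller threads through subsequent steps.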
1.3 Standardized Error Handling Patterns
- Consistent Wrapping: All Foundation functions return {:ok, result} or {:error, %Foundation.Error{}}
- Exception Enhancement: Raw exceptions automatically wrapped with context
- Telemetry Integration: Automatic error metrics collection
- Recovery Guidance: Built-in retry strategies and recovery actions
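As an illustration of how the retry strategies named in the error struct could translate into delays, here is a small sketch. The module name, the delay_ms/4 signature and the default base and cap values are assumptions chosen for the example.

```elixir
defmodule RetryDelaySketch do
  # Maps a retry strategy and a 1-based attempt number to a delay in
  # milliseconds, or :stop when no retry should happen.
  def delay_ms(strategy, attempt, base_ms \\ 100, max_ms \\ 5_000)

  def delay_ms(:no_retry, _attempt, _base_ms, _max_ms), do: :stop
  def delay_ms(:immediate, _attempt, _base_ms, _max_ms), do: 0
  def delay_ms(:fixed_delay, _attempt, base_ms, _max_ms), do: base_ms

  # Doubles the delay on every attempt and caps it at max_ms.
  def delay_ms(:exponential_backoff, attempt, base_ms, max_ms) when attempt >= 1 do
    min(base_ms * Integer.pow(2, attempt - 1), max_ms)
  end
end
```

A caller holding an error with retry_strategy: :exponential_backoff would sleep for `RetryDelaySketch.delay_ms(:exponential_backoff, attempt)` before the next attempt.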
1.4 Migration Strategy
- Replace Foundation.ErrorHandler: Direct replacement with new Foundation.Error
- Update All Foundation Modules: Consistent error return patterns
- Test Migration: Update failing Foundation tests to expect new error format
- Backward Compatibility: Temporary shim for gradual migration
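One possible shape for the temporary shim is sketched below. It assumes legacy callers match on bare {:error, reason} tuples, which the plan does not state; ShimErrorSketch stands in for Foundation.Error, and both module names are hypothetical.

```elixir
defmodule ShimErrorSketch do
  # Stand-in for Foundation.Error, reduced to the fields the shim needs.
  defstruct [:error_type, :message]
end

defmodule LegacyShimSketch do
  # Downgrades the new structured error to the bare-reason tuple assumed
  # for legacy callers; every other result passes through unchanged.
  def to_legacy({:error, %ShimErrorSketch{error_type: type}}), do: {:error, type}
  def to_legacy(other), do: other
end
```

Call sites that have not yet migrated could pipe results through `LegacyShimSketch.to_legacy/1`, and the shim can be deleted once no caller depends on the old shape.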
Expected Impact: Fixes FAILURE 1 (Foundation Configuration Error Handling)
🔧 PHASE 2: SystemHealthSensor Fixes (Weeks 3-4)
Objective: Fix the 15 of 20 test failures caused by SystemHealthSensor malfunction.
2.1 Root Cause Analysis
Current failures stem from:
- Rigid Data Assumptions: get_average_cpu_utilization/1 expects specific tuple format
- Signal Format Inconsistency: Mixed %Signal{} vs {:ok, %Signal{}} patterns
- No Error Recovery: Crashes on unexpected data instead of graceful degradation
2.2 Robust Input Validation
# lib/jido_system/sensors/system_health_sensor.ex - Enhanced
defmodule JidoSystem.Sensors.SystemHealthSensor do
  # Logger.warning/1 is a macro, so Logger must be required before use
  require Logger

  def get_average_cpu_utilization(scheduler_data) do
    case validate_scheduler_data(scheduler_data) do
      {:ok, validated_data} ->
        average = calculate_average(validated_data)
        {:ok, average}

      {:error, reason} ->
        Logger.warning("Invalid scheduler data: #{inspect(reason)}")
        {:ok, 0.0} # Graceful degradation
    end
  end

  defp validate_scheduler_data(data) when is_list(data) do
    case Enum.all?(data, &valid_scheduler_tuple?/1) do
      true -> {:ok, data}
      false -> {:error, :invalid_tuple_format}
    end
  end

  defp validate_scheduler_data(_), do: {:error, :not_list}

  defp valid_scheduler_tuple?({_scheduler_id, usage}) when is_number(usage), do: true
  defp valid_scheduler_tuple?(_), do: false
end
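The calculate_average/1 helper is referenced but not shown. Because Enum.all?/2 returns true for an empty list, an empty scheduler list passes validation, so the helper has to guard against dividing by zero. A minimal sketch, with a hypothetical module name and an assumed 0.0 default for the empty case:

```elixir
defmodule CpuAverageSketch do
  # An empty list passes validation, so return a safe default instead of
  # dividing by zero.
  def calculate_average([]), do: 0.0

  # Averages the usage component of {scheduler_id, usage} tuples.
  def calculate_average(data) when is_list(data) do
    total = data |> Enum.map(fn {_scheduler_id, usage} -> usage end) |> Enum.sum()
    total / length(data)
  end
end
```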
2.3 Consistent Signal Handling
# Standardize signal wrapping patterns
defp emit_signal(data, type) do
  signal = %Jido.Signal{
    data: data,
    type: type,
    timestamp: DateTime.utc_now(),
    metadata: %{sensor: __MODULE__}
  }

  # Always return consistent format
  {:ok, signal}
end

defp handle_error(error) do
  error_signal = %Jido.Signal{
    data: %{error: error},
    type: :error,
    timestamp: DateTime.utc_now(),
    metadata: %{sensor: __MODULE__}
  }

  {:ok, error_signal} # Consistent error signal format
end
2.4 Defensive Programming Patterns
- Graceful Degradation: Return safe defaults on invalid data
- Comprehensive Logging: Log issues without crashing
- Error Signal Format: Consistent error signal structure
- Input Validation: Validate all inputs before processing
Expected Impact: Fixes FAILURES 6-20 (SystemHealthSensor - 75% of all failures)
⚡ PHASE 3: Semantic Issues Resolution (Week 5)
Objective: Address remaining semantic issues and code quality problems.
3.1 Jido Agent Registration Enhancement
# Fix agent registration to handle dead processes
defp register_agent_with_retry(agent_info, max_retries \\ 3) do
  case Foundation.register(agent_info.key, agent_info.pid, agent_info.metadata) do
    :ok ->
      :ok

    {:error, %Foundation.Error{error_type: :process_not_alive}} ->
      if max_retries > 0 do
        Process.sleep(100) # Brief delay
        register_agent_with_retry(agent_info, max_retries - 1)
      else
        Logger.warning("Failed to register agent after retries: #{inspect(agent_info)}")
        {:error, :registration_failed}
      end

    error ->
      error
  end
end
3.2 Telemetry Handler Optimization
# Convert anonymous functions to named module functions
defmodule Foundation.TelemetryHandlers do
  require Logger

  # The handler config is unused, so it is prefixed with an underscore
  def handle_jido_events(event, measurements, metadata, _config) do
    # Named function for better performance
    Logger.info("Jido event: #{inspect(event)}",
      measurements: measurements,
      metadata: metadata
    )
  end
end

# In tests, use named handler
:telemetry.attach(
  "test-jido-events",
  [:jido, :agent, :event],
  &Foundation.TelemetryHandlers.handle_jido_events/4,
  %{}
)
3.3 Code Quality Fixes
- Remove Unused Variables: Prefix with underscore or remove entirely
- Fix Unreachable Clauses: Correct pattern matching order
- Update Test Expectations: Align with new error formats
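For the unreachable-clause fix, the rule is that more specific patterns must come before general ones. A small illustration with a hypothetical module and function:

```elixir
defmodule ClauseOrderSketch do
  # The specific clause comes first. If the {:error, _reason} clause were
  # listed above it, the :timeout clause could never match and the
  # compiler would warn about it.
  def classify({:error, :timeout}), do: :retryable
  def classify({:error, _reason}), do: :fatal
  def classify(_other), do: :ok
end
```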
Expected Impact: Fixes remaining semantic issues, improves code quality
🚧 PHASE 4: Circuit Breaker Implementation Decision Point
At this point, pause for a decision on the circuit breaker strategy.
Option A: Use Fuse Library (RECOMMENDED)
- Pros: Battle-tested, high performance, comprehensive features
- Cons: Erlang dependency, learning curve for team
- Implementation: 1-2 weeks for Elixir wrapper and integration
Option B: Reimplement Based on Fuse
- Pros: Pure Elixir, full control, integrated with Foundation
- Cons: 4-6 weeks development, need to reimplement battle-tested logic
- Risk: Subtle race conditions, timer handling edge cases
Option C: Minimal Circuit Breaker
- Pros: Quick implementation, minimal dependencies
- Cons: Limited features, potential production issues
- Scope: Basic state machine without advanced features
RECOMMENDATION: Option A (Use Fuse)
The analysis strongly favors using the fuse library directly due to its proven reliability and comprehensive testing.
Success Metrics
Phase 1 Success Criteria
- All Foundation.ErrorHandler tests pass with new error format
- Foundation configuration error handling returns proper {:error, %Foundation.Error{}} tuples
- Error context tracking works across module boundaries
- FAILURE 1 resolved
Phase 2 Success Criteria
- All 15 SystemHealthSensor tests pass
- Sensor handles invalid data gracefully without crashes
- Consistent signal format across all sensor operations
- FAILURES 6-20 resolved
Phase 3 Success Criteria
- Agent registration handles dead processes gracefully
- Telemetry handlers use named functions
- All compiler warnings resolved
- Remaining semantic issues addressed
Overall Success Target
- 20/20 test failures resolved
- Error handling standardized across Foundation
- Circuit breaker strategy decided and implementation path clear
- Code quality improved with comprehensive error context
Risk Assessment
High Risk Items
- Error Format Migration: Changing error formats may break dependent code
- SystemHealthSensor Refactor: Critical monitoring component, must maintain functionality
- Circuit Breaker Integration: Fuse library integration complexity
Mitigation Strategies
- Gradual Migration: Maintain backward compatibility during transition
- Comprehensive Testing: Test each phase thoroughly before proceeding
- Documentation: Document new error patterns and migration guide
- Rollback Plan: Ability to revert changes if issues arise
Timeline Risk
- Conservative Estimate: 6-8 weeks total
- Optimistic Estimate: 4-5 weeks total
- Contingency: Additional 2 weeks for unforeseen issues
Conclusion
This plan provides a systematic approach to resolving the Foundation’s structural issues while establishing robust error handling patterns. The decision to use the fuse library represents a strategic choice favoring proven reliability over custom implementation.
Key Success Factors:
- Standardized Error Handling: Foundation.Error based on lib_old patterns
- Robust Sensor Design: SystemHealthSensor with defensive programming
- Battle-Tested Circuit Breaker: Fuse library integration
- Quality Code Patterns: Comprehensive error context and recovery
Next Action: Begin Phase 1 (Error Handler Standardization) after plan approval.