← Back to Foundation

SUPERVISION STRAT ERROR

Documentation for SUPERVISION_STRAT_ERROR from the Foundation repository.

Supervision Strategy Error Analysis

Executive Summary

The error counting issue in TaskAgent reveals fundamental supervision strategy problems in our Foundation/JidoSystem integration. We’re trying to manage business logic state (error counting) through OTP supervision mechanisms, which violates separation of concerns and leads to complex, unreliable solutions.

Root Cause Analysis

The Fundamental Problem

We’re conflating operational supervision with business logic state management.

  • OTP Supervision: Designed for process lifecycle management, crash recovery, and fault tolerance
  • Business Logic State: Application-specific data like error counts, metrics, and user state
  • Our Error: Trying to use supervision callbacks to persist business state across process restarts

Why Current Approach Fails

  1. Wrong Abstraction Level: Error counting is business logic, not operational supervision
  2. State Persistence Confusion: OTP supervision assumes stateless process recovery
  3. Callback Misuse: Using on_error for state mutations instead of telemetry/cleanup
  4. Framework Fighting: Working against Jido’s intended architecture instead of with it

Jido Framework Supervision Architecture

Jido’s Design Philosophy

Jido Application Supervisor
├── Dynamic Agent Supervisor (DynamicSupervisor)
│   ├── Agent.Server (GenServer) - Process lifecycle
│   │   ├── Agent State (ephemeral) - Business logic
│   │   └── Child Processes (supervised)
│   └── Registry (ETS) - Discovery
└── Task Supervisor - Async operations

Key Principles:

  • Process supervision handles crashes and restarts
  • Agent state is ephemeral and rebuilt on restart
  • Business logic state should be externalized if persistence is needed
  • Callbacks are for cleanup and telemetry, not state mutations

Jido State Management Flow

# CORRECT: State transitions through validated pathways
Agent State -> Action -> Validation -> New State -> Persistence (if needed)

# INCORRECT: State mutations through error callbacks
Error -> on_error -> State Mutation -> Framework Confusion

Foundation Supervision Architecture

Current Foundation Structure

Foundation Application Supervisor (:one_for_one)
├── Foundation.Services.Supervisor (:one_for_one)
│   ├── RetryService
│   ├── ConnectionManager
│   ├── RateLimiter
│   └── SignalBus
├── MABEAM.Supervisor (:rest_for_one)
│   ├── AgentRegistry
│   ├── AgentCoordination
│   └── AgentInfrastructure
└── JidoSystem (NO SUPERVISION) ⚠️
    └── Manual agent creation

Critical Gap: JidoSystem agents are NOT supervised by Foundation, creating orphaned processes.

The Error Counting Problem - Proper Solution

Where Error Counting Should Live

NOT in OTP supervision callbacks - these are for process management, not business logic.

YES in agent business logic with proper persistence strategy:

defmodule TaskAgent do
  # Business logic state - ephemeral, rebuilt on restart
  defstruct error_count: 0, status: :idle
  
  # Error counting through normal action flow
  def handle_action_error(agent, error) do
    new_count = agent.state.error_count + 1
    new_state = %{agent.state | error_count: new_count}
    
    # Persist if needed
    if should_persist_errors?() do
      ErrorStore.record_error(agent.id, error)
    end
    
    # Apply business rules
    if new_count >= 10 do
      %{new_state | status: :paused}
    else
      new_state
    end
  end
end

Proper Supervision Strategy for Error Management

# 1. Business Logic Layer (Agent State)
- Error counting and thresholds
- Automatic pausing/resuming
- Immediate error handling

# 2. Operational Layer (Supervision)
- Process restart on crashes
- Resource cleanup
- Health monitoring

# 3. Persistence Layer (External)
- Error history storage
- Metrics collection
- Analytics and reporting

1. Add JidoSystem Supervision

defmodule JidoSystem.Application do
  use Application
  
  def start(_type, _args) do
    children = [
      # Critical agent supervision
      {DynamicSupervisor, name: JidoSystem.AgentSupervisor, strategy: :one_for_one},
      
      # Error persistence service
      JidoSystem.ErrorStore,
      
      # Agent health monitoring
      JidoSystem.HealthMonitor
    ]
    
    Supervisor.start_link(children, strategy: :one_for_one)
  end
end

2. Separate Error Counting Concerns

defmodule TaskAgent do
  # Business logic - in agent state
  def handle_error(agent, error) do
    new_count = agent.state.error_count + 1
    
    # Emit telemetry for monitoring
    :telemetry.execute([:task_agent, :error], %{count: 1}, %{
      agent_id: agent.id,
      error_type: classify_error(error)
    })
    
    # Apply business rules
    new_state = update_error_count(agent.state, new_count)
    {:ok, %{agent | state: new_state}}
  end
  
  # Operational concern - separate process
  defp emit_error_telemetry(agent_id, error) do
    JidoSystem.ErrorStore.record_error(agent_id, error)
  end
end

3. External Error Persistence

defmodule JidoSystem.ErrorStore do
  use GenServer
  
  # Supervised service for error persistence
  def record_error(agent_id, error) do
    GenServer.cast(__MODULE__, {:record_error, agent_id, error})
  end
  
  def get_error_count(agent_id) do
    GenServer.call(__MODULE__, {:get_count, agent_id})
  end
  
  # Initialize agent error count from persistent store
  def initialize_agent_errors(agent_id) do
    get_error_count(agent_id)
  end
end

4. Proper Integration with Foundation

# Foundation supervises JidoSystem
Foundation.Supervisor
├── Foundation.Services.Supervisor
├── MABEAM.Supervisor  
└── JidoSystem.Supervisor  # NEW - proper supervision
    ├── JidoSystem.AgentSupervisor
    ├── JidoSystem.ErrorStore
    └── JidoSystem.HealthMonitor

Implementation Strategy

Phase 1: Fix Current Error Counting (Immediate)

  1. Remove hacky spawning and async mechanisms
  2. Implement error counting in agent action flow
  3. Use telemetry for error event emission
  4. Add external error persistence if needed

Phase 2: Add JidoSystem Supervision (Short Term)

  1. Create JidoSystem.Application module
  2. Add DynamicSupervisor for critical agents
  3. Implement proper child specifications
  4. Add health monitoring service

Phase 3: Integrate with Foundation Supervision (Medium Term)

  1. Add JidoSystem.Supervisor to Foundation supervision tree
  2. Coordinate error handling across supervision layers
  3. Implement proper resource management
  4. Add comprehensive monitoring and alerting

Key Principles for Supervision Strategy

Separation of Concerns

  • Supervision: Process lifecycle, crash recovery, resource cleanup
  • Business Logic: Application state, error counting, business rules
  • Persistence: External storage, metrics, analytics

State Management

  • Ephemeral State: In agent state, rebuilt on restart
  • Persistent State: External services, survives restarts
  • Operational State: Supervision metadata, health status

Error Boundaries

  • Agent Level: Business logic errors, handled gracefully
  • Process Level: Crashes, handled by supervisor restart
  • System Level: Service failures, handled by supervision tree

Design Principles

  1. Let OTP do what it’s designed for - process management
  2. Don’t fight the framework - work with Jido’s architecture
  3. Separate business logic from operational concerns
  4. Use proper persistence for data that must survive restarts
  5. Design for observability - comprehensive telemetry and monitoring

Conclusion

The error counting issue is a symptom of architectural confusion between OTP supervision and business logic state management. The solution is not to hack around the frameworks, but to properly separate concerns and use each layer for its intended purpose.

By implementing proper supervision strategy and separating business logic from operational concerns, we can build a robust, maintainable system that leverages the strengths of both Foundation and Jido frameworks while following established OTP principles.