LIB_OLD Functionality Integration Plan - Sound Architecture Rebuild
Executive Summary
This plan outlines how to integrate functionality from lib_old into the current Foundation/JidoSystem while maintaining sound architectural principles. The old system was functionally rich but architecturally broken: ad-hoc process spawning, missing supervision, misused concurrency primitives, and poor layer separation. This rebuild focuses on correct design, proper supervision, and clean coupling between the Jido and Foundation systems.
Architectural Principles for Sound Design
1. Supervision-First Architecture
- Every process must have a supervisor - No ad-hoc spawning
- Proper supervision trees - Clear parent-child relationships
- Graceful shutdown - All processes must handle termination properly
- Restart strategies - Appropriate restart policies for each service
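A minimal sketch of these principles (module and child names here are illustrative, not part of the current codebase):

defmodule MyApp.ExampleSupervisor do
  use Supervisor

  def start_link(opts) do
    Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    children = [
      # Long-lived stateful service: always restart
      Supervisor.child_spec({CriticalService, []}, restart: :permanent),
      # One-off worker: restart only on abnormal exit
      Supervisor.child_spec({BatchWorker, []}, restart: :transient)
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end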
2. Protocol-Based Coupling
- Foundation protocols as interfaces - Clean abstraction boundaries
- Jido agents use protocols, not implementations - Loose coupling
- Swappable implementations - Production vs test vs development
- Clear service contracts - Well-defined interface specifications
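As a sketch of what protocol-based coupling looks like in practice (the real Foundation.Registry contract may differ; TestRegistry is a hypothetical test double):

# Sketch only: the actual Foundation.Registry contract may differ.
defprotocol Foundation.Registry do
  @doc "Register a process under a key with metadata."
  def register(impl, key, pid, metadata)

  @doc "Look up a previously registered process."
  def lookup(impl, key)
end

# A test double implements the same protocol, so agents written against
# Foundation.Registry never change between production and test.
defmodule TestRegistry do
  defstruct table: %{}
end

defimpl Foundation.Registry, for: TestRegistry do
  def register(%TestRegistry{} = reg, key, pid, metadata) do
    {:ok, %{reg | table: Map.put(reg.table, key, {pid, metadata})}}
  end

  def lookup(%TestRegistry{} = reg, key), do: Map.fetch(reg.table, key)
end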
3. Proper Concurrency Patterns
- GenServer for stateful services - Proper state management
- Task/Agent for stateless operations - Appropriate concurrency primitives
- Proper message passing - No shared mutable state
- Backpressure handling - Prevent resource exhaustion
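For example, backpressure for stateless fan-out work can be expressed with a supervised task stream (a sketch; handle_item/1 is hypothetical, Foundation.TaskSupervisor is the supervisor started in the tree below):

# max_concurrency caps in-flight work, so a fast producer cannot
# exhaust the system; no shared mutable accumulator is needed.
def process_batch(items) do
  Foundation.TaskSupervisor
  |> Task.Supervisor.async_stream(items, &handle_item/1,
    max_concurrency: System.schedulers_online(),
    timeout: 5_000,
    on_timeout: :kill_task
  )
  |> Enum.map(fn
    {:ok, result} -> result
    {:exit, reason} -> {:error, reason}
  end)
end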
4. Clear Layer Separation
┌─────────────────────────────────┐
│           JidoSystem            │  ← Application Layer (Agents, Actions, Sensors)
├─────────────────────────────────┤
│         JidoFoundation          │  ← Integration Layer (Bridge, SignalRouter)
├─────────────────────────────────┤
│           Foundation            │  ← Infrastructure Layer (Services, Protocols)
└─────────────────────────────────┘
Critical Design Lessons from lib_old Failures
❌ What NOT to Repeat from lib_old
Ad-hoc Process Spawning
- Old: spawn/1 and Task.start/1 everywhere
- New: All processes under proper supervision
Broken Service Coupling
- Old: Direct module calls between layers
- New: Protocol-based interfaces with clear boundaries
Missing Error Boundaries
- Old: Errors cascade across unrelated services
- New: Circuit breakers and error isolation
Inconsistent State Management
- Old: Mixed ETS/Agent/GenServer with no clear pattern
- New: Consistent state management per service type
No Graceful Degradation
- Old: Single point of failure bringing down entire system
- New: Services fail independently with fallbacks
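As an illustration of independent failure with a fallback (function names are hypothetical; the circuit breaker return shape is assumed from the interface later in this plan):

def fetch_recommendations(user_id) do
  case Foundation.Infrastructure.CircuitBreaker.call(:recommender, fn ->
         Recommender.fetch(user_id)
       end) do
    {:ok, recs} ->
      recs

    # The recommender failing must not take the caller down;
    # degrade to a cached or default answer instead.
    {:error, _reason} ->
      cached_defaults(user_id)
  end
end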
Sound Architecture Implementation Plan
Phase 1: Foundation Infrastructure Services (Weeks 1-2)
1.1 Service Supervision Architecture
Goal: Establish proper supervision tree for all Foundation services
Implementation:
# Foundation.Application supervision tree
def children do
  [
    # Core infrastructure services
    {Foundation.Services.Supervisor, []},
    {Foundation.Infrastructure.Supervisor, []},

    # Integration layer
    {JidoFoundation.Supervisor, []},

    # JidoSystem (if configured)
    {JidoSystem.Supervisor, []}
  ]
end
Services to Implement:
- Foundation.Services.Supervisor - Manages service layer
- Foundation.Infrastructure.Supervisor - Manages infrastructure services
- Proper process registration - Named processes with clear ownership
Architecture Requirements:
- Each service is a GenServer under supervision
- Services register via Foundation.Registry protocol
- Graceful shutdown with cleanup hooks
- Health checks integrated into supervision
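A skeleton that satisfies these requirements might look like the following (the Foundation.Registry call shape is assumed, not confirmed):

defmodule Foundation.Services.ExampleService do
  use GenServer

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(opts) do
    # Trap exits so terminate/2 runs during supervisor shutdown
    Process.flag(:trap_exit, true)

    registry = Keyword.fetch!(opts, :registry)
    Foundation.Registry.register(registry, __MODULE__, self(), %{health_check: &health_check/0})

    {:ok, %{registry: registry}}
  end

  def health_check, do: :ok

  @impl true
  def terminate(_reason, _state) do
    # Cleanup hook: release resources before shutdown or restart
    :ok
  end
end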
1.2 Retry Service with ElixirRetry
Goal: Sound retry mechanisms using the proven ElixirRetry library
Implementation:
defmodule Foundation.Services.RetryService do
  use GenServer
  # `use Retry` brings in the retry/2 macro and the delay-stream helpers
  use Retry

  # Clean interface using ElixirRetry
  def retry_operation(operation, policy_name) do
    policy = get_policy(policy_name)

    retry with: policy do
      operation.()
    end
  end

  # Policies configured in supervision tree
  defp get_policy(:exponential_backoff) do
    exponential_backoff(500) |> cap(10_000) |> expiry(30_000)
  end
end
Architecture Benefits:
- Centralized retry configuration
- Clear separation of retry logic from business logic
- Proper telemetry integration
- Circuit breaker integration
1.3 Enhanced Circuit Breaker Service
Goal: Production-grade circuit breaker with proper supervision
Current Issue: The existing circuit breaker is basic and needs enhancement
Solution: Extend the current Foundation.Infrastructure.CircuitBreaker
Implementation:
defmodule Foundation.Infrastructure.CircuitBreaker do
  use GenServer

  # Proper state management
  defstruct [
    :service_configs,
    :circuit_states,
    :telemetry_handler,
    :cleanup_timer
  ]

  # Integration with retry service
  def call_with_retry(service_id, operation, retry_policy) do
    case call(service_id, operation) do
      {:error, :circuit_open} ->
        Foundation.Services.RetryService.retry_operation(
          fn -> call(service_id, operation) end,
          retry_policy
        )

      result ->
        result
    end
  end
end
Architecture Improvements:
- Proper GenServer state management
- Integration with telemetry service
- Clean coupling with retry service
- Supervision tree integration
Phase 2: Service Layer Architecture (Weeks 3-4)
2.1 Configuration Service
Goal: Centralized configuration with hot-reload capability
Architecture:
defmodule Foundation.Services.ConfigService do
  use GenServer

  # Clean interface for agents
  def get_config(service_id, key, default \\ nil)
  def update_config(service_id, updates)
  def subscribe_to_changes(service_id, subscriber_pid)

  # Proper supervision integration
  def child_spec(opts) do
    %{
      id: __MODULE__,
      start: {__MODULE__, :start_link, [opts]},
      restart: :permanent,
      type: :worker
    }
  end
end
Integration with JidoSystem:
# In TaskAgent
def init(opts) do
  # Subscribe to config changes
  Foundation.Services.ConfigService.subscribe_to_changes(:task_agent, self())
  config = Foundation.Services.ConfigService.get_config(:task_agent, :all, default_config())
  # ... rest of init
end

def handle_info({:config_updated, new_config}, state) do
  # Hot-reload configuration
  updated_state = apply_config_changes(state, new_config)
  {:noreply, updated_state}
end
Sound Design Principles:
- Configuration changes are messages, not direct calls
- Each service manages its own configuration subscription
- No shared mutable configuration state
- Clear error boundaries for configuration failures
2.2 Service Discovery
Goal: Dynamic service discovery with health checking
Architecture:
defmodule Foundation.Services.ServiceDiscovery do
  use GenServer

  # Protocol-based interface
  @behaviour Foundation.ServiceDiscovery

  # Clean service registration
  def register_service(service_id, capabilities, health_check_fun)
  def discover_services(capability_requirements)
  def get_service_health(service_id)

  # Proper health checking with supervision
  defp schedule_health_checks(state) do
    # Use supervised Task for health checks
    Task.Supervisor.start_child(
      Foundation.TaskSupervisor,
      fn -> perform_health_checks(state.services) end
    )
  end
end
JidoSystem Integration:
# FoundationAgent registers itself
def init(opts) do
  capabilities = Keyword.get(opts, :capabilities, [])

  Foundation.Services.ServiceDiscovery.register_service(
    self(),
    capabilities,
    &health_check/0
  )

  # ... rest of init
end

# CoordinatorAgent discovers agents
def delegate_task(task) do
  required_capabilities = extract_capabilities(task)

  case Foundation.Services.ServiceDiscovery.discover_services(required_capabilities) do
    {:ok, agents} -> select_best_agent(agents, task)
    {:error, :no_agents} -> {:error, :no_suitable_agents}
  end
end
2.3 Telemetry Service
Goal: Centralized telemetry with proper aggregation
Architecture:
defmodule Foundation.Services.TelemetryService do
  use GenServer

  # Clean aggregation interface
  def emit_metric(metric_name, value, tags \\ %{})
  def register_metric_collector(collector_fun)
  def get_metrics_summary(time_range)

  # Proper state management
  defstruct [
    :metrics_buffer,
    :collectors,
    :aggregation_timer,
    :storage_backend
  ]

  # Graceful shutdown
  def terminate(_reason, state) do
    # Flush remaining metrics
    flush_metrics_buffer(state.metrics_buffer)
    :ok
  end
end
Integration Pattern:
# In agents, use the telemetry service instead of calling :telemetry directly
def process_task(task) do
  start_time = System.monotonic_time()
  result = perform_task(task)
  duration = System.monotonic_time() - start_time

  Foundation.Services.TelemetryService.emit_metric(
    "task.duration",
    duration,
    %{agent_id: self(), task_type: task.type}
  )

  result
end
Phase 3: Advanced Infrastructure (Weeks 5-6)
3.1 Connection Manager
Goal: Proper HTTP connection pooling with supervision
Architecture:
defmodule Foundation.Infrastructure.ConnectionManager do
  # This process supervises connection pools, so it is a DynamicSupervisor
  # rather than a GenServer
  use DynamicSupervisor

  # Supervised connection pools
  def child_spec(opts) do
    %{
      id: __MODULE__,
      start: {__MODULE__, :start_link, [opts]},
      restart: :permanent,
      type: :supervisor # This is a supervisor for connection pools
    }
  end

  # Clean interface
  def get_connection(service_name, opts \\ [])
  def return_connection(service_name, connection)
  def get_pool_status(service_name)

  # Proper pool management
  defp init_connection_pool(service_name, config) do
    pool_spec = {
      Finch,
      name: pool_name(service_name),
      pools: %{
        config.base_url => [
          size: config.pool_size,
          conn_opts: config.connection_opts
        ]
      }
    }

    DynamicSupervisor.start_child(__MODULE__, pool_spec)
  end
end
Sound Integration:
# Agents use the connection manager, not direct HTTP
def call_external_service(endpoint, data) do
  case Foundation.Infrastructure.ConnectionManager.get_connection(:external_api) do
    {:ok, connection} ->
      result = make_http_request(connection, endpoint, data)
      Foundation.Infrastructure.ConnectionManager.return_connection(:external_api, connection)
      result

    {:error, :no_connections} ->
      {:error, :service_unavailable}
  end
end
3.2 Rate Limiter
Goal: Proper rate limiting with Hammer integration
Architecture:
defmodule Foundation.Infrastructure.RateLimiter do
  use GenServer

  # Protocol compliance
  @behaviour Foundation.Infrastructure

  # Clean interface
  def check_rate_limit(service_id, identifier, action \\ :default)
  def get_rate_limit_status(service_id, identifier)
  def configure_rate_limit(service_id, config)

  # Proper Hammer integration (the backend must be configured
  # before the Hammer application starts)
  defp init_hammer_backend do
    Application.put_env(
      :hammer,
      :backend,
      {Hammer.Backend.ETS, [expiry_ms: 60_000 * 60 * 2]}
    )
  end
end
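Call sites can then wrap Hammer directly. A hypothetical body for check_rate_limit/3, assuming the classic Hammer.check_rate/3 API (pre-7.x) that returns {:allow, count} or {:deny, limit}; the window and limit shown are arbitrary:

def check_rate_limit(service_id, identifier, action \\ :default) do
  # Key combines service, caller, and action so limits are scoped per action
  key = "#{service_id}:#{identifier}:#{action}"

  case Hammer.check_rate(key, 60_000, 100) do
    {:allow, _count} -> :ok
    {:deny, _limit} -> {:error, :rate_limited}
  end
end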
Phase 4: Storage and Persistence (Weeks 7-8)
4.1 Event Store Service
Goal: Persistent event storage with proper supervision
Architecture:
defmodule Foundation.Services.EventStore do
  use GenServer

  # Storage backend protocol
  @behaviour Foundation.EventStorage

  # Clean event interface
  def store_event(stream_id, event_type, event_data, metadata \\ %{})
  def get_events(stream_id, from_version \\ 0)
  def subscribe_to_stream(stream_id, subscriber_pid)

  # Proper state management
  defstruct [
    :storage_backend,
    :subscribers,
    :event_buffer,
    :flush_timer
  ]
end
JidoSystem Integration:
# TaskAgent stores task events
def process_task(task) do
  # Pids must go through inspect/1 before string interpolation
  stream_id = "task_agent_#{inspect(self())}"

  Foundation.Services.EventStore.store_event(
    stream_id,
    :task_started,
    %{task_id: task.id, task_type: task.type}
  )

  result = perform_task_processing(task)

  Foundation.Services.EventStore.store_event(
    stream_id,
    :task_completed,
    %{task_id: task.id, result: result}
  )

  result
end
4.2 Persistent Task Queue
Goal: Durable task queues with proper recovery
Architecture:
defmodule Foundation.Infrastructure.PersistentQueue do
  use GenServer

  # Clean queue interface
  def enqueue(queue_name, item, priority \\ 0)
  def dequeue(queue_name)
  def peek(queue_name, count \\ 1)
  def get_queue_stats(queue_name)

  # Proper recovery
  def init(opts) do
    queue_name = Keyword.fetch!(opts, :queue_name)

    # Recover queue from storage
    stored_items = recover_queue_from_storage(queue_name)

    state = %{
      queue_name: queue_name,
      items: :queue.from_list(stored_items),
      storage_backend: get_storage_backend(),
      persistence_timer: schedule_persistence()
    }

    {:ok, state}
  end
end
TaskAgent Integration:
# TaskAgent uses a persistent queue
def init(opts) do
  queue_name = "task_queue_#{inspect(self())}"

  # Start supervised persistent queue
  queue_spec = {Foundation.Infrastructure.PersistentQueue, queue_name: queue_name}
  {:ok, queue_pid} = DynamicSupervisor.start_child(Foundation.QueueSupervisor, queue_spec)

  state = %{
    queue_pid: queue_pid
    # ... other state
  }

  {:ok, state}
end
Phase 5: Monitoring and Alerting (Weeks 9-10)
5.1 Alerting Service
Goal: Proper alert delivery with escalation
Architecture:
defmodule Foundation.Services.AlertingService do
  use GenServer

  # Clean alerting interface
  def send_alert(alert_type, severity, message, metadata \\ %{})
  def register_alert_handler(handler_id, handler_fun)
  def configure_escalation_policy(policy_name, policy_config)

  # Proper alert handling
  defstruct [
    :alert_handlers,
    :escalation_policies,
    :active_alerts,
    :alert_history
  ]

  # Integration with monitoring
  def handle_threshold_violation(metric_name, current_value, threshold) do
    alert_data = %{
      metric: metric_name,
      current_value: current_value,
      threshold: threshold,
      timestamp: DateTime.utc_now()
    }

    send_alert(:threshold_violation, :warning, "Threshold exceeded", alert_data)
  end
end
5.2 Health Check Service
Goal: Comprehensive health monitoring
Architecture:
defmodule Foundation.Services.HealthCheckService do
  use GenServer

  # Clean health check interface
  def register_health_check(service_id, check_fun, interval \\ 30_000)
  def get_service_health(service_id)
  def get_system_health()

  # Proper scheduling
  defp schedule_health_checks(state) do
    Enum.each(state.health_checks, fn {service_id, config} ->
      Process.send_after(self(), {:check_health, service_id}, config.interval)
    end)
  end
end
Implementation Guidelines
1. Supervision Tree Design
# Foundation.Application
def start(_type, _args) do
  children = [
    # Core infrastructure
    {Foundation.Infrastructure.Supervisor, []},

    # Service layer
    {Foundation.Services.Supervisor, []},

    # Task supervision for async work
    {Task.Supervisor, name: Foundation.TaskSupervisor},
    {DynamicSupervisor, name: Foundation.DynamicSupervisor},

    # Integration layer
    {JidoFoundation.Supervisor, []}
  ]

  opts = [strategy: :one_for_one, name: Foundation.Supervisor]
  Supervisor.start_link(children, opts)
end
2. Error Boundary Pattern
# Each service has proper error boundaries
def handle_call({:risky_operation, params}, _from, state) do
  try do
    result = perform_risky_operation(params)
    {:reply, {:ok, result}, state}
  rescue
    error ->
      # Log the error but don't crash the service
      Logger.error("Operation failed: #{Exception.message(error)}")

      # Emit telemetry for monitoring
      Foundation.Services.TelemetryService.emit_metric(
        "service.error",
        1,
        %{service: __MODULE__, operation: :risky_operation}
      )

      {:reply, {:error, :operation_failed}, state}
  end
end
3. Protocol-Based Integration
# JidoSystem agents use protocols, not direct service calls
def get_configuration(key) do
  # Resolve the configured implementation, then call through the protocol
  config_impl = Foundation.get_service_impl(:configuration)
  Foundation.Configuration.get_config(config_impl, :task_agent, key)
end
Success Criteria
- Zero Ad-hoc Processes - All processes under supervision
- Clean Layer Separation - No direct cross-layer dependencies
- Graceful Degradation - Services fail independently
- Proper Error Boundaries - Errors don’t cascade
- Protocol Compliance - All services implement Foundation protocols
- Comprehensive Monitoring - Full observability of service health
- Hot Configuration - Configuration updates without restart
- Persistent State - Critical state survives restarts
Testing Strategy
- Unit Tests - Each service tested in isolation
- Integration Tests - Service interactions tested
- Supervision Tests - Verify proper restart behavior
- Chaos Testing - Random service failures
- Load Testing - Verify backpressure handling
- Recovery Testing - State recovery after restart
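For example, a supervision test can kill a service and assert that its supervisor restarts it (a sketch; assumes RetryService is registered under its module name):

test "RetryService is restarted after a crash" do
  pid = Process.whereis(Foundation.Services.RetryService)
  assert is_pid(pid)

  ref = Process.monitor(pid)
  Process.exit(pid, :kill)
  assert_receive {:DOWN, ^ref, :process, ^pid, :killed}

  # Poll briefly while the supervisor restarts the child
  new_pid =
    Enum.find_value(1..50, fn _ ->
      Process.sleep(10)
      Process.whereis(Foundation.Services.RetryService)
    end)

  assert is_pid(new_pid)
  refute new_pid == pid
end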
This plan ensures we get all the functionality from lib_old while maintaining sound architectural principles and avoiding the design flaws that made the original system unmaintainable.