← Back to Foundation

LIB OLD PORT PLAN

Documentation for LIB_OLD_PORT_PLAN from the Foundation repository.

LIB_OLD Functionality Integration Plan - Sound Architecture Rebuild

Executive Summary

This plan outlines how to integrate functionality from lib_old into the current Foundation/JidoSystem in a way that maintains sound architectural principles. The old system was functionally rich but architecturally broken with ad-hoc processes, missing supervision, broken concurrency primitives, and poor layer separation. This rebuild focuses on correct design, proper supervision, and clean coupling between Jido and Foundation systems.

Architectural Principles for Sound Design

1. Supervision-First Architecture

  • Every process must have a supervisor - No ad-hoc spawning
  • Proper supervision trees - Clear parent-child relationships
  • Graceful shutdown - All processes must handle termination properly
  • Restart strategies - Appropriate restart policies for each service

2. Protocol-Based Coupling

  • Foundation protocols as interfaces - Clean abstraction boundaries
  • Jido agents use protocols, not implementations - Loose coupling
  • Swappable implementations - Production vs test vs development
  • Clear service contracts - Well-defined interface specifications

3. Proper Concurrency Patterns

  • GenServer for stateful services - Proper state management
  • Task/Agent for stateless operations - Appropriate concurrency primitives
  • Proper message passing - No shared mutable state
  • Backpressure handling - Prevent resource exhaustion

4. Clear Layer Separation

┌─────────────────────────────────┐
│         JidoSystem              │ ← Application Layer (Agents, Actions, Sensors)
├─────────────────────────────────┤
│       JidoFoundation            │ ← Integration Layer (Bridge, SignalRouter)
├─────────────────────────────────┤
│        Foundation               │ ← Infrastructure Layer (Services, Protocols)
└─────────────────────────────────┘

Critical Design Lessons from lib_old Failures

❌ What NOT to Repeat from lib_old

  1. Ad-hoc Process Spawning

    • Old: spawn/1 and Task.start/1 everywhere
    • New: All processes under proper supervision
  2. Broken Service Coupling

    • Old: Direct module calls between layers
    • New: Protocol-based interfaces with clear boundaries
  3. Missing Error Boundaries

    • Old: Errors cascade across unrelated services
    • New: Circuit breakers and error isolation
  4. Inconsistent State Management

    • Old: Mixed ETS/Agent/GenServer with no clear pattern
    • New: Consistent state management per service type
  5. No Graceful Degradation

    • Old: Single point of failure bringing down entire system
    • New: Services fail independently with fallbacks

Sound Architecture Implementation Plan

Phase 1: Foundation Infrastructure Services (Weeks 1-2)

1.1 Service Supervision Architecture

Goal: Establish proper supervision tree for all Foundation services

Implementation:

# Foundation.Application supervision tree
def children do
  [
    # Core infrastructure services
    {Foundation.Services.Supervisor, []},
    {Foundation.Infrastructure.Supervisor, []},
    
    # Integration layer
    {JidoFoundation.Supervisor, []},
    
    # JidoSystem (if configured)
    {JidoSystem.Supervisor, []}
  ]
end

Services to Implement:

  1. Foundation.Services.Supervisor - Manages service layer
  2. Foundation.Infrastructure.Supervisor - Manages infrastructure services
  3. Proper process registration - Named processes with clear ownership

Architecture Requirements:

  • Each service is a GenServer under supervision
  • Services register via Foundation.Registry protocol
  • Graceful shutdown with cleanup hooks
  • Health checks integrated into supervision

1.2 Retry Service with ElixirRetry

Goal: Sound retry mechanisms using proven ElixirRetry library

Implementation:

defmodule Foundation.Services.RetryService do
  use GenServer
  
  # Clean interface using ElixirRetry
  def retry_operation(operation, policy_name) do
    policy = get_policy(policy_name)
    Retry.retry with: policy do
      operation.()
    end
  end
  
  # Policies configured in supervision tree
  defp get_policy(:exponential_backoff) do
    linear_backoff(500, 2) |> cap(10_000) |> expiry(30_000)
  end
end

Architecture Benefits:

  • Centralized retry configuration
  • Clear separation of retry logic from business logic
  • Proper telemetry integration
  • Circuit breaker integration

1.3 Enhanced Circuit Breaker Service

Goal: Production-grade circuit breaker with proper supervision

Current Issue: Existing circuit breaker is basic, needs enhancement Solution: Extend current Foundation.Infrastructure.CircuitBreaker

Implementation:

defmodule Foundation.Infrastructure.CircuitBreaker do
  use GenServer
  
  # Proper state management
  defstruct [
    :service_configs,
    :circuit_states,
    :telemetry_handler,
    :cleanup_timer
  ]
  
  # Integration with retry service
  def call_with_retry(service_id, operation, retry_policy) do
    case call(service_id, operation) do
      {:error, :circuit_open} -> 
        Foundation.Services.RetryService.retry_operation(
          fn -> call(service_id, operation) end,
          retry_policy
        )
      result -> result
    end
  end
end

Architecture Improvements:

  • Proper GenServer state management
  • Integration with telemetry service
  • Clean coupling with retry service
  • Supervision tree integration

Phase 2: Service Layer Architecture (Weeks 3-4)

2.1 Configuration Service

Goal: Centralized configuration with hot-reload capability

Architecture:

defmodule Foundation.Services.ConfigService do
  use GenServer
  
  # Clean interface for agents
  def get_config(service_id, key, default \\ nil)
  def update_config(service_id, updates)
  def subscribe_to_changes(service_id, subscriber_pid)
  
  # Proper supervision integration
  def child_spec(opts) do
    %{
      id: __MODULE__,
      start: {__MODULE__, :start_link, [opts]},
      restart: :permanent,
      type: :worker
    }
  end
end

Integration with JidoSystem:

# In TaskAgent
def init(opts) do
  # Subscribe to config changes
  Foundation.Services.ConfigService.subscribe_to_changes(:task_agent, self())
  
  config = Foundation.Services.ConfigService.get_config(:task_agent, :all, default_config())
  # ... rest of init
end

def handle_info({:config_updated, new_config}, state) do
  # Hot-reload configuration
  updated_state = apply_config_changes(state, new_config)
  {:noreply, updated_state}
end

Sound Design Principles:

  • Configuration changes are messages, not direct calls
  • Each service manages its own configuration subscription
  • No shared mutable configuration state
  • Clear error boundaries for configuration failures

2.2 Service Discovery

Goal: Dynamic service discovery with health checking

Architecture:

defmodule Foundation.Services.ServiceDiscovery do
  use GenServer
  
  # Protocol-based interface
  @behaviour Foundation.ServiceDiscovery
  
  # Clean service registration
  def register_service(service_id, capabilities, health_check_fun)
  def discover_services(capability_requirements)
  def get_service_health(service_id)
  
  # Proper health checking with supervision
  defp schedule_health_checks(state) do
    # Use supervised Task for health checks
    Task.Supervisor.start_child(
      Foundation.TaskSupervisor,
      fn -> perform_health_checks(state.services) end
    )
  end
end

JidoSystem Integration:

# FoundationAgent registers itself
def init(opts) do
  capabilities = Keyword.get(opts, :capabilities, [])
  
  Foundation.Services.ServiceDiscovery.register_service(
    self(),
    capabilities,
    &health_check/0
  )
  
  # ... rest of init
end

# CoordinatorAgent discovers agents
def delegate_task(task) do
  required_capabilities = extract_capabilities(task)
  
  case Foundation.Services.ServiceDiscovery.discover_services(required_capabilities) do
    {:ok, agents} -> select_best_agent(agents, task)
    {:error, :no_agents} -> {:error, :no_suitable_agents}
  end
end

2.3 Telemetry Service

Goal: Centralized telemetry with proper aggregation

Architecture:

defmodule Foundation.Services.TelemetryService do
  use GenServer
  
  # Clean aggregation interface
  def emit_metric(metric_name, value, tags \\ %{})
  def register_metric_collector(collector_fun)
  def get_metrics_summary(time_range)
  
  # Proper state management
  defstruct [
    :metrics_buffer,
    :collectors,
    :aggregation_timer,
    :storage_backend
  ]
  
  # Graceful shutdown
  def terminate(reason, state) do
    # Flush remaining metrics
    flush_metrics_buffer(state.metrics_buffer)
    :ok
  end
end

Integration Pattern:

# In agents, use telemetry service instead of direct :telemetry
def process_task(task) do
  start_time = System.monotonic_time()
  
  result = perform_task(task)
  
  duration = System.monotonic_time() - start_time
  Foundation.Services.TelemetryService.emit_metric(
    "task.duration",
    duration,
    %{agent_id: self(), task_type: task.type}
  )
  
  result
end

Phase 3: Advanced Infrastructure (Weeks 5-6)

3.1 Connection Manager

Goal: Proper HTTP connection pooling with supervision

Architecture:

defmodule Foundation.Infrastructure.ConnectionManager do
  use GenServer
  
  # Supervised connection pools
  def child_spec(opts) do
    %{
      id: __MODULE__,
      start: {__MODULE__, :start_link, [opts]},
      restart: :permanent,
      type: :supervisor  # This is a supervisor for connection pools
    }
  end
  
  # Clean interface
  def get_connection(service_name, opts \\ [])
  def return_connection(service_name, connection)
  def get_pool_status(service_name)
  
  # Proper pool management
  defp init_connection_pool(service_name, config) do
    pool_spec = {
      Finch,
      name: pool_name(service_name),
      pools: %{
        config.base_url => [
          size: config.pool_size,
          conn_opts: config.connection_opts
        ]
      }
    }
    
    DynamicSupervisor.start_child(__MODULE__, pool_spec)
  end
end

Sound Integration:

# Agents use connection manager, not direct HTTP
def call_external_service(endpoint, data) do
  case Foundation.Infrastructure.ConnectionManager.get_connection(:external_api) do
    {:ok, connection} ->
      result = make_http_request(connection, endpoint, data)
      Foundation.Infrastructure.ConnectionManager.return_connection(:external_api, connection)
      result
    {:error, :no_connections} ->
      {:error, :service_unavailable}
  end
end

3.2 Rate Limiter

Goal: Proper rate limiting with Hammer integration

Architecture:

defmodule Foundation.Infrastructure.RateLimiter do
  use GenServer
  
  # Protocol compliance
  @behaviour Foundation.Infrastructure
  
  # Clean interface
  def check_rate_limit(service_id, identifier, action \\ :default)
  def get_rate_limit_status(service_id, identifier)
  def configure_rate_limit(service_id, config)
  
  # Proper Hammer integration
  defp init_hammer_backend do
    hammer_config = [
      backend: {Hammer.Backend.ETS, [expiry_ms: 60_000 * 60 * 2]}
    ]
    
    Application.put_env(:hammer, :backend, hammer_config)
  end
end

Phase 4: Storage and Persistence (Weeks 7-8)

4.1 Event Store Service

Goal: Persistent event storage with proper supervision

Architecture:

defmodule Foundation.Services.EventStore do
  use GenServer
  
  # Clean event interface
  def store_event(stream_id, event_type, event_data, metadata \\ %{})
  def get_events(stream_id, from_version \\ 0)
  def subscribe_to_stream(stream_id, subscriber_pid)
  
  # Proper state management
  defstruct [
    :storage_backend,
    :subscribers,
    :event_buffer,
    :flush_timer
  ]
  
  # Storage backend protocol
  @behaviour Foundation.EventStorage
end

JidoSystem Integration:

# TaskAgent stores task events
def process_task(task) do
  Foundation.Services.EventStore.store_event(
    "task_agent_#{self()}",
    :task_started,
    %{task_id: task.id, task_type: task.type}
  )
  
  result = perform_task_processing(task)
  
  Foundation.Services.EventStore.store_event(
    "task_agent_#{self()}",
    :task_completed,
    %{task_id: task.id, result: result}
  )
  
  result
end

4.2 Persistent Task Queue

Goal: Durable task queues with proper recovery

Architecture:

defmodule Foundation.Infrastructure.PersistentQueue do
  use GenServer
  
  # Clean queue interface
  def enqueue(queue_name, item, priority \\ 0)
  def dequeue(queue_name)
  def peek(queue_name, count \\ 1)
  def get_queue_stats(queue_name)
  
  # Proper recovery
  def init(opts) do
    queue_name = Keyword.fetch!(opts, :queue_name)
    
    # Recover queue from storage
    stored_items = recover_queue_from_storage(queue_name)
    
    state = %{
      queue_name: queue_name,
      items: :queue.from_list(stored_items),
      storage_backend: get_storage_backend(),
      persistence_timer: schedule_persistence()
    }
    
    {:ok, state}
  end
end

TaskAgent Integration:

# TaskAgent uses persistent queue
def init(opts) do
  queue_name = "task_queue_#{self()}"
  
  # Start supervised persistent queue
  queue_spec = {Foundation.Infrastructure.PersistentQueue, queue_name: queue_name}
  {:ok, queue_pid} = DynamicSupervisor.start_child(Foundation.QueueSupervisor, queue_spec)
  
  state = %{
    queue_pid: queue_pid,
    # ... other state
  }
  
  {:ok, state}
end

Phase 5: Monitoring and Alerting (Weeks 9-10)

5.1 Alerting Service

Goal: Proper alert delivery with escalation

Architecture:

defmodule Foundation.Services.AlertingService do
  use GenServer
  
  # Clean alerting interface
  def send_alert(alert_type, severity, message, metadata \\ %{})
  def register_alert_handler(handler_id, handler_fun)
  def configure_escalation_policy(policy_name, policy_config)
  
  # Proper alert handling
  defstruct [
    :alert_handlers,
    :escalation_policies,
    :active_alerts,
    :alert_history
  ]
  
  # Integration with monitoring
  def handle_threshold_violation(metric_name, current_value, threshold) do
    alert_data = %{
      metric: metric_name,
      current_value: current_value,
      threshold: threshold,
      timestamp: DateTime.utc_now()
    }
    
    send_alert(:threshold_violation, :warning, "Threshold exceeded", alert_data)
  end
end

5.2 Health Check Service

Goal: Comprehensive health monitoring

Architecture:

defmodule Foundation.Services.HealthCheckService do
  use GenServer
  
  # Clean health check interface
  def register_health_check(service_id, check_fun, interval \\ 30_000)
  def get_service_health(service_id)
  def get_system_health()
  
  # Proper scheduling
  defp schedule_health_checks(state) do
    Enum.each(state.health_checks, fn {service_id, config} ->
      Process.send_after(self(), {:check_health, service_id}, config.interval)
    end)
  end
end

Implementation Guidelines

1. Supervision Tree Design

# Foundation.Application
def start(_type, _args) do
  children = [
    # Core infrastructure
    {Foundation.Infrastructure.Supervisor, []},
    
    # Service layer
    {Foundation.Services.Supervisor, []},
    
    # Task supervision for async work
    {Task.Supervisor, name: Foundation.TaskSupervisor},
    {DynamicSupervisor, name: Foundation.DynamicSupervisor},
    
    # Integration layer
    {JidoFoundation.Supervisor, []}
  ]
  
  opts = [strategy: :one_for_one, name: Foundation.Supervisor]
  Supervisor.start_link(children, opts)
end

2. Error Boundary Pattern

# Each service has proper error boundaries
def handle_call({:risky_operation, params}, _from, state) do
  try do
    result = perform_risky_operation(params)
    {:reply, {:ok, result}, state}
  rescue
    error ->
      # Log error but don't crash the service
      Logger.error("Operation failed: #{Exception.message(error)}")
      
      # Emit telemetry for monitoring
      Foundation.Services.TelemetryService.emit_metric(
        "service.error",
        1,
        %{service: __MODULE__, operation: :risky_operation}
      )
      
      {:reply, {:error, :operation_failed}, state}
  end
end

3. Protocol-Based Integration

# JidoSystem agents use protocols, not direct service calls
def get_configuration(key) do
  # Use protocol, not direct module call
  config_impl = Foundation.get_service_impl(:configuration)
  Foundation.Configuration.get_config(config_impl, :task_agent, key)
end

Success Criteria

  1. Zero Ad-hoc Processes - All processes under supervision
  2. Clean Layer Separation - No direct cross-layer dependencies
  3. Graceful Degradation - Services fail independently
  4. Proper Error Boundaries - Errors don’t cascade
  5. Protocol Compliance - All services implement Foundation protocols
  6. Comprehensive Monitoring - Full observability of service health
  7. Hot Configuration - Configuration updates without restart
  8. Persistent State - Critical state survives restarts

Testing Strategy

  1. Unit Tests - Each service tested in isolation
  2. Integration Tests - Service interactions tested
  3. Supervision Tests - Verify proper restart behavior
  4. Chaos Testing - Random service failures
  5. Load Testing - Verify backpressure handling
  6. Recovery Testing - State recovery after restart

This plan ensures we get all the functionality from lib_old while maintaining sound architectural principles and avoiding the design flaws that made the original system unmaintainable.