LIB_OLD Functionality Integration Plan - Sound Architecture Rebuild
Executive Summary
This plan outlines how to integrate functionality from lib_old into the current Foundation/JidoSystem while maintaining sound architectural principles. The old system was functionally rich but architecturally broken: ad-hoc process spawning, missing supervision, misused concurrency primitives, and poor layer separation. This rebuild focuses on correct design, proper supervision, and clean coupling between the Jido and Foundation systems.
Architectural Principles for Sound Design
1. Supervision-First Architecture
- Every process must have a supervisor - No ad-hoc spawning
- Proper supervision trees - Clear parent-child relationships
- Graceful shutdown - All processes must handle termination properly
- Restart strategies - Appropriate restart policies for each service
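A minimal sketch of these principles (module and child names here are illustrative, not part of the current codebase):

defmodule MyApp.ExampleSupervisor do
  use Supervisor

  def start_link(opts) do
    Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    children = [
      # Long-lived stateful service: always restart
      Supervisor.child_spec({CriticalService, []}, restart: :permanent),
      # One-off worker: restart only on abnormal exit
      Supervisor.child_spec({BatchWorker, []}, restart: :transient)
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end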
2. Protocol-Based Coupling
- Foundation protocols as interfaces - Clean abstraction boundaries
- Jido agents use protocols, not implementations - Loose coupling
- Swappable implementations - Production vs test vs development
- Clear service contracts - Well-defined interface specifications
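As a sketch of what protocol-based coupling looks like in practice (the real Foundation.Registry contract may differ; TestRegistry is a hypothetical test double):

# Sketch only: the actual Foundation.Registry contract may differ.
defprotocol Foundation.Registry do
  @doc "Register a process under a key with metadata."
  def register(impl, key, pid, metadata)

  @doc "Look up a previously registered process."
  def lookup(impl, key)
end

# A test double implements the same protocol, so agents written against
# Foundation.Registry never change between production and test.
defmodule TestRegistry do
  defstruct table: %{}
end

defimpl Foundation.Registry, for: TestRegistry do
  def register(%TestRegistry{} = reg, key, pid, metadata) do
    {:ok, %{reg | table: Map.put(reg.table, key, {pid, metadata})}}
  end

  def lookup(%TestRegistry{} = reg, key), do: Map.fetch(reg.table, key)
end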
3. Proper Concurrency Patterns
- GenServer for stateful services - Proper state management
- Task/Agent for stateless operations - Appropriate concurrency primitives
- Proper message passing - No shared mutable state
- Backpressure handling - Prevent resource exhaustion
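For example, backpressure for stateless fan-out work can be expressed with a supervised task stream (a sketch; handle_item/1 is hypothetical, Foundation.TaskSupervisor is the supervisor started in the tree below):

# max_concurrency caps in-flight work, so a fast producer cannot
# exhaust the system; no shared mutable accumulator is needed.
def process_batch(items) do
  Foundation.TaskSupervisor
  |> Task.Supervisor.async_stream(items, &handle_item/1,
    max_concurrency: System.schedulers_online(),
    timeout: 5_000,
    on_timeout: :kill_task
  )
  |> Enum.map(fn
    {:ok, result} -> result
    {:exit, reason} -> {:error, reason}
  end)
end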
4. Clear Layer Separation
┌─────────────────────────────────┐
│           JidoSystem            │  ← Application Layer (Agents, Actions, Sensors)
├─────────────────────────────────┤
│         JidoFoundation          │  ← Integration Layer (Bridge, SignalRouter)
├─────────────────────────────────┤
│           Foundation            │  ← Infrastructure Layer (Services, Protocols)
└─────────────────────────────────┘
Critical Design Lessons from lib_old Failures
❌ What NOT to Repeat from lib_old
Ad-hoc Process Spawning
- Old: spawn/1 and Task.start/1 everywhere
- New: All processes under proper supervision
Broken Service Coupling
- Old: Direct module calls between layers
- New: Protocol-based interfaces with clear boundaries
Missing Error Boundaries
- Old: Errors cascade across unrelated services
- New: Circuit breakers and error isolation
Inconsistent State Management
- Old: Mixed ETS/Agent/GenServer with no clear pattern
- New: Consistent state management per service type
No Graceful Degradation
- Old: Single point of failure bringing down entire system
- New: Services fail independently with fallbacks
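As an illustration of independent failure with a fallback (function names are hypothetical; the circuit breaker return shape is assumed from the interface later in this plan):

def fetch_recommendations(user_id) do
  case Foundation.Infrastructure.CircuitBreaker.call(:recommender, fn ->
         Recommender.fetch(user_id)
       end) do
    {:ok, recs} ->
      recs

    # The recommender failing must not take the caller down;
    # degrade to a cached or default answer instead.
    {:error, _reason} ->
      cached_defaults(user_id)
  end
end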
Sound Architecture Implementation Plan
Phase 1: Foundation Infrastructure Services (Weeks 1-2)
1.1 Service Supervision Architecture
Goal: Establish proper supervision tree for all Foundation services
Implementation:
# Foundation.Application supervision tree
def children do
  [
    # Core infrastructure services
    {Foundation.Services.Supervisor, []},
    {Foundation.Infrastructure.Supervisor, []},

    # Integration layer
    {JidoFoundation.Supervisor, []},

    # JidoSystem (if configured)
    {JidoSystem.Supervisor, []}
  ]
end
Services to Implement:
- Foundation.Services.Supervisor - Manages service layer
- Foundation.Infrastructure.Supervisor - Manages infrastructure services
- Proper process registration - Named processes with clear ownership
Architecture Requirements:
- Each service is a GenServer under supervision
- Services register via Foundation.Registry protocol
- Graceful shutdown with cleanup hooks
- Health checks integrated into supervision
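A skeleton that satisfies these requirements might look like the following (the Foundation.Registry call shape is assumed, not confirmed):

defmodule Foundation.Services.ExampleService do
  use GenServer

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(opts) do
    # Trap exits so terminate/2 runs during supervisor shutdown
    Process.flag(:trap_exit, true)

    registry = Keyword.fetch!(opts, :registry)
    Foundation.Registry.register(registry, __MODULE__, self(), %{health_check: &health_check/0})

    {:ok, %{registry: registry}}
  end

  def health_check, do: :ok

  @impl true
  def terminate(_reason, _state) do
    # Cleanup hook: release resources before shutdown or restart
    :ok
  end
end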
1.2 Retry Service with ElixirRetry
Goal: Sound retry mechanisms using the proven ElixirRetry library
Implementation:
defmodule Foundation.Services.RetryService do
  use GenServer
  # `use Retry` brings in the retry/2 macro and the delay-stream helpers
  use Retry

  # Clean interface using ElixirRetry
  def retry_operation(operation, policy_name) do
    policy = get_policy(policy_name)

    retry with: policy do
      operation.()
    end
  end

  # Policies configured in supervision tree
  defp get_policy(:exponential_backoff) do
    exponential_backoff(500) |> cap(10_000) |> expiry(30_000)
  end
end
Architecture Benefits:
- Centralized retry configuration
- Clear separation of retry logic from business logic
- Proper telemetry integration
- Circuit breaker integration
1.3 Enhanced Circuit Breaker Service
Goal: Production-grade circuit breaker with proper supervision
Current Issue: The existing circuit breaker is basic and needs enhancement
Solution: Extend the current Foundation.Infrastructure.CircuitBreaker
Implementation:
defmodule Foundation.Infrastructure.CircuitBreaker do
  use GenServer

  # Proper state management
  defstruct [
    :service_configs,
    :circuit_states,
    :telemetry_handler,
    :cleanup_timer
  ]

  # Integration with retry service
  def call_with_retry(service_id, operation, retry_policy) do
    case call(service_id, operation) do
      {:error, :circuit_open} ->
        Foundation.Services.RetryService.retry_operation(
          fn -> call(service_id, operation) end,
          retry_policy
        )

      result ->
        result
    end
  end
end
Architecture Improvements:
- Proper GenServer state management
- Integration with telemetry service
- Clean coupling with retry service
- Supervision tree integration
Phase 2: Service Layer Architecture (Weeks 3-4)
2.1 Configuration Service
Goal: Centralized configuration with hot-reload capability
Architecture:
defmodule Foundation.Services.ConfigService do
  use GenServer

  # Clean interface for agents
  def get_config(service_id, key, default \\ nil)
  def update_config(service_id, updates)
  def subscribe_to_changes(service_id, subscriber_pid)

  # Proper supervision integration
  def child_spec(opts) do
    %{
      id: __MODULE__,
      start: {__MODULE__, :start_link, [opts]},
      restart: :permanent,
      type: :worker
    }
  end
end
Integration with JidoSystem:
# In TaskAgent
def init(opts) do
  # Subscribe to config changes
  Foundation.Services.ConfigService.subscribe_to_changes(:task_agent, self())
  config = Foundation.Services.ConfigService.get_config(:task_agent, :all, default_config())
  # ... rest of init
end

def handle_info({:config_updated, new_config}, state) do
  # Hot-reload configuration
  updated_state = apply_config_changes(state, new_config)
  {:noreply, updated_state}
end
Sound Design Principles:
- Configuration changes are messages, not direct calls
- Each service manages its own configuration subscription
- No shared mutable configuration state
- Clear error boundaries for configuration failures
2.2 Service Discovery
Goal: Dynamic service discovery with health checking
Architecture:
defmodule Foundation.Services.ServiceDiscovery do
  use GenServer

  # Protocol-based interface
  @behaviour Foundation.ServiceDiscovery

  # Clean service registration
  def register_service(service_id, capabilities, health_check_fun)
  def discover_services(capability_requirements)
  def get_service_health(service_id)

  # Proper health checking with supervision
  defp schedule_health_checks(state) do
    # Use supervised Task for health checks
    Task.Supervisor.start_child(
      Foundation.TaskSupervisor,
      fn -> perform_health_checks(state.services) end
    )
  end
end
JidoSystem Integration:
# FoundationAgent registers itself
def init(opts) do
  capabilities = Keyword.get(opts, :capabilities, [])

  Foundation.Services.ServiceDiscovery.register_service(
    self(),
    capabilities,
    &health_check/0
  )

  # ... rest of init
end

# CoordinatorAgent discovers agents
def delegate_task(task) do
  required_capabilities = extract_capabilities(task)

  case Foundation.Services.ServiceDiscovery.discover_services(required_capabilities) do
    {:ok, agents} -> select_best_agent(agents, task)
    {:error, :no_agents} -> {:error, :no_suitable_agents}
  end
end
2.3 Telemetry Service
Goal: Centralized telemetry with proper aggregation
Architecture:
defmodule Foundation.Services.TelemetryService do
  use GenServer

  # Clean aggregation interface
  def emit_metric(metric_name, value, tags \\ %{})
  def register_metric_collector(collector_fun)
  def get_metrics_summary(time_range)

  # Proper state management
  defstruct [
    :metrics_buffer,
    :collectors,
    :aggregation_timer,
    :storage_backend
  ]

  # Graceful shutdown
  def terminate(_reason, state) do
    # Flush remaining metrics
    flush_metrics_buffer(state.metrics_buffer)
    :ok
  end
end
Integration Pattern:
# In agents, use the telemetry service instead of calling :telemetry directly
def process_task(task) do
  start_time = System.monotonic_time()
  result = perform_task(task)
  duration = System.monotonic_time() - start_time

  Foundation.Services.TelemetryService.emit_metric(
    "task.duration",
    duration,
    %{agent_id: self(), task_type: task.type}
  )

  result
end
Phase 3: Advanced Infrastructure (Weeks 5-6)
3.1 Connection Manager
Goal: Proper HTTP connection pooling with supervision
Architecture:
defmodule Foundation.Infrastructure.ConnectionManager do
  # This process supervises connection pools, so it is a DynamicSupervisor
  # rather than a GenServer
  use DynamicSupervisor

  # Supervised connection pools
  def child_spec(opts) do
    %{
      id: __MODULE__,
      start: {__MODULE__, :start_link, [opts]},
      restart: :permanent,
      type: :supervisor # This is a supervisor for connection pools
    }
  end

  # Clean interface
  def get_connection(service_name, opts \\ [])
  def return_connection(service_name, connection)
  def get_pool_status(service_name)

  # Proper pool management
  defp init_connection_pool(service_name, config) do
    pool_spec = {
      Finch,
      name: pool_name(service_name),
      pools: %{
        config.base_url => [
          size: config.pool_size,
          conn_opts: config.connection_opts
        ]
      }
    }

    DynamicSupervisor.start_child(__MODULE__, pool_spec)
  end
end
Sound Integration:
# Agents use the connection manager, not direct HTTP
def call_external_service(endpoint, data) do
  case Foundation.Infrastructure.ConnectionManager.get_connection(:external_api) do
    {:ok, connection} ->
      result = make_http_request(connection, endpoint, data)
      Foundation.Infrastructure.ConnectionManager.return_connection(:external_api, connection)
      result

    {:error, :no_connections} ->
      {:error, :service_unavailable}
  end
end
3.2 Rate Limiter
Goal: Proper rate limiting with Hammer integration
Architecture:
defmodule Foundation.Infrastructure.RateLimiter do
  use GenServer

  # Protocol compliance
  @behaviour Foundation.Infrastructure

  # Clean interface
  def check_rate_limit(service_id, identifier, action \\ :default)
  def get_rate_limit_status(service_id, identifier)
  def configure_rate_limit(service_id, config)

  # Proper Hammer integration (the backend must be configured
  # before the Hammer application starts)
  defp init_hammer_backend do
    Application.put_env(
      :hammer,
      :backend,
      {Hammer.Backend.ETS, [expiry_ms: 60_000 * 60 * 2]}
    )
  end
end
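Call sites can then wrap Hammer directly. A hypothetical body for check_rate_limit/3, assuming the classic Hammer.check_rate/3 API (pre-7.x) that returns {:allow, count} or {:deny, limit}; the window and limit shown are arbitrary:

def check_rate_limit(service_id, identifier, action \\ :default) do
  # Key combines service, caller, and action so limits are scoped per action
  key = "#{service_id}:#{identifier}:#{action}"

  case Hammer.check_rate(key, 60_000, 100) do
    {:allow, _count} -> :ok
    {:deny, _limit} -> {:error, :rate_limited}
  end
end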
Phase 4: Storage and Persistence (Weeks 7-8)
4.1 Event Store Service
Goal: Persistent event storage with proper supervision
Architecture:
defmodule Foundation.Services.EventStore do
  use GenServer

  # Storage backend protocol
  @behaviour Foundation.EventStorage

  # Clean event interface
  def store_event(stream_id, event_type, event_data, metadata \\ %{})
  def get_events(stream_id, from_version \\ 0)
  def subscribe_to_stream(stream_id, subscriber_pid)

  # Proper state management
  defstruct [
    :storage_backend,
    :subscribers,
    :event_buffer,
    :flush_timer
  ]
end
JidoSystem Integration:
# TaskAgent stores task events
def process_task(task) do
  # Pids must go through inspect/1 before string interpolation
  stream_id = "task_agent_#{inspect(self())}"

  Foundation.Services.EventStore.store_event(
    stream_id,
    :task_started,
    %{task_id: task.id, task_type: task.type}
  )

  result = perform_task_processing(task)

  Foundation.Services.EventStore.store_event(
    stream_id,
    :task_completed,
    %{task_id: task.id, result: result}
  )

  result
end
4.2 Persistent Task Queue
Goal: Durable task queues with proper recovery
Architecture:
defmodule Foundation.Infrastructure.PersistentQueue do
  use GenServer

  # Clean queue interface
  def enqueue(queue_name, item, priority \\ 0)
  def dequeue(queue_name)
  def peek(queue_name, count \\ 1)
  def get_queue_stats(queue_name)

  # Proper recovery
  def init(opts) do
    queue_name = Keyword.fetch!(opts, :queue_name)

    # Recover queue from storage
    stored_items = recover_queue_from_storage(queue_name)

    state = %{
      queue_name: queue_name,
      items: :queue.from_list(stored_items),
      storage_backend: get_storage_backend(),
      persistence_timer: schedule_persistence()
    }

    {:ok, state}
  end
end
TaskAgent Integration:
# TaskAgent uses a persistent queue
def init(opts) do
  queue_name = "task_queue_#{inspect(self())}"

  # Start supervised persistent queue
  queue_spec = {Foundation.Infrastructure.PersistentQueue, queue_name: queue_name}
  {:ok, queue_pid} = DynamicSupervisor.start_child(Foundation.QueueSupervisor, queue_spec)

  state = %{
    queue_pid: queue_pid
    # ... other state
  }

  {:ok, state}
end
Phase 5: Monitoring and Alerting (Weeks 9-10)
5.1 Alerting Service
Goal: Proper alert delivery with escalation
Architecture:
defmodule Foundation.Services.AlertingService do
  use GenServer

  # Clean alerting interface
  def send_alert(alert_type, severity, message, metadata \\ %{})
  def register_alert_handler(handler_id, handler_fun)
  def configure_escalation_policy(policy_name, policy_config)

  # Proper alert handling
  defstruct [
    :alert_handlers,
    :escalation_policies,
    :active_alerts,
    :alert_history
  ]

  # Integration with monitoring
  def handle_threshold_violation(metric_name, current_value, threshold) do
    alert_data = %{
      metric: metric_name,
      current_value: current_value,
      threshold: threshold,
      timestamp: DateTime.utc_now()
    }

    send_alert(:threshold_violation, :warning, "Threshold exceeded", alert_data)
  end
end
5.2 Health Check Service
Goal: Comprehensive health monitoring
Architecture:
defmodule Foundation.Services.HealthCheckService do
  use GenServer

  # Clean health check interface
  def register_health_check(service_id, check_fun, interval \\ 30_000)
  def get_service_health(service_id)
  def get_system_health()

  # Proper scheduling
  defp schedule_health_checks(state) do
    Enum.each(state.health_checks, fn {service_id, config} ->
      Process.send_after(self(), {:check_health, service_id}, config.interval)
    end)
  end
end
Implementation Guidelines
1. Supervision Tree Design
# Foundation.Application
def start(_type, _args) do
  children = [
    # Core infrastructure
    {Foundation.Infrastructure.Supervisor, []},

    # Service layer
    {Foundation.Services.Supervisor, []},

    # Task supervision for async work
    {Task.Supervisor, name: Foundation.TaskSupervisor},
    {DynamicSupervisor, name: Foundation.DynamicSupervisor},

    # Integration layer
    {JidoFoundation.Supervisor, []}
  ]

  opts = [strategy: :one_for_one, name: Foundation.Supervisor]
  Supervisor.start_link(children, opts)
end
2. Error Boundary Pattern
# Each service has proper error boundaries
def handle_call({:risky_operation, params}, _from, state) do
  try do
    result = perform_risky_operation(params)
    {:reply, {:ok, result}, state}
  rescue
    error ->
      # Log the error but don't crash the service
      Logger.error("Operation failed: #{Exception.message(error)}")

      # Emit telemetry for monitoring
      Foundation.Services.TelemetryService.emit_metric(
        "service.error",
        1,
        %{service: __MODULE__, operation: :risky_operation}
      )

      {:reply, {:error, :operation_failed}, state}
  end
end
3. Protocol-Based Integration
# JidoSystem agents use protocols, not direct service calls
def get_configuration(key) do
  # Resolve the configured implementation, then call through the protocol
  config_impl = Foundation.get_service_impl(:configuration)
  Foundation.Configuration.get_config(config_impl, :task_agent, key)
end
Success Criteria
- Zero Ad-hoc Processes - All processes under supervision
- Clean Layer Separation - No direct cross-layer dependencies
- Graceful Degradation - Services fail independently
- Proper Error Boundaries - Errors don’t cascade
- Protocol Compliance - All services implement Foundation protocols
- Comprehensive Monitoring - Full observability of service health
- Hot Configuration - Configuration updates without restart
- Persistent State - Critical state survives restarts
Testing Strategy
- Unit Tests - Each service tested in isolation
- Integration Tests - Service interactions tested
- Supervision Tests - Verify proper restart behavior
- Chaos Testing - Random service failures
- Load Testing - Verify backpressure handling
- Recovery Testing - State recovery after restart
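For example, a supervision test can kill a service and assert that its supervisor restarts it (a sketch; assumes RetryService is registered under its module name):

test "RetryService is restarted after a crash" do
  pid = Process.whereis(Foundation.Services.RetryService)
  assert is_pid(pid)

  ref = Process.monitor(pid)
  Process.exit(pid, :kill)
  assert_receive {:DOWN, ^ref, :process, ^pid, :killed}

  # Poll briefly while the supervisor restarts the child
  new_pid =
    Enum.find_value(1..50, fn _ ->
      Process.sleep(10)
      Process.whereis(Foundation.Services.RetryService)
    end)

  assert is_pid(new_pid)
  refute new_pid == pid
end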
This plan ensures we get all the functionality from lib_old while maintaining sound architectural principles and avoiding the design flaws that made the original system unmaintainable.