058 PROCESS HIERARCHY

Documentation for 058_PROCESS_HIERARCHY from the Foundation repository.

Process Hierarchy & Supervision Architecture

Overview

This document defines the complete process supervision architecture for the Foundation + MABEAM system, addressing the critical patterns for process lifecycle management, fault tolerance, and clean shutdown procedures.

Foundation Layer Process Hierarchy

Application Supervision Tree

Foundation.Application (Supervisor, strategy: :one_for_one)
├── Foundation.ProcessRegistry (GenServer)
│   ├── ETS Table: :foundation_processes
│   ├── Namespace: :production | :test  
│   └── Responsibilities: Process discovery, service registration
│
├── Foundation.ServiceRegistry (GenServer) 
│   ├── Depends on: ProcessRegistry
│   ├── Responsibilities: High-level service coordination, dependency management
│   └── Fallback: ConfigServer for service unavailability
│
├── Foundation.TelemetryService (GenServer)
│   ├── Responsibilities: Metrics collection, event emission
│   └── Integration: :telemetry events, monitoring dashboards
│
├── Foundation.ConfigServer (GenServer)
│   ├── Responsibilities: Configuration management, change notifications
│   ├── Storage: ETS table with persistent backing
│   └── Change notifications: PubSub pattern
│
├── Foundation.EventStore (GenServer)  
│   ├── Responsibilities: Event persistence, querying, replay
│   ├── Storage: Persistent term storage
│   └── Pattern: Event sourcing with append-only log
│
├── Foundation.ConnectionManager (GenServer)
│   ├── Responsibilities: Network topology, node management
│   └── Distribution: Multi-node cluster coordination
│
├── Foundation.RateLimiter (GenServer)
│   ├── Responsibilities: Traffic control, resource protection
│   └── Algorithm: Token bucket with ETS backing
│
├── Foundation.TaskSupervisor (DynamicSupervisor)
│   ├── Strategy: :one_for_one
│   ├── Responsibilities: Ad-hoc task execution
│   └── Max children: :infinity
│
├── Foundation.HealthMonitor (GenServer)
│   ├── Responsibilities: System health tracking, alerting
│   ├── Monitors: All Foundation services + MABEAM components
│   └── Integration: Telemetry pipeline
│
└── Foundation.ServiceMonitor (GenServer)
    ├── Responsibilities: Service availability monitoring
    ├── Health checks: Periodic service pings
    └── Recovery: Automatic restart coordination
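
As a rough illustration of how a tree like this is declared, the sketch below starts a :one_for_one supervisor over three stand-in children. TreeSketch, TreeSketch.Stub, and the child names are hypothetical and only mimic the shape of the tree above; the real Foundation services are not reproduced here.

```elixir
defmodule TreeSketch.Stub do
  # Stand-in for a Foundation service; holds an empty map as state.
  use Agent

  def start_link(name), do: Agent.start_link(fn -> %{} end, name: name)
end

defmodule TreeSketch do
  # Hypothetical names standing in for services such as ProcessRegistry.
  @services [:sketch_process_registry, :sketch_config_server, :sketch_event_store]

  def start_link do
    children =
      for name <- @services do
        # Each child needs a unique :id because all three share one module.
        Supervisor.child_spec({TreeSketch.Stub, name}, id: name)
      end

    # :one_for_one restarts only the child that failed.
    Supervisor.start_link(children, strategy: :one_for_one)
  end

  def services, do: @services
end
```

Killing one child leaves the PIDs of its siblings unchanged, which is the isolation property the tree relies on.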

Process Registry Architecture

Namespace Isolation Pattern

# Process registration with namespace isolation
# Production: {:production, service_name}
# Test: {:test, service_name}

ProcessRegistry.register({namespace, :config_server}, self())
ProcessRegistry.lookup({namespace, :config_server})
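
The same keying scheme can be tried out with Elixir's built-in Registry. This is a minimal sketch, not the Foundation implementation: NamespaceSketch and its registry name are hypothetical, and unlike the calls above it always registers the calling process.

```elixir
defmodule NamespaceSketch do
  # Hypothetical registry name; the real ProcessRegistry is ETS-backed.
  @registry NamespaceSketch.Registry

  def start_link, do: Registry.start_link(keys: :unique, name: @registry)

  # Registers the calling process under a {namespace, service} key.
  def register({_namespace, _service} = key) do
    case Registry.register(@registry, key, nil) do
      {:ok, _owner} -> :ok
      {:error, {:already_registered, pid}} -> {:error, {:already_registered, pid}}
    end
  end

  # Looks a key up; the same service name in another namespace is a miss.
  def lookup({_namespace, _service} = key) do
    case Registry.lookup(@registry, key) do
      [{pid, _value}] -> {:ok, pid}
      [] -> {:error, :not_found}
    end
  end
end
```

Because the namespace is part of the key, a process registered under {:test, :config_server} is invisible to lookups for {:production, :config_server}.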

Service Discovery Flow

Client Request → ProcessRegistry.lookup(service) → 
  ├─ Found: Return PID
  ├─ Not Found: Check ServiceRegistry for startup
  └─ Unavailable: Fallback to ConfigServer/ETS cache

MABEAM Layer Process Hierarchy

Agent Supervision Architecture

MABEAM.Application (Supervisor, strategy: :one_for_one)
├── MABEAM.Core (GenServer)
│   ├── Responsibilities: Universal Variable Orchestration
│   ├── State: Variable spaces, optimization contexts
│   └── Coordination: Parameter optimization across agents
│
├── MABEAM.AgentRegistry (GenServer) 
│   ├── Responsibilities: Agent metadata, lifecycle tracking
│   ├── State: Agent configurations, status, metrics
│   ├── ETS Table: Agent registry with fast lookups
│   └── Process monitoring: Automatic status updates on crashes
│
├── MABEAM.AgentSupervisor (DynamicSupervisor)
│   ├── Strategy: :one_for_one  
│   ├── Max children: :infinity
│   ├── Responsibilities: Agent process lifecycle
│   ├── Child spec: Supervisor.child_spec({agent_module, init_args}, id: agent_id)
│   └── Termination: By PID using DynamicSupervisor.terminate_child/2
│
├── MABEAM.CoordinationSupervisor (Supervisor, strategy: :one_for_one)
│   ├── MABEAM.Coordination (GenServer)
│   │   ├── Responsibilities: Multi-agent consensus protocols
│   │   ├── Algorithms: Raft-like consensus, leader election
│   │   └── State: Coordination contexts, voting records
│   │
│   ├── MABEAM.Economics (GenServer)
│   │   ├── Responsibilities: Resource allocation, cost optimization
│   │   ├── State: Resource usage, cost models
│   │   └── Integration: LoadBalancer for workload distribution
│   │
│   └── MABEAM.LoadBalancer (GenServer)
│       ├── Responsibilities: Workload distribution across agents
│       ├── Algorithms: Round-robin, weighted distribution
│       └── Metrics: Agent performance, resource utilization
│
└── MABEAM.PerformanceMonitor (GenServer)
    ├── Responsibilities: Agent performance tracking
    ├── Metrics: Throughput, latency, resource usage
    ├── Integration: Foundation telemetry pipeline
    └── Alerting: Performance degradation detection

Agent Lifecycle Management

Critical Supervision Pattern Fix

The current issue stems from mixing DynamicSupervisor behavior with GenServer callbacks. Here’s the correct architecture:

AgentSupervisor (Pure DynamicSupervisor)

defmodule MABEAM.AgentSupervisor do
  use DynamicSupervisor

  def start_link(opts \\ []) do
    # Register under the module name so the functions below can address it
    DynamicSupervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  # ONLY DynamicSupervisor callbacks - NO GenServer callbacks
  @impl true
  def init(_opts) do
    DynamicSupervisor.init(strategy: :one_for_one)
  end

  # Direct function calls for agent management
  def start_agent(agent_module, init_args, agent_id) do
    # Expands to agent_module.child_spec([init_args, [name: agent_id]])
    child_spec = {agent_module, [init_args, [name: agent_id]]}
    DynamicSupervisor.start_child(__MODULE__, child_spec)
  end

  def stop_agent(agent_pid) when is_pid(agent_pid) do
    # CORRECT: Use PID directly with DynamicSupervisor
    DynamicSupervisor.terminate_child(__MODULE__, agent_pid)
  end
end

AgentRegistry (Pure GenServer)

defmodule MABEAM.AgentRegistry do
  use GenServer

  # Agent lifecycle coordination
  def register_agent(agent_id, config) do
    GenServer.call(__MODULE__, {:register, agent_id, config})
  end

  def start_agent(agent_id) do
    with {:ok, config} <- get_agent_config(agent_id),
         {:ok, pid} <- MABEAM.AgentSupervisor.start_agent(
           config.module, config.init_args, agent_id
         ),
         :ok <- GenServer.call(__MODULE__, {:agent_started, agent_id, pid}) do
      {:ok, pid}
    end
  end

  def stop_agent(agent_id) do
    with {:ok, pid} <- get_agent_pid(agent_id),
         :ok <- MABEAM.AgentSupervisor.stop_agent(pid),
         :ok <- GenServer.call(__MODULE__, {:agent_stopped, agent_id}) do
      :ok
    end
  end

  # GenServer callbacks handle state management
  # (init/1, the other handle_call/3 clauses, and the helpers are omitted here)
  @impl true
  def handle_call({:agent_started, agent_id, pid}, _from, state) do
    # Monitor from inside the registry process so the :DOWN message is
    # delivered here rather than to the caller of start_agent/1.
    # put_agent_pid/3 is an omitted helper, like find_agent_by_pid/2.
    Process.monitor(pid)
    {:reply, :ok, put_agent_pid(state, agent_id, pid)}
  end

  @impl true
  def handle_info({:DOWN, _ref, :process, pid, _reason}, state) do
    # Update agent status on crash
    agent_id = find_agent_by_pid(state, pid)
    new_state = update_agent_status(state, agent_id, :crashed)
    {:noreply, new_state}
  end
end

Process Termination Sequence

Correct OTP Shutdown Pattern

1. AgentRegistry.stop_agent(agent_id)
2. ├─ Lookup agent PID from registry state
3. ├─ MABEAM.AgentSupervisor.stop_agent(pid)
4. │  └─ DynamicSupervisor.terminate_child(__MODULE__, pid)
5. │     └─ Supervisor sends an exit signal with reason :shutdown to the agent process
6. │        └─ Agent process terminates (terminate/2 runs only if the agent traps exits)
7. ├─ AgentRegistry receives the {:DOWN, ...} message from its monitor
8. ├─ AgentRegistry updates agent status to :stopped
9. └─ Return :ok to caller

Critical Fix: No Process.sleep Required

Following SLEEP.md principles, the OTP supervision system guarantees:

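DynamicSupervisor.terminate_child/2 returns only after the child process has terminated
Process monitors deliver {:DOWN, ...} messages asynchronously, so wait for them with receive (or assert_receive in tests) instead of sleeping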
No artificial delays needed - trust OTP guarantees
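
The guarantees above can be checked with a small experiment. The sketch below uses a hypothetical ShutdownSketch.Worker in place of a real agent: it stops the worker through an anonymous DynamicSupervisor, checks liveness immediately, and then waits for the :DOWN message with receive rather than a sleep.

```elixir
defmodule ShutdownSketch.Worker do
  # Hypothetical stand-in for an agent process.
  use GenServer

  def start_link(arg), do: GenServer.start_link(__MODULE__, arg)

  @impl true
  def init(arg), do: {:ok, arg}
end

defmodule ShutdownSketch do
  # Starts a worker, stops it through its supervisor, and reports what the
  # caller can observe without sleeping.
  def run do
    {:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)
    {:ok, pid} = DynamicSupervisor.start_child(sup, {ShutdownSketch.Worker, :ok})
    ref = Process.monitor(pid)

    # Returns only after the supervisor has seen the child exit.
    :ok = DynamicSupervisor.terminate_child(sup, pid)
    alive_after_stop = Process.alive?(pid)

    # Block on the :DOWN message explicitly; the timeout only guards
    # against hanging forever if something is wrong.
    reason =
      receive do
        {:DOWN, ^ref, :process, ^pid, reason} -> reason
      after
        1_000 -> :timeout
      end

    %{alive_after_stop: alive_after_stop, down_reason: reason}
  end
end
```

The worker does not trap exits, so the supervisor's :shutdown exit signal terminates it directly and the monitor reports :shutdown as the reason.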

Fault Tolerance Patterns

Supervision Strategies

Foundation Services: :one_for_one

Individual service failures don’t affect other services
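Failed services restart automatically, bounded by the supervisor's restart intensity (:max_restarts within :max_seconds); OTP supervisors apply no backoff of their own, so any exponential backoff must live in the service itself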
Dependency-aware startup order through ServiceRegistry

Agent Processes: :one_for_one

Agent failures are isolated to individual processes
Failed agents can be restarted by coordination system
Agent state is maintained separately in AgentRegistry

Error Recovery Patterns

Service Unavailability Handling

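def get_service_pid(service_name, retries \\ 1) do
  case ProcessRegistry.lookup({namespace(), service_name}) do
    {:ok, pid} when is_pid(pid) ->
      {:ok, pid}

    {:error, :not_found} ->
      # Graceful fallback to ConfigServer
      ConfigServer.get_fallback_config(service_name)

    {:error, :process_dead} when retries > 0 ->
      # Give the supervisor a moment to restart the service, then retry once
      Process.sleep(100)
      get_service_pid(service_name, retries - 1)

    {:error, :process_dead} ->
      # Still down after the retry: report it instead of looping forever
      {:error, :service_unavailable}
  end
end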

Agent Crash Recovery

def handle_info({:DOWN, _ref, :process, pid, reason}, state) do
  agent_id = find_agent_by_pid(state, pid)

  # Bind the result of the case so the updated state is actually returned
  new_state =
    case reason do
      :shutdown ->
        # Expected termination
        update_agent_status(state, agent_id, :stopped)

      :normal ->
        # Successful completion
        update_agent_status(state, agent_id, :completed)

      _error ->
        # Unexpected crash - mark for potential restart
        maybe_restart_agent(agent_id, reason)
        update_agent_status(state, agent_id, :crashed)
    end

  {:noreply, new_state}
end
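
The reason-to-status mapping can also be pulled out into a pure function, which makes it easy to test without spawning processes. The sketch below is one possible shape; ExitClassifier is a hypothetical module, and treating {:shutdown, term} like :shutdown is an added assumption, based on OTP using that tuple form for orderly shutdowns that carry extra detail.

```elixir
defmodule ExitClassifier do
  # Maps a process exit reason to the agent statuses used in this document.

  # Successful completion.
  def status_for(:normal), do: :completed

  # Expected termination requested by a supervisor.
  def status_for(:shutdown), do: :stopped

  # Assumption: shutdown with attached detail is also an expected stop.
  def status_for({:shutdown, _detail}), do: :stopped

  # Anything else (including :kill / :killed) counts as a crash.
  def status_for(_other), do: :crashed
end
```

A handle_info/2 clause for :DOWN messages could then call ExitClassifier.status_for(reason) and pass the result to the status update.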

Process Communication Patterns

Service-to-Service Communication

GenServer.call(ServiceRegistry.get_pid(:config_server), request)
  ↓
ProcessRegistry.lookup({:production, :config_server})
  ↓
GenServer.call(config_server_pid, request)

Agent Coordination Communication

MABEAM.Core → Variable Update → Coordination.broadcast_update(agents)
  ↓
For each agent: GenServer.cast(agent_pid, {:variable_update, variable, value})
  ↓
Agent processes update their internal state asynchronously
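
A minimal sketch of this broadcast, assuming agents are GenServers that keep their variables in a map. BroadcastSketch and BroadcastSketch.Worker are hypothetical names; only the {:variable_update, variable, value} message shape comes from the flow above.

```elixir
defmodule BroadcastSketch.Worker do
  # Hypothetical agent that stores variables in a map.
  use GenServer

  def start_link(_arg \\ []), do: GenServer.start_link(__MODULE__, %{})

  def get(pid, variable), do: GenServer.call(pid, {:get, variable})

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_cast({:variable_update, variable, value}, state) do
    # Applied asynchronously, whenever the agent reaches this message.
    {:noreply, Map.put(state, variable, value)}
  end

  @impl true
  def handle_call({:get, variable}, _from, state) do
    {:reply, Map.fetch(state, variable), state}
  end
end

defmodule BroadcastSketch do
  # Fire-and-forget update to every agent; returns :ok immediately.
  def broadcast_update(agent_pids, variable, value) do
    Enum.each(agent_pids, &GenServer.cast(&1, {:variable_update, variable, value}))
  end
end
```

Because messages between one sender and one receiver arrive in order, a call made by the broadcaster after the cast sees the updated value; other processes get no such ordering guarantee.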

Event-Driven Communication

Event Source → EventStore.append(event) → Event Processing →
Telemetry.emit(metrics) → Monitoring Dashboard → Alerts

Performance & Scalability

Process Scaling Characteristics

Foundation Services: Fixed set (10-15 processes) regardless of load
Agent Processes: Scales linearly with workload (1 process per agent)
BEAM Efficiency: Lightweight processes; the VM can run millions of them once the process limit is raised with the +P emulator flag

Memory Management

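# Agent processes have bounded memory via :max_heap_size (size is in words).
# It must be passed through :spawn_opt; as a top-level option it is ignored.
{:ok, pid} = Agent.start_link(fn -> %{} end,
  spawn_opt: [max_heap_size: %{size: 1_000_000, kill: true, error_logger: true}]
)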

CPU Resource Management

Process priorities via Process.flag(:priority, :high) for critical services
CPU quota management through external tools (systemd, Docker limits)
Rate limiting prevents CPU exhaustion from runaway processes

Testing & Development

Namespace Isolation for Testing

# Test configuration uses separate namespace
config :foundation, namespace: :test

# Production configuration  
config :foundation, namespace: :production
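
Code that builds registry keys can read this setting at runtime. The sketch below shows one way a namespace/0 helper might look; NamespaceConfigSketch is a hypothetical module, and falling back to :production when the key is unset is an assumption.

```elixir
defmodule NamespaceConfigSketch do
  # Reads the namespace set by `config :foundation, namespace: ...`.
  # Assumption: default to :production when the key is not configured.
  def namespace do
    Application.get_env(:foundation, :namespace, :production)
  end

  # Builds the registry key for a service, e.g. {:test, :config_server}.
  def key(service_name), do: {namespace(), service_name}
end
```

Reading the value on each call, rather than caching it in a module attribute, lets tests switch namespaces with Application.put_env/3.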

Supervision Tree Testing

# Test supervisor startup/shutdown
{:ok, pid} = Foundation.Application.start(:normal, [])
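# Stop the supervisor itself; the Application.stop/1 callback only receives
# the application state and does not shut the tree down
:ok = Supervisor.stop(pid)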

# Verify all processes terminated
assert ProcessRegistry.count({:test, :all}) == 0

Agent Lifecycle Testing

# Test complete agent lifecycle
{:ok, agent_pid} = MABEAM.AgentRegistry.start_agent(:test_agent)
assert Process.alive?(agent_pid)

:ok = MABEAM.AgentRegistry.stop_agent(:test_agent) 
refute Process.alive?(agent_pid)  # Should be false after stop

Troubleshooting Guide

Common Supervision Issues

Issue: Processes Not Terminating

# WRONG: Using wrong termination method
DynamicSupervisor.terminate_child(supervisor, agent_id)  # agent_id is not PID

# CORRECT: Use PID directly  
DynamicSupervisor.terminate_child(supervisor, agent_pid)  # agent_pid is PID

Issue: GenServer Callbacks in DynamicSupervisor

# WRONG: Mixing behaviors
defmodule AgentSupervisor do
  use DynamicSupervisor
  
  @impl true  # This causes warnings!
  def handle_cast(...), do: ...
end

# CORRECT: Pure DynamicSupervisor
defmodule AgentSupervisor do
  use DynamicSupervisor
  
  # Only DynamicSupervisor callbacks
  @impl true
  def init(_opts), do: DynamicSupervisor.init(strategy: :one_for_one)
end

Issue: Children Not Cleaned Up

# WRONG: Not waiting for termination
MABEAM.AgentSupervisor.stop_agent(pid)
# Immediately check children count - may still be terminating

# CORRECT: Trust OTP guarantees
:ok = MABEAM.AgentSupervisor.stop_agent(pid)
# stop_agent/1 returns :ok only after successful termination

Summary

This process hierarchy provides:

Clear Separation: DynamicSupervisor for process lifecycle, GenServer for state management
Proper OTP Patterns: Following supervision best practices without mixing behaviors
Fault Tolerance: Isolated failures with automatic recovery
Testing Support: Namespace isolation and deterministic cleanup
Performance: Efficient process scaling and resource management

The key insight is maintaining clear boundaries between different OTP behaviors and trusting the built-in guarantees rather than adding artificial synchronization.