← Back to Docs20250627

065 SUPERVISION IMPLEMENTATION GUIDE

Documentation for 065_SUPERVISION_IMPLEMENTATION_GUIDE from the Foundation repository.

OTP Supervision Implementation Guide

Executive Summary

This guide provides detailed implementation patterns for converting unsupervised processes to proper OTP supervision across the Foundation and MABEAM systems. Based on the comprehensive audit findings, we have identified specific patterns and anti-patterns that will ensure reliable, fault-tolerant process management.

Current Supervision Architecture

✅ Existing Supervision Trees

Foundation.Application Supervision Tree

Foundation.Application (Supervisor)
├── Foundation.ProcessRegistry (GenServer)
├── Foundation.ServiceRegistry (GenServer)
├── Foundation.ConfigServer (GenServer)
├── Foundation.EventStore (GenServer)
├── Foundation.TelemetryService (GenServer)
├── Foundation.TaskSupervisor (DynamicSupervisor)
└── Foundation.RateLimiter (GenServer)

MABEAM.Application Supervision Tree

MABEAM.Application (Supervisor)
├── MABEAM.Core (GenServer)
├── MABEAM.AgentRegistry (GenServer)
├── MABEAM.AgentSupervisor (DynamicSupervisor)
├── MABEAM.Coordination (GenServer)
├── MABEAM.LoadBalancer (GenServer)
└── MABEAM.PerformanceMonitor (GenServer)

🔴 Critical Supervision Gaps

Based on OTP_SUPERVISION_AUDIT_findings.md, we have 19 critical instances of unsupervised process creation:

Foundation Layer (8 instances)

  • lib/foundation/application.ex: Lines 505, 510, 891, 896 - Monitoring processes
  • lib/foundation/process_registry.ex: Line 757 - Cleanup processes
  • lib/foundation/coordination/primitives.ex: Lines 650, 678, 687, 737, 743, 788, 794 - Distributed coordination
  • lib/foundation/beam/processes.ex: Line 229 - Memory management tasks

MABEAM Layer (11 instances)

  • lib/mabeam/coordination.ex: Line 912 - Coordination protocol processes
  • lib/mabeam/load_balancer.ex: Line 293 - Load balancing tasks
  • lib/mabeam/comms.ex: Line 311 - Communication processes
  • lib/mabeam/agent.ex: Lines 643, 773 - Agent helper processes

Implementation Patterns

Pattern 1: Convert spawn/1 to Supervised GenServer

❌ Anti-Pattern: Unsupervised spawn

# lib/foundation/application.ex:505
def start_monitoring do
  spawn(fn ->
    schedule_periodic_health_check()
    monitor_loop()
  end)
end

✅ Correct Pattern: Supervised GenServer

# lib/foundation/health_monitor.ex
defmodule Foundation.HealthMonitor do
  use GenServer
  require Logger

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  def child_spec(opts) do
    %{
      id: __MODULE__,
      start: {__MODULE__, :start_link, [opts]},
      type: :worker,
      restart: :permanent,
      shutdown: 5000
    }
  end

  @impl true
  def init(opts) do
    interval = Keyword.get(opts, :health_check_interval, 30_000)
    schedule_health_check(interval)
    {:ok, %{interval: interval, checks: %{}}}
  end

  @impl true
  def handle_info(:health_check, state) do
    perform_health_check()
    schedule_health_check(state.interval)
    {:noreply, state}
  end

  defp schedule_health_check(interval) do
    Process.send_after(self(), :health_check, interval)
  end

  defp perform_health_check do
    # Health check implementation
    services = Foundation.ServiceRegistry.list_all_services()
    Enum.each(services, &check_service_health/1)
  end
end

# Add to Foundation.Application children
children = [
  # ... existing children ...
  Foundation.HealthMonitor
]

Pattern 2: Convert Task.start to Task.Supervisor

❌ Anti-Pattern: Unsupervised Task.start

# lib/foundation/beam/processes.ex:229
def cleanup_memory do
  Task.start(fn ->
    :erlang.garbage_collect()
    cleanup_ets_tables()
  end)
end

✅ Correct Pattern: Supervised Task

def cleanup_memory do
  Task.Supervisor.start_child(Foundation.TaskSupervisor, fn ->
    :erlang.garbage_collect()
    cleanup_ets_tables()
  end)
end

# Or for async with result handling:
def cleanup_memory_async do
  task = Task.Supervisor.async(Foundation.TaskSupervisor, fn ->
    :erlang.garbage_collect()
    cleanup_ets_tables()
  end)
  
  # Handle result asynchronously
  Task.await(task, 30_000)
end

Pattern 3: Convert spawn_link to Proper Process Supervision

❌ Anti-Pattern: spawn_link without supervision

# lib/mabeam/coordination.ex:912
def start_coordination_process(protocol, params) do
  spawn_link(fn ->
    coordination_loop(protocol, params)
  end)
end

✅ Correct Pattern: DynamicSupervisor with proper child specs

# lib/mabeam/coordination_supervisor.ex
defmodule MABEAM.CoordinationSupervisor do
  use DynamicSupervisor

  def start_link(opts) do
    DynamicSupervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    DynamicSupervisor.init(strategy: :one_for_one)
  end

  def start_coordination_process(protocol, params) do
    child_spec = {MABEAM.CoordinationWorker, [protocol: protocol, params: params]}
    DynamicSupervisor.start_child(__MODULE__, child_spec)
  end

  def stop_coordination_process(pid) when is_pid(pid) do
    DynamicSupervisor.terminate_child(__MODULE__, pid)
  end
end

# lib/mabeam/coordination_worker.ex
defmodule MABEAM.CoordinationWorker do
  use GenServer

  def start_link(opts) do
    protocol = Keyword.fetch!(opts, :protocol)
    params = Keyword.fetch!(opts, :params)
    GenServer.start_link(__MODULE__, %{protocol: protocol, params: params})
  end

  @impl true
  def init(state) do
    # Initialize coordination process
    {:ok, state}
  end

  @impl true
  def handle_info(:coordinate, state) do
    # Coordination logic here
    coordination_loop(state.protocol, state.params)
    {:noreply, state}
  end
end

# Add to MABEAM.Application children
children = [
  # ... existing children ...
  MABEAM.CoordinationSupervisor
]

Pattern 4: Agent Process Supervision Fix

❌ Current Problem: DynamicSupervisor + GenServer callback mixing

# lib/mabeam/agent_supervisor.ex (current - problematic)
defmodule MABEAM.AgentSupervisor do
  use DynamicSupervisor

  @impl true  # This causes warnings!
  def handle_cast({:start_agent, agent_spec}, state) do
    # Wrong: DynamicSupervisor doesn't support GenServer callbacks
  end
end

✅ Correct Pattern: Separate concerns

# lib/mabeam/agent_supervisor.ex (fixed)
defmodule MABEAM.AgentSupervisor do
  use DynamicSupervisor

  def start_link(opts) do
    DynamicSupervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    DynamicSupervisor.init(strategy: :one_for_one)
  end

  # Direct function calls - NO GenServer callbacks
  def start_agent(agent_module, init_args, agent_id) do
    child_spec = {agent_module, [init_args, [name: agent_id]]}
    DynamicSupervisor.start_child(__MODULE__, child_spec)
  end

  def stop_agent(agent_pid) when is_pid(agent_pid) do
    # CORRECT: Use PID directly with DynamicSupervisor
    DynamicSupervisor.terminate_child(__MODULE__, agent_pid)
  end

  def get_running_agents do
    DynamicSupervisor.which_children(__MODULE__)
  end
end

# lib/mabeam/agent_registry.ex (handles metadata and coordination)
defmodule MABEAM.AgentRegistry do
  use GenServer

  # Agent lifecycle coordination - separate from process management
  def register_agent(agent_id, config) do
    GenServer.call(__MODULE__, {:register, agent_id, config})
  end

  def start_agent(agent_id) do
    with {:ok, config} <- get_agent_config(agent_id),
         {:ok, pid} <- MABEAM.AgentSupervisor.start_agent(
           config.module, config.init_args, agent_id
         ),
         :ok <- GenServer.call(__MODULE__, {:agent_started, agent_id, pid}) do
      Process.monitor(pid)  # Monitor for crash detection
      {:ok, pid}
    end
  end

  def stop_agent(agent_id) do
    with {:ok, pid} <- get_agent_pid(agent_id),
         :ok <- MABEAM.AgentSupervisor.stop_agent(pid),
         :ok <- GenServer.call(__MODULE__, {:agent_stopped, agent_id}) do
      :ok
    end
  end

  # GenServer callbacks handle state management
  @impl true
  def handle_info({:DOWN, _ref, :process, pid, _reason}, state) do
    # Update agent status on crash
    agent_id = find_agent_by_pid(state, pid)
    new_state = update_agent_status(state, agent_id, :crashed)
    {:noreply, new_state}
  end
end

Detailed Implementation Plan

Phase 1: Critical Process Supervision (Week 1)

Day 1-2: Foundation Monitoring Processes

Fix lib/foundation/application.ex:505,510,891,896

# Create Foundation.HealthMonitor
defmodule Foundation.HealthMonitor do
  use GenServer
  
  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end
  
  def init(opts) do
    interval = Keyword.get(opts, :check_interval, 30_000)
    schedule_health_check(interval)
    {:ok, %{interval: interval}}
  end
  
  def handle_info(:health_check, state) do
    perform_system_health_check()
    schedule_health_check(state.interval)
    {:noreply, state}
  end
end

# Create Foundation.ServiceMonitor  
defmodule Foundation.ServiceMonitor do
  use GenServer
  
  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end
  
  def init(opts) do
    interval = Keyword.get(opts, :monitor_interval, 10_000)
    schedule_service_monitoring(interval)
    {:ok, %{interval: interval, services: %{}}}
  end
  
  def handle_info(:monitor_services, state) do
    monitor_all_services()
    schedule_service_monitoring(state.interval)
    {:noreply, state}
  end
end

# Update Foundation.Application children
children = [
  Foundation.ProcessRegistry,
  Foundation.ServiceRegistry,
  Foundation.ConfigServer,
  Foundation.EventStore,
  Foundation.TelemetryService,
  Foundation.HealthMonitor,     # NEW
  Foundation.ServiceMonitor,    # NEW
  Foundation.TaskSupervisor,
  Foundation.RateLimiter
]

Day 3: MABEAM Coordination Process

Fix lib/mabeam/coordination.ex:912

# Create MABEAM.CoordinationSupervisor
defmodule MABEAM.CoordinationSupervisor do
  use DynamicSupervisor
  
  def start_link(opts) do
    DynamicSupervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end
  
  @impl true
  def init(_opts) do
    DynamicSupervisor.init(strategy: :one_for_one)
  end
  
  def start_coordination_process(protocol, params) do
    child_spec = {MABEAM.CoordinationWorker, [protocol: protocol, params: params]}
    DynamicSupervisor.start_child(__MODULE__, child_spec)
  end
end

# Update MABEAM.Application children
children = [
  MABEAM.Core,
  MABEAM.AgentRegistry,
  MABEAM.AgentSupervisor,
  MABEAM.Coordination,
  MABEAM.CoordinationSupervisor,  # NEW
  MABEAM.LoadBalancer,
  MABEAM.PerformanceMonitor
]

Day 4-5: Task.start Migration

Fix lib/foundation/beam/processes.ex:229

# Replace Task.start with supervised alternatives
def cleanup_memory do
  Task.Supervisor.start_child(Foundation.TaskSupervisor, fn ->
    :erlang.garbage_collect()
    cleanup_ets_tables()
    compact_memory()
  end)
end

def cleanup_memory_sync do
  Task.Supervisor.async(Foundation.TaskSupervisor, fn ->
    :erlang.garbage_collect()
    cleanup_ets_tables()
    compact_memory()
  end)
  |> Task.await(30_000)
end

Phase 2: Coordination Primitives (Week 2)

Fix lib/foundation/coordination/primitives.ex (7 instances)

# Replace all spawn calls with supervised tasks
defmodule Foundation.Coordination.Primitives do
  # Lines 650, 678, 687, 737, 743, 788, 794
  
  def start_distributed_lock(lock_id, timeout \\ 30_000) do
    Task.Supervisor.start_child(Foundation.TaskSupervisor, fn ->
      acquire_distributed_lock(lock_id, timeout)
    end)
  end
  
  def start_leader_election(election_id, candidates) do
    Task.Supervisor.start_child(Foundation.TaskSupervisor, fn ->
      run_leader_election(election_id, candidates)
    end)
  end
  
  def start_barrier_coordination(barrier_id, participant_count) do
    Task.Supervisor.start_child(Foundation.TaskSupervisor, fn ->
      coordinate_barrier(barrier_id, participant_count)
    end)
  end
end

Phase 3: Agent Supervision Architecture (Week 3)

Fix MABEAM Agent System

# lib/mabeam/agent_supervisor.ex - Remove GenServer callbacks
defmodule MABEAM.AgentSupervisor do
  use DynamicSupervisor
  
  # ONLY DynamicSupervisor callbacks - NO GenServer callbacks
  def init(_opts) do
    DynamicSupervisor.init(strategy: :one_for_one)
  end
  
  # Direct function calls for agent management
  def start_agent(agent_module, init_args, agent_id) do
    child_spec = {agent_module, [init_args, [name: agent_id]]}
    DynamicSupervisor.start_child(__MODULE__, child_spec)
  end
  
  def stop_agent(agent_pid) when is_pid(agent_pid) do
    # CORRECT: Use PID directly with DynamicSupervisor
    DynamicSupervisor.terminate_child(__MODULE__, agent_pid)
  end
end

# lib/mabeam/agent_registry.ex - Pure GenServer for state management
defmodule MABEAM.AgentRegistry do
  use GenServer
  
  def start_agent(agent_id) do
    with {:ok, config} <- get_agent_config(agent_id),
         {:ok, pid} <- MABEAM.AgentSupervisor.start_agent(
           config.module, config.init_args, agent_id
         ),
         :ok <- GenServer.call(__MODULE__, {:agent_started, agent_id, pid}) do
      Process.monitor(pid)  # Monitor for crash detection
      {:ok, pid}
    end
  end
  
  def stop_agent(agent_id) do
    with {:ok, pid} <- get_agent_pid(agent_id),
         :ok <- MABEAM.AgentSupervisor.stop_agent(pid),
         :ok <- GenServer.call(__MODULE__, {:agent_stopped, agent_id}) do
      :ok
    end
  end
  
  # GenServer callbacks handle state management
  def handle_info({:DOWN, _ref, :process, pid, _reason}, state) do
    # Update agent status on crash
    agent_id = find_agent_by_pid(state, pid)
    new_state = update_agent_status(state, agent_id, :crashed)
    {:noreply, new_state}
  end
end

Testing Supervision Implementation

Supervision Test Patterns

Test Process Lifecycle

defmodule Foundation.HealthMonitorTest do
  use ExUnit.Case
  
  test "health monitor starts and stops cleanly" do
    {:ok, pid} = Foundation.HealthMonitor.start_link([])
    assert Process.alive?(pid)
    
    GenServer.stop(pid)
    refute Process.alive?(pid)
  end
  
  test "health monitor automatically restarts on crash" do
    {:ok, supervisor_pid} = Supervisor.start_link([Foundation.HealthMonitor], strategy: :one_for_one)
    
    [{_, monitor_pid, _, _}] = Supervisor.which_children(supervisor_pid)
    assert Process.alive?(monitor_pid)
    
    # Crash the monitor
    Process.exit(monitor_pid, :kill)
    
    # Supervisor should restart it
    :timer.sleep(100)
    [{_, new_monitor_pid, _, _}] = Supervisor.which_children(supervisor_pid)
    assert Process.alive?(new_monitor_pid)
    assert new_monitor_pid != monitor_pid
  end
end

Test Agent Supervision

defmodule MABEAM.AgentLifecycleTest do
  use ExUnit.Case
  
  test "agent lifecycle with proper supervision" do
    # Start agent through registry
    {:ok, agent_pid} = MABEAM.AgentRegistry.start_agent(:test_agent)
    assert Process.alive?(agent_pid)
    
    # Stop agent gracefully
    :ok = MABEAM.AgentRegistry.stop_agent(:test_agent)
    refute Process.alive?(agent_pid)
    
    # Verify cleanup
    {:error, :not_found} = MABEAM.AgentRegistry.get_agent_pid(:test_agent)
  end
  
  test "agent crash detection and status update" do
    {:ok, agent_pid} = MABEAM.AgentRegistry.start_agent(:test_agent)
    
    # Simulate crash
    Process.exit(agent_pid, :kill)
    
    # Verify status updated
    :timer.sleep(100)  # Allow monitoring to process
    {:ok, status} = MABEAM.AgentRegistry.get_agent_status(:test_agent)
    assert status.current_status == :crashed
  end
end

Success Criteria

Technical Verification

  • Zero spawn, spawn_link, or Task.start calls in lib/ directory
  • All long-running processes under OTP supervision
  • Clean compilation with no callback warnings
  • All tests pass with proper process cleanup

Functional Verification

  • Health monitoring continues after process crashes
  • Agent coordination survives individual agent failures
  • Memory cleanup tasks properly supervised
  • Service monitoring automatically restarts on failure

Reliability Verification

  • System survives stress testing with process crashes
  • No orphaned processes after test runs
  • Graceful shutdown of all supervised processes
  • Automatic recovery from coordination failures

Anti-Patterns to Avoid

❌ Don’t: Mix Supervisor Types

defmodule BadSupervisor do
  use DynamicSupervisor
  
  # Wrong: DynamicSupervisor doesn't support GenServer callbacks
  def handle_call(:get_children, _from, state) do
    # This will cause warnings and unexpected behavior
  end
end

❌ Don’t: Use spawn for Long-Running Processes

# Wrong: Unsupervised long-running process
def start_monitor do
  spawn(fn -> 
    loop() # Infinite loop without supervision
  end)
end

❌ Don’t: Ignore Process Exit Reasons

# Wrong: No handling of different exit reasons
def handle_info({:DOWN, _ref, :process, _pid, _reason}, state) do
  # Should handle different reasons differently
  {:noreply, state}
end

✅ Do: Handle Exit Reasons Properly

def handle_info({:DOWN, _ref, :process, pid, reason}, state) do
  case reason do
    :normal -> handle_normal_exit(pid, state)
    :shutdown -> handle_shutdown(pid, state)
    {:shutdown, _} -> handle_shutdown(pid, state)
    _ -> handle_crash(pid, reason, state)
  end
end

Conclusion

This implementation guide provides concrete patterns for converting all unsupervised processes to proper OTP supervision. The key principles are:

  1. Use appropriate OTP behaviors for different process types
  2. Separate concerns between process lifecycle and state management
  3. Monitor processes for crash detection and status updates
  4. Test supervision behavior to ensure fault tolerance
  5. Follow OTP conventions for restart strategies and child specs

Following these patterns will result in a robust, fault-tolerant system that leverages the full power of OTP supervision for production-grade reliability.