OTP Supervision Implementation Guide
Executive Summary
This guide provides detailed implementation patterns for converting unsupervised processes to proper OTP supervision across the Foundation and MABEAM systems. Drawing on the supervision audit findings, it pairs each anti-pattern found in the codebase with a supervised replacement, so that process failures are detected and restarted rather than lost.
Current Supervision Architecture
✅ Existing Supervision Trees
Foundation.Application Supervision Tree
Foundation.Application (Supervisor)
├── Foundation.ProcessRegistry (GenServer)
├── Foundation.ServiceRegistry (GenServer)
├── Foundation.ConfigServer (GenServer)
├── Foundation.EventStore (GenServer)
├── Foundation.TelemetryService (GenServer)
├── Foundation.TaskSupervisor (DynamicSupervisor)
└── Foundation.RateLimiter (GenServer)
MABEAM.Application Supervision Tree
MABEAM.Application (Supervisor)
├── MABEAM.Core (GenServer)
├── MABEAM.AgentRegistry (GenServer)
├── MABEAM.AgentSupervisor (DynamicSupervisor)
├── MABEAM.Coordination (GenServer)
├── MABEAM.LoadBalancer (GenServer)
└── MABEAM.PerformanceMonitor (GenServer)
🔴 Critical Supervision Gaps
Based on OTP_SUPERVISION_AUDIT_findings.md, we have 19 critical instances of unsupervised process creation:
Foundation Layer (8 instances)
- lib/foundation/application.ex: Lines 505, 510, 891, 896 - Monitoring processes
- lib/foundation/process_registry.ex: Line 757 - Cleanup processes
- lib/foundation/coordination/primitives.ex: Lines 650, 678, 687, 737, 743, 788, 794 - Distributed coordination
- lib/foundation/beam/processes.ex: Line 229 - Memory management tasks
MABEAM Layer (11 instances)
- lib/mabeam/coordination.ex: Line 912 - Coordination protocol processes
- lib/mabeam/load_balancer.ex: Line 293 - Load balancing tasks
- lib/mabeam/comms.ex: Line 311 - Communication processes
- lib/mabeam/agent.ex: Lines 643, 773 - Agent helper processes
Implementation Patterns
Pattern 1: Convert spawn/1 to Supervised GenServer
❌ Anti-Pattern: Unsupervised spawn
# lib/foundation/application.ex:505
def start_monitoring do
  spawn(fn ->
    schedule_periodic_health_check()
    monitor_loop()
  end)
end
✅ Correct Pattern: Supervised GenServer
# lib/foundation/health_monitor.ex
defmodule Foundation.HealthMonitor do
  use GenServer
  require Logger

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  def child_spec(opts) do
    %{
      id: __MODULE__,
      start: {__MODULE__, :start_link, [opts]},
      type: :worker,
      restart: :permanent,
      shutdown: 5000
    }
  end

  @impl true
  def init(opts) do
    interval = Keyword.get(opts, :health_check_interval, 30_000)
    schedule_health_check(interval)
    {:ok, %{interval: interval, checks: %{}}}
  end

  @impl true
  def handle_info(:health_check, state) do
    perform_health_check()
    schedule_health_check(state.interval)
    {:noreply, state}
  end

  defp schedule_health_check(interval) do
    Process.send_after(self(), :health_check, interval)
  end

  defp perform_health_check do
    # Health check implementation
    services = Foundation.ServiceRegistry.list_all_services()
    Enum.each(services, &check_service_health/1)
  end
end

# Add to Foundation.Application children
children = [
  # ... existing children ...
  Foundation.HealthMonitor
]
Pattern 2: Convert Task.start to Task.Supervisor
❌ Anti-Pattern: Unsupervised Task.start
# lib/foundation/beam/processes.ex:229
def cleanup_memory do
  Task.start(fn ->
    :erlang.garbage_collect()
    cleanup_ets_tables()
  end)
end
✅ Correct Pattern: Supervised Task
def cleanup_memory do
  Task.Supervisor.start_child(Foundation.TaskSupervisor, fn ->
    :erlang.garbage_collect()
    cleanup_ets_tables()
  end)
end

# Or, when the caller needs the result:
def cleanup_memory_async do
  task =
    Task.Supervisor.async(Foundation.TaskSupervisor, fn ->
      :erlang.garbage_collect()
      cleanup_ets_tables()
    end)

  # Blocks the caller until the task replies or 30 seconds elapse.
  # The task is linked to the caller, so a crash in the task also
  # crashes the caller unless the caller traps exits.
  Task.await(task, 30_000)
end
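When the caller should survive a failing cleanup task, Task.Supervisor.async_nolink/2 combined with Task.yield/2 and Task.shutdown/1 is one option. The following is a minimal sketch under stated assumptions: SketchTaskSupervisor stands in for Foundation.TaskSupervisor, and the CleanupSketch module and its return tuples are hypothetical names chosen for illustration.

```elixir
# Sketch only: SketchTaskSupervisor stands in for Foundation.TaskSupervisor.
{:ok, _sup} = Task.Supervisor.start_link(name: SketchTaskSupervisor)

defmodule CleanupSketch do
  # Runs `fun` under the task supervisor without linking it to the caller.
  # Returns {:ok, result}, {:error, {:task_crashed, reason}}, or {:error, :timeout}.
  def run(fun, timeout \\ 5_000) when is_function(fun, 0) do
    task = Task.Supervisor.async_nolink(SketchTaskSupervisor, fun)

    # Task.yield/2 returns nil on timeout; Task.shutdown/1 then stops the task
    # and returns a reply if one arrived in the meantime.
    case Task.yield(task, timeout) || Task.shutdown(task) do
      {:ok, result} -> {:ok, result}
      {:exit, reason} -> {:error, {:task_crashed, reason}}
      nil -> {:error, :timeout}
    end
  end
end
```

A call such as CleanupSketch.run(fn -> cleanup_ets_tables() end) would then report a crash or timeout as an error tuple instead of propagating an exit to the caller.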
Pattern 3: Convert spawn_link to Proper Process Supervision
❌ Anti-Pattern: spawn_link without supervision
# lib/mabeam/coordination.ex:912
def start_coordination_process(protocol, params) do
  spawn_link(fn ->
    coordination_loop(protocol, params)
  end)
end
✅ Correct Pattern: DynamicSupervisor with proper child specs
# lib/mabeam/coordination_supervisor.ex
defmodule MABEAM.CoordinationSupervisor do
  use DynamicSupervisor

  def start_link(opts) do
    DynamicSupervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    DynamicSupervisor.init(strategy: :one_for_one)
  end

  def start_coordination_process(protocol, params) do
    child_spec = {MABEAM.CoordinationWorker, [protocol: protocol, params: params]}
    DynamicSupervisor.start_child(__MODULE__, child_spec)
  end

  def stop_coordination_process(pid) when is_pid(pid) do
    DynamicSupervisor.terminate_child(__MODULE__, pid)
  end
end

# lib/mabeam/coordination_worker.ex
defmodule MABEAM.CoordinationWorker do
  use GenServer

  def start_link(opts) do
    protocol = Keyword.fetch!(opts, :protocol)
    params = Keyword.fetch!(opts, :params)
    GenServer.start_link(__MODULE__, %{protocol: protocol, params: params})
  end

  @impl true
  def init(state) do
    # Trigger the first coordination round; without this message
    # handle_info(:coordinate, ...) would never run.
    send(self(), :coordinate)
    {:ok, state}
  end

  @impl true
  def handle_info(:coordinate, state) do
    # Coordination logic here. coordination_loop/2 must be available in this
    # module and should return rather than loop forever, so the GenServer
    # can keep handling messages.
    coordination_loop(state.protocol, state.params)
    {:noreply, state}
  end
end

# Add to MABEAM.Application children
children = [
  # ... existing children ...
  MABEAM.CoordinationSupervisor
]
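Workers started through `use GenServer` default to `restart: :permanent`, which means a coordination worker that finishes normally would be started again. If coordination processes are expected to complete, `restart: :transient` may fit better. The sketch below illustrates the difference; TransientWorkerSketch and its `:finish` and `:crash` messages are hypothetical and exist only to trigger the two exit paths.

```elixir
defmodule TransientWorkerSketch do
  # restart: :transient -> the supervisor restarts this worker only after
  # an abnormal exit, not after :normal or :shutdown.
  use GenServer, restart: :transient

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts), do: {:ok, opts}

  @impl true
  # Normal completion: the worker is removed and not restarted.
  def handle_cast(:finish, state), do: {:stop, :normal, state}
  # Abnormal exit: the supervisor starts a replacement.
  def handle_cast(:crash, state), do: {:stop, :boom, state}
end
```

Started under a DynamicSupervisor with `DynamicSupervisor.start_child(sup, {TransientWorkerSketch, []})`, a worker that receives `:crash` is replaced by a new process, while one that receives `:finish` simply disappears from `DynamicSupervisor.which_children/1`.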
Pattern 4: Agent Process Supervision Fix
❌ Current Problem: DynamicSupervisor + GenServer callback mixing
# lib/mabeam/agent_supervisor.ex (current - problematic)
defmodule MABEAM.AgentSupervisor do
  use DynamicSupervisor

  @impl true # This causes warnings!
  def handle_cast({:start_agent, agent_spec}, state) do
    # Wrong: DynamicSupervisor doesn't support GenServer callbacks
  end
end
✅ Correct Pattern: Separate concerns
# lib/mabeam/agent_supervisor.ex (fixed)
defmodule MABEAM.AgentSupervisor do
  use DynamicSupervisor

  def start_link(opts) do
    DynamicSupervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    DynamicSupervisor.init(strategy: :one_for_one)
  end

  # Direct function calls - NO GenServer callbacks
  def start_agent(agent_module, init_args, agent_id) do
    # agent_module.child_spec/1 receives [init_args, [name: agent_id]] as its single argument
    child_spec = {agent_module, [init_args, [name: agent_id]]}
    DynamicSupervisor.start_child(__MODULE__, child_spec)
  end

  def stop_agent(agent_pid) when is_pid(agent_pid) do
    # CORRECT: Use PID directly with DynamicSupervisor
    DynamicSupervisor.terminate_child(__MODULE__, agent_pid)
  end

  def get_running_agents do
    DynamicSupervisor.which_children(__MODULE__)
  end
end

# lib/mabeam/agent_registry.ex (handles metadata and coordination)
defmodule MABEAM.AgentRegistry do
  use GenServer

  # Agent lifecycle coordination - separate from process management
  def register_agent(agent_id, config) do
    GenServer.call(__MODULE__, {:register, agent_id, config})
  end

  def start_agent(agent_id) do
    with {:ok, config} <- get_agent_config(agent_id),
         {:ok, pid} <-
           MABEAM.AgentSupervisor.start_agent(config.module, config.init_args, agent_id),
         :ok <- GenServer.call(__MODULE__, {:agent_started, agent_id, pid}) do
      # The {:agent_started, ...} handler must call Process.monitor(pid) itself.
      # Monitoring here would deliver :DOWN to the calling process, not to the
      # registry, and the handle_info/2 clause below would never run.
      {:ok, pid}
    end
  end

  def stop_agent(agent_id) do
    with {:ok, pid} <- get_agent_pid(agent_id),
         :ok <- MABEAM.AgentSupervisor.stop_agent(pid),
         :ok <- GenServer.call(__MODULE__, {:agent_stopped, agent_id}) do
      :ok
    end
  end

  # GenServer callbacks handle state management
  @impl true
  def handle_info({:DOWN, _ref, :process, pid, _reason}, state) do
    # Update agent status on crash
    agent_id = find_agent_by_pid(state, pid)
    new_state = update_agent_status(state, agent_id, :crashed)
    {:noreply, new_state}
  end
end
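Process.monitor/1 delivers the :DOWN message to whichever process called it, so for the registry's handle_info/2 clause to fire, the monitor has to be created inside a registry callback. A minimal, self-contained sketch of that arrangement follows; RegistrySketch, its state shape, and the `:running`, `:stopped`, and `:crashed` statuses are assumptions for illustration, not the actual MABEAM.AgentRegistry implementation.

```elixir
defmodule RegistrySketch do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts)
  def track(registry, id, pid), do: GenServer.call(registry, {:agent_started, id, pid})
  def status(registry, id), do: GenServer.call(registry, {:status, id})

  @impl true
  def init(_opts), do: {:ok, %{agents: %{}, refs: %{}}}

  @impl true
  def handle_call({:agent_started, id, pid}, _from, state) do
    # Monitor inside the callback so :DOWN is delivered to this process.
    ref = Process.monitor(pid)
    agents = Map.put(state.agents, id, :running)
    refs = Map.put(state.refs, ref, id)
    {:reply, :ok, %{state | agents: agents, refs: refs}}
  end

  def handle_call({:status, id}, _from, state) do
    {:reply, Map.fetch(state.agents, id), state}
  end

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, reason}, state) do
    # Look up which agent the monitor reference belongs to and record its fate.
    {id, refs} = Map.pop(state.refs, ref)
    status = if reason in [:normal, :shutdown], do: :stopped, else: :crashed
    {:noreply, %{state | agents: Map.put(state.agents, id, status), refs: refs}}
  end
end
```

Keying the lookup on the monitor reference rather than the pid avoids a scan over all agents and stays correct if a pid is ever reused.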
Detailed Implementation Plan
Phase 1: Critical Process Supervision (Week 1)
Day 1-2: Foundation Monitoring Processes
Fix lib/foundation/application.ex:505,510,891,896
# Create Foundation.HealthMonitor
defmodule Foundation.HealthMonitor do
  use GenServer

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(opts) do
    # Same option key as in Pattern 1
    interval = Keyword.get(opts, :health_check_interval, 30_000)
    schedule_health_check(interval)
    {:ok, %{interval: interval}}
  end

  @impl true
  def handle_info(:health_check, state) do
    # perform_system_health_check/0 is assumed to be defined in this module
    perform_system_health_check()
    schedule_health_check(state.interval)
    {:noreply, state}
  end

  defp schedule_health_check(interval) do
    Process.send_after(self(), :health_check, interval)
  end
end

# Create Foundation.ServiceMonitor
defmodule Foundation.ServiceMonitor do
  use GenServer

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(opts) do
    interval = Keyword.get(opts, :monitor_interval, 10_000)
    schedule_service_monitoring(interval)
    {:ok, %{interval: interval, services: %{}}}
  end

  @impl true
  def handle_info(:monitor_services, state) do
    # monitor_all_services/0 is assumed to be defined in this module
    monitor_all_services()
    schedule_service_monitoring(state.interval)
    {:noreply, state}
  end

  defp schedule_service_monitoring(interval) do
    Process.send_after(self(), :monitor_services, interval)
  end
end

# Update Foundation.Application children
children = [
  Foundation.ProcessRegistry,
  Foundation.ServiceRegistry,
  Foundation.ConfigServer,
  Foundation.EventStore,
  Foundation.TelemetryService,
  Foundation.HealthMonitor, # NEW
  Foundation.ServiceMonitor, # NEW
  Foundation.TaskSupervisor,
  Foundation.RateLimiter
]
Day 3: MABEAM Coordination Process
Fix lib/mabeam/coordination.ex:912
# Create MABEAM.CoordinationSupervisor
defmodule MABEAM.CoordinationSupervisor do
  use DynamicSupervisor

  def start_link(opts) do
    DynamicSupervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    DynamicSupervisor.init(strategy: :one_for_one)
  end

  def start_coordination_process(protocol, params) do
    child_spec = {MABEAM.CoordinationWorker, [protocol: protocol, params: params]}
    DynamicSupervisor.start_child(__MODULE__, child_spec)
  end
end

# Update MABEAM.Application children
children = [
  MABEAM.Core,
  MABEAM.AgentRegistry,
  MABEAM.AgentSupervisor,
  MABEAM.Coordination,
  MABEAM.CoordinationSupervisor, # NEW
  MABEAM.LoadBalancer,
  MABEAM.PerformanceMonitor
]
Day 4-5: Task.start Migration
Fix lib/foundation/beam/processes.ex:229
# Replace Task.start with supervised alternatives
def cleanup_memory do
  Task.Supervisor.start_child(Foundation.TaskSupervisor, fn ->
    :erlang.garbage_collect()
    cleanup_ets_tables()
    compact_memory()
  end)
end

def cleanup_memory_sync do
  Foundation.TaskSupervisor
  |> Task.Supervisor.async(fn ->
    :erlang.garbage_collect()
    cleanup_ets_tables()
    compact_memory()
  end)
  |> Task.await(30_000)
end
Phase 2: Coordination Primitives (Week 2)
Fix lib/foundation/coordination/primitives.ex (7 instances)
# Replace all spawn calls with supervised tasks
defmodule Foundation.Coordination.Primitives do
  # Lines 650, 678, 687, 737, 743, 788, 794
  def start_distributed_lock(lock_id, timeout \\ 30_000) do
    Task.Supervisor.start_child(Foundation.TaskSupervisor, fn ->
      acquire_distributed_lock(lock_id, timeout)
    end)
  end

  def start_leader_election(election_id, candidates) do
    Task.Supervisor.start_child(Foundation.TaskSupervisor, fn ->
      run_leader_election(election_id, candidates)
    end)
  end

  def start_barrier_coordination(barrier_id, participant_count) do
    Task.Supervisor.start_child(Foundation.TaskSupervisor, fn ->
      coordinate_barrier(barrier_id, participant_count)
    end)
  end
end
Phase 3: Agent Supervision Architecture (Week 3)
Fix MABEAM Agent System
# lib/mabeam/agent_supervisor.ex - Remove GenServer callbacks
defmodule MABEAM.AgentSupervisor do
  use DynamicSupervisor

  # ONLY DynamicSupervisor callbacks - NO GenServer callbacks
  @impl true
  def init(_opts) do
    DynamicSupervisor.init(strategy: :one_for_one)
  end

  # Direct function calls for agent management
  def start_agent(agent_module, init_args, agent_id) do
    child_spec = {agent_module, [init_args, [name: agent_id]]}
    DynamicSupervisor.start_child(__MODULE__, child_spec)
  end

  def stop_agent(agent_pid) when is_pid(agent_pid) do
    # CORRECT: Use PID directly with DynamicSupervisor
    DynamicSupervisor.terminate_child(__MODULE__, agent_pid)
  end
end

# lib/mabeam/agent_registry.ex - Pure GenServer for state management
defmodule MABEAM.AgentRegistry do
  use GenServer

  def start_agent(agent_id) do
    with {:ok, config} <- get_agent_config(agent_id),
         {:ok, pid} <-
           MABEAM.AgentSupervisor.start_agent(config.module, config.init_args, agent_id),
         :ok <- GenServer.call(__MODULE__, {:agent_started, agent_id, pid}) do
      # The {:agent_started, ...} handler must call Process.monitor(pid) itself.
      # Monitoring here would deliver :DOWN to the calling process, not to the
      # registry, and the handle_info/2 clause below would never run.
      {:ok, pid}
    end
  end

  def stop_agent(agent_id) do
    with {:ok, pid} <- get_agent_pid(agent_id),
         :ok <- MABEAM.AgentSupervisor.stop_agent(pid),
         :ok <- GenServer.call(__MODULE__, {:agent_stopped, agent_id}) do
      :ok
    end
  end

  # GenServer callbacks handle state management
  @impl true
  def handle_info({:DOWN, _ref, :process, pid, _reason}, state) do
    # Update agent status on crash
    agent_id = find_agent_by_pid(state, pid)
    new_state = update_agent_status(state, agent_id, :crashed)
    {:noreply, new_state}
  end
end
Testing Supervision Implementation
Supervision Test Patterns
Test Process Lifecycle
defmodule Foundation.HealthMonitorTest do
  use ExUnit.Case

  test "health monitor starts and stops cleanly" do
    {:ok, pid} = Foundation.HealthMonitor.start_link([])
    assert Process.alive?(pid)
    GenServer.stop(pid)
    refute Process.alive?(pid)
  end

  test "health monitor automatically restarts on crash" do
    {:ok, supervisor_pid} =
      Supervisor.start_link([Foundation.HealthMonitor], strategy: :one_for_one)

    [{_, monitor_pid, _, _}] = Supervisor.which_children(supervisor_pid)
    assert Process.alive?(monitor_pid)

    # Crash the monitor
    Process.exit(monitor_pid, :kill)

    # Supervisor should restart it
    :timer.sleep(100)
    [{_, new_monitor_pid, _, _}] = Supervisor.which_children(supervisor_pid)
    assert Process.alive?(new_monitor_pid)
    assert new_monitor_pid != monitor_pid
  end
end
Test Agent Supervision
defmodule MABEAM.AgentLifecycleTest do
  use ExUnit.Case

  test "agent lifecycle with proper supervision" do
    # Start agent through registry
    {:ok, agent_pid} = MABEAM.AgentRegistry.start_agent(:test_agent)
    assert Process.alive?(agent_pid)

    # Stop agent gracefully
    :ok = MABEAM.AgentRegistry.stop_agent(:test_agent)
    refute Process.alive?(agent_pid)

    # Verify cleanup
    {:error, :not_found} = MABEAM.AgentRegistry.get_agent_pid(:test_agent)
  end

  test "agent crash detection and status update" do
    {:ok, agent_pid} = MABEAM.AgentRegistry.start_agent(:test_agent)

    # Simulate crash
    Process.exit(agent_pid, :kill)

    # Verify status updated
    :timer.sleep(100) # Allow monitoring to process
    {:ok, status} = MABEAM.AgentRegistry.get_agent_status(:test_agent)
    assert status.current_status == :crashed
  end
end
Success Criteria
Technical Verification
- Zero spawn, spawn_link, or Task.start calls in lib/ directory
- All long-running processes under OTP supervision
- Clean compilation with no callback warnings
- All tests pass with proper process cleanup
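The first criterion can be checked mechanically. The following is a rough sketch, not an existing project tool: UnsupervisedCallScan is a hypothetical helper that flags lines calling spawn, spawn_link, spawn_monitor, Task.start, or Task.start_link. It is line-based and regex-driven, so calls written without parentheses are missed and a "#" inside a string literal cuts the line short.

```elixir
defmodule UnsupervisedCallScan do
  # Hypothetical helper, not part of Foundation or MABEAM.

  # Returns [{line_number, trimmed_line}] for one source string.
  def scan_source(source) do
    source
    |> String.split("\n")
    |> Enum.with_index(1)
    |> Enum.filter(fn {line, _no} -> Regex.match?(pattern(), strip_comment(line)) end)
    |> Enum.map(fn {line, no} -> {no, String.trim(line)} end)
  end

  # Returns [{path, line_number, trimmed_line}] for every .ex file under `dir`.
  def scan_dir(dir \\ "lib") do
    for path <- Path.wildcard(Path.join(dir, "**/*.ex")),
        {no, line} <- scan_source(File.read!(path)) do
      {path, no, line}
    end
  end

  # Matches spawn(, spawn_link(, spawn_monitor(, Task.start( and Task.start_link(
  # when not preceded by a word character. Task.Supervisor.* calls do not match.
  defp pattern, do: ~r/(?<!\w)(spawn(_link|_monitor)?|Task\.start(_link)?)\(/

  # Drops everything after the first "#" so commented-out calls are ignored.
  defp strip_comment(line), do: line |> String.split("#", parts: 2) |> hd()
end
```

Running `UnsupervisedCallScan.scan_dir("lib")` from the project root and asserting that the result is empty would turn the criterion into a test.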
Functional Verification
- Health monitoring continues after process crashes
- Agent coordination survives individual agent failures
- Memory cleanup tasks properly supervised
- Service monitoring automatically restarts on failure
Reliability Verification
- System survives stress testing with process crashes
- No orphaned processes after test runs
- Graceful shutdown of all supervised processes
- Automatic recovery from coordination failures
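One crude way to look for orphaned processes is to compare the process list before and after a piece of code runs. The sketch below assumes a hypothetical OrphanCheck helper; unrelated processes started concurrently by the VM or other tests can show up as false positives, so it is best treated as a diagnostic aid rather than a strict assertion.

```elixir
defmodule OrphanCheck do
  # Hypothetical test helper: runs `fun` and reports processes that were
  # created during the call and are still alive afterwards.
  # Returns {result_of_fun, leaked_pids}.
  def leaked_by(fun) when is_function(fun, 0) do
    before = MapSet.new(Process.list())
    result = fun.()

    leaked =
      Process.list()
      |> Enum.reject(&MapSet.member?(before, &1))
      |> Enum.filter(&Process.alive?/1)

    {result, leaked}
  end
end
```

Wrapping a test body in `OrphanCheck.leaked_by/1` and inspecting the second element of the result shows which pids outlived the call.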
Anti-Patterns to Avoid
❌ Don’t: Mix Supervisor Types
defmodule BadSupervisor do
  use DynamicSupervisor

  # Wrong: DynamicSupervisor doesn't support GenServer callbacks
  def handle_call(:get_children, _from, state) do
    # This will cause warnings and unexpected behavior
  end
end
❌ Don’t: Use spawn for Long-Running Processes
# Wrong: Unsupervised long-running process
def start_monitor do
  spawn(fn ->
    loop() # Infinite loop without supervision
  end)
end
❌ Don’t: Ignore Process Exit Reasons
# Wrong: No handling of different exit reasons
def handle_info({:DOWN, _ref, :process, _pid, _reason}, state) do
  # Should handle different reasons differently
  {:noreply, state}
end
✅ Do: Handle Exit Reasons Properly
def handle_info({:DOWN, _ref, :process, pid, reason}, state) do
  case reason do
    :normal -> handle_normal_exit(pid, state)
    :shutdown -> handle_shutdown(pid, state)
    {:shutdown, _} -> handle_shutdown(pid, state)
    _ -> handle_crash(pid, reason, state)
  end
end
Conclusion
This implementation guide provides concrete patterns for converting all unsupervised processes to proper OTP supervision. The key principles are:
- Use appropriate OTP behaviors for different process types
- Separate concerns between process lifecycle and state management
- Monitor processes for crash detection and status updates
- Test supervision behavior to ensure fault tolerance
- Follow OTP conventions for restart strategies and child specs
Following these patterns puts every long-running process under a supervisor, so failures are restarted and surfaced instead of silently leaving the system degraded.