← Back to Docs

Otp design review

Documentation for otp_design_review from the Pipeline ex repository.

OTP Design Review for Pipeline System

Executive Summary

This document presents a comprehensive review of the Pipeline system’s OTP (Open Telecom Platform) design, identifying critical issues with unsupervised processes, race conditions, and concurrent programming patterns that could lead to system instability, resource leaks, and data loss.

Critical Issues Identified

1. Unsupervised Processes

1.1 Performance Monitoring GenServer

  • Location: lib/pipeline/monitoring/performance.ex:47
  • Issue: GenServer started without supervision
  • Risk: Process crashes won’t be recovered, monitoring data lost
  • Impact: High - Loss of observability

1.2 Tool Registry Agent

  • Location: lib/pipeline/tools/tool_registry.ex:95
  • Issue: Agent with registered name :tool_registry not supervised
  • Risk: Registry becomes unavailable on crash, name collision prevents restart
  • Impact: Critical - Tool execution failures

1.3 Unsupervised Task.async Usage

Multiple locations use Task.async without supervision:

  • lib/pipeline/step/parallel_claude.ex:22
  • lib/pipeline/step/loop.ex:192
  • lib/pipeline/step/claude_batch.ex:225
  • lib/pipeline/providers/enhanced_claude_provider.ex:120

Risks:

  • Orphaned processes on parent crash
  • No error isolation
  • Resource leaks
  • Uncontrolled concurrency

2. Race Conditions

2.1 Process Dictionary Race

  • Location: lib/pipeline/session_manager.ex:292-310
  • Issue: Process dictionary used for session storage without synchronization
  • Risk: Lost session data in concurrent operations

2.2 ETS Table Creation Race

  • Location: lib/pipeline/prompt_builder.ex:37-51
  • Issue: Check-then-create pattern for ETS table
  • Risk: While handled with rescue, pattern is problematic

2.3 File System Race

  • Location: lib/pipeline/checkpoint_manager.ex:50-54
  • Issue: Delete-then-write pattern for checkpoint files
  • Risk: Data loss or corruption during concurrent checkpointing

2.4 GenServer Registration Race

  • Location: lib/pipeline/monitoring/performance.ex:55-76
  • Issue: Check existence then perform operations
  • Risk: :noproc errors if process dies between check and call

3. Dangerous Patterns

3.1 Infinite Timeouts

  • Location: lib/pipeline/step/parallel_claude.ex:40
  • Pattern: Task.await_many(async_tasks, :infinity)
  • Risk: Can hang indefinitely, blocking entire pipeline

3.2 Silent Error Handling

  • Location: lib/pipeline/safety/resource_monitor.ex:294-300
  • Pattern: Returning 0 on memory check errors
  • Risk: Bypasses safety checks, masks critical issues

4. Missing Supervision Tree Components

The application supervision tree only includes a Registry, missing:

  • Performance monitoring processes
  • Tool registry
  • Task supervisor for concurrent operations
  • Session management processes
  • Safety monitoring components

Recommendations

Immediate Actions (Critical)

  1. Add Supervision Tree
defmodule Pipeline.Application do
  use Application

  def start(_type, _args) do
    children = [
      {Registry, keys: :unique, name: Pipeline.MonitoringRegistry},
      {Task.Supervisor, name: Pipeline.TaskSupervisor},
      Pipeline.Tools.ToolRegistry,
      Pipeline.Monitoring.PerformanceSupervisor,
      Pipeline.Session.Supervisor
    ]

    opts = [strategy: :one_for_one, name: Pipeline.Supervisor]
    Supervisor.start_link(children, opts)
  end
end
  1. Replace Task.async with Task.Supervisor
# Instead of:
Task.async(fn -> ... end)

# Use:
Task.Supervisor.async(Pipeline.TaskSupervisor, fn -> ... end)
  1. Fix Race Conditions
  • Replace Process dictionary with ETS or GenServer for session storage
  • Use atomic file operations or file locking for checkpoints
  • Implement proper try-rescue patterns for GenServer calls

Short-term Improvements

  1. Add Circuit Breakers

    • Implement circuit breakers for external API calls
    • Prevent cascade failures from provider issues
  2. Resource Limits

    • Add configurable timeouts (never use :infinity)
    • Implement max concurrent task limits
    • Add memory usage caps
  3. Error Handling

    • Never silently swallow errors
    • Log all error conditions
    • Implement proper error propagation

Long-term Architecture

  1. Process Architecture

    • One supervisor per major component
    • Proper restart strategies (:permanent, :temporary, :transient)
    • Dynamic supervisors for runtime processes
  2. State Management

    • Centralized state management via GenServers
    • Consistent use of Registry for process discovery
    • Event sourcing for critical state changes
  3. Monitoring & Observability

    • Telemetry integration
    • Health checks for all processes
    • Metrics collection and alerting

Risk Assessment

High Risk Issues

  1. Tool Registry crash → Complete tool execution failure
  2. Infinite timeouts → Pipeline hangs
  3. Unsupervised tasks → Resource leaks

Medium Risk Issues

  1. Performance monitor crashes → Loss of metrics
  2. Session data races → Inconsistent state
  3. File system races → Checkpoint corruption

Low Risk Issues

  1. ETS table creation race (handled)
  2. Directory creation races (mkdir_p is safe)

Implementation Priority

  1. Week 1: Add supervision tree, fix Tool Registry
  2. Week 2: Replace Task.async with supervised tasks
  3. Week 3: Fix race conditions in session/checkpoint management
  4. Week 4: Add circuit breakers and resource limits
  5. Month 2: Implement comprehensive monitoring and health checks

Conclusion

The Pipeline system has several critical OTP design flaws that need immediate attention. The lack of supervision for key processes and the presence of multiple race conditions pose significant risks to system stability and data integrity. Implementing the recommended changes will greatly improve the system’s reliability, fault tolerance, and maintainability.