OTP Design Review for Pipeline System
Executive Summary
This document presents a comprehensive review of the Pipeline system’s OTP (Open Telecom Platform) design, identifying critical issues with unsupervised processes, race conditions, and concurrent programming patterns that could lead to system instability, resource leaks, and data loss.
Critical Issues Identified
1. Unsupervised Processes
1.1 Performance Monitoring GenServer
- Location:
lib/pipeline/monitoring/performance.ex:47
- Issue: GenServer started without supervision
- Risk: Process crashes won’t be recovered, monitoring data lost
- Impact: High - Loss of observability
1.2 Tool Registry Agent
- Location:
lib/pipeline/tools/tool_registry.ex:95
- Issue: Agent with registered name
:tool_registry
not supervised - Risk: Registry becomes unavailable on crash, name collision prevents restart
- Impact: Critical - Tool execution failures
1.3 Unsupervised Task.async Usage
Multiple locations use Task.async
without supervision:
lib/pipeline/step/parallel_claude.ex:22
lib/pipeline/step/loop.ex:192
lib/pipeline/step/claude_batch.ex:225
lib/pipeline/providers/enhanced_claude_provider.ex:120
Risks:
- Orphaned processes on parent crash
- No error isolation
- Resource leaks
- Uncontrolled concurrency
2. Race Conditions
2.1 Process Dictionary Race
- Location:
lib/pipeline/session_manager.ex:292-310
- Issue: Process dictionary used for session storage without synchronization
- Risk: Lost session data in concurrent operations
2.2 ETS Table Creation Race
- Location:
lib/pipeline/prompt_builder.ex:37-51
- Issue: Check-then-create pattern for ETS table
- Risk: While handled with rescue, pattern is problematic
2.3 File System Race
- Location:
lib/pipeline/checkpoint_manager.ex:50-54
- Issue: Delete-then-write pattern for checkpoint files
- Risk: Data loss or corruption during concurrent checkpointing
2.4 GenServer Registration Race
- Location:
lib/pipeline/monitoring/performance.ex:55-76
- Issue: Check existence then perform operations
- Risk:
:noproc
errors if process dies between check and call
3. Dangerous Patterns
3.1 Infinite Timeouts
- Location:
lib/pipeline/step/parallel_claude.ex:40
- Pattern:
Task.await_many(async_tasks, :infinity)
- Risk: Can hang indefinitely, blocking entire pipeline
3.2 Silent Error Handling
- Location:
lib/pipeline/safety/resource_monitor.ex:294-300
- Pattern: Returning 0 on memory check errors
- Risk: Bypasses safety checks, masks critical issues
4. Missing Supervision Tree Components
The application supervision tree only includes a Registry, missing:
- Performance monitoring processes
- Tool registry
- Task supervisor for concurrent operations
- Session management processes
- Safety monitoring components
Recommendations
Immediate Actions (Critical)
- Add Supervision Tree
defmodule Pipeline.Application do
use Application
def start(_type, _args) do
children = [
{Registry, keys: :unique, name: Pipeline.MonitoringRegistry},
{Task.Supervisor, name: Pipeline.TaskSupervisor},
Pipeline.Tools.ToolRegistry,
Pipeline.Monitoring.PerformanceSupervisor,
Pipeline.Session.Supervisor
]
opts = [strategy: :one_for_one, name: Pipeline.Supervisor]
Supervisor.start_link(children, opts)
end
end
- Replace Task.async with Task.Supervisor
# Instead of:
Task.async(fn -> ... end)
# Use:
Task.Supervisor.async(Pipeline.TaskSupervisor, fn -> ... end)
- Fix Race Conditions
- Replace Process dictionary with ETS or GenServer for session storage
- Use atomic file operations or file locking for checkpoints
- Implement proper try-rescue patterns for GenServer calls
Short-term Improvements
Add Circuit Breakers
- Implement circuit breakers for external API calls
- Prevent cascade failures from provider issues
Resource Limits
- Add configurable timeouts (never use
:infinity
) - Implement max concurrent task limits
- Add memory usage caps
- Add configurable timeouts (never use
Error Handling
- Never silently swallow errors
- Log all error conditions
- Implement proper error propagation
Long-term Architecture
Process Architecture
- One supervisor per major component
- Proper restart strategies (
:permanent
,:temporary
,:transient
) - Dynamic supervisors for runtime processes
State Management
- Centralized state management via GenServers
- Consistent use of Registry for process discovery
- Event sourcing for critical state changes
Monitoring & Observability
- Telemetry integration
- Health checks for all processes
- Metrics collection and alerting
Risk Assessment
High Risk Issues
- Tool Registry crash → Complete tool execution failure
- Infinite timeouts → Pipeline hangs
- Unsupervised tasks → Resource leaks
Medium Risk Issues
- Performance monitor crashes → Loss of metrics
- Session data races → Inconsistent state
- File system races → Checkpoint corruption
Low Risk Issues
- ETS table creation race (handled)
- Directory creation races (mkdir_p is safe)
Implementation Priority
- Week 1: Add supervision tree, fix Tool Registry
- Week 2: Replace Task.async with supervised tasks
- Week 3: Fix race conditions in session/checkpoint management
- Week 4: Add circuit breakers and resource limits
- Month 2: Implement comprehensive monitoring and health checks
Conclusion
The Pipeline system has several critical OTP design flaws that need immediate attention. The lack of supervision for key processes and the presence of multiple race conditions pose significant risks to system stability and data integrity. Implementing the recommended changes will greatly improve the system’s reliability, fault tolerance, and maintainability.