CLAUDE CODE - OTP Hardening Implementation Plan
Executive Summary
Complete implementation of Stage 3 OTP hardening from AUDIT_02_planSteps_03.md - eliminating all OTP violations through supervised sends, monitor management, and timeout enforcement.
Phase Implementation Plan
PHASE 1: Supervised Send Infrastructure (Critical Priority)
Time: 2 hours
Step 1.1: Create SupervisedSend Module
- Implement
lib/foundation/supervised_send.ex
- Features: delivery monitoring, retry with backoff, circuit breaker integration
- Telemetry events for observability
- Dead letter queue for failed messages
Step 1.2: Create Dead Letter Queue Handler
- Implement
lib/foundation/dead_letter_queue.ex
- Automatic retry with exponential backoff
- Message persistence and observability
- Integration with SupervisedSend
Step 1.3: Integration Tests
- Comprehensive test suite for SupervisedSend
- Dead letter queue behavior verification
- Performance benchmarks vs raw send()
PHASE 2: Critical Send Migration (Critical Priority)
Time: 3 hours
Step 2.1: Coordinator Agent Migration
- Fix 5 raw sends to agent processes in
coordinator_agent.ex
- Add delivery guarantees and error handling
- Preserve self-sends (safe operations)
Step 2.2: Signal Router Migration
- Convert signal delivery to GenServer.cast pattern
- Add backpressure handling and monitoring
- Fallback to supervised send for non-GenServer handlers
Step 2.3: Coordination Patterns Migration
- Fix broadcast operations with delivery tracking
- Implement consensus result distribution
- Add partial failure handling
Step 2.4: Scheduler Manager Migration
- Supervised scheduled task execution
- Retry mechanism for failed deliveries
- Dead letter integration for permanent failures
PHASE 3: Monitor Management System (High Priority)
Time: 2 hours
Step 3.1: MonitorManager Implementation
- Centralized monitor tracking with automatic cleanup
- Caller death detection and orphan cleanup
- Leak detection and reporting
- Telemetry integration
Step 3.2: Monitor Migration Helper
- Automated migration tools for existing monitor usage
- GenServer integration patterns
- Cleanup in terminate callbacks
Step 3.3: Monitor Migration Execution
- Convert all Process.monitor calls to MonitorManager
- Add proper cleanup in process termination
- Verify no monitor leaks remain
PHASE 4: Timeout Enforcement (Medium Priority)
Time: 2 hours
Step 4.1: Timeout Configuration System
- Centralized timeout management
- Service-specific and operation-specific timeouts
- Environment factor support
Step 4.2: GenServer Call Migration
- Automated script to add timeouts to all GenServer.call/2
- Import timeout configuration helpers
- Verify all calls have explicit timeouts
PHASE 5: Testing and Validation (High Priority)
Time: 1 hour
Step 5.1: Integration Test Suite
- End-to-end OTP compliance verification
- Performance regression testing
- Load testing under high concurrency
Step 5.2: Final Verification
- Zero raw sends to other processes
- Zero monitor leaks
- All GenServer calls have timeouts
- All tests passing with zero warnings
Success Criteria
✅ Zero OTP Violations: No raw sends, monitor leaks, or infinite timeouts
✅ Production Ready: Comprehensive error handling and observability
✅ Performance Maintained: No significant regression from hardening
✅ Test Coverage: 100% test coverage on new infrastructure
✅ Clean Code: Zero Credo violations or compiler warnings
Implementation Commitment
This is a continuous implementation session that will complete all 5 phases without stopping. Each phase will be fully implemented, tested, and committed before proceeding to the next.
Total Estimated Time: 10 hours Quality Gates: All tests must pass before each commit Documentation: CLAUDE_CODE_worklog.md updated after each major step