Phase 2: Worker Lifecycle Management - Implementation Checklist
Overview
Phase 2 focuses on implementing proper worker lifecycle management, including state machines, health monitoring, and graceful shutdown mechanisms.
New Files to Create
lib/dspex/python_bridge/pool_worker_state.ex
- Purpose: Define worker state machine with proper state transitions
- States: :initializing, :ready, :busy, :draining, :terminated
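The five states above could be paired with an explicit transition table. The following is a minimal sketch only: the module name `WorkerStateSketch` and the set of allowed transitions are assumptions made for illustration, since the checklist names the states but not the edges between them.

```elixir
defmodule WorkerStateSketch do
  @moduledoc "Hypothetical sketch; the transition table below is an assumption."

  # States taken from the checklist.
  @states [:initializing, :ready, :busy, :draining, :terminated]

  # Assumed edges: each state maps to the states it is allowed to move to.
  @transitions %{
    initializing: [:ready, :terminated],
    ready: [:busy, :draining, :terminated],
    busy: [:ready, :draining, :terminated],
    draining: [:terminated],
    terminated: []
  }

  def states, do: @states

  @doc "Returns {:ok, to} for an allowed transition, {:error, reason} otherwise."
  def transition(from, to) when from in @states do
    if to in Map.fetch!(@transitions, from) do
      {:ok, to}
    else
      {:error, {:invalid_transition, from, to}}
    end
  end

  def transition(from, _to), do: {:error, {:unknown_state, from}}
end
```

Keeping the table as plain data makes the "valid" and "invalid" transition tests straightforward to enumerate.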
lib/dspex/python_bridge/pool_health_monitor.ex
- Purpose: Monitor worker health and trigger recycling when needed
- Features: Periodic health checks, error tracking, automatic recovery
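One possible shape for the periodic check is a GenServer that reschedules itself with `Process.send_after/3`. This is a sketch under stated assumptions: the module name, the option names (`:check`, `:on_unhealthy`, `:interval`, `:threshold`), and the choice to count consecutive failures are all hypothetical and not taken from the existing code.

```elixir
defmodule HealthMonitorSketch do
  use GenServer

  # All option names are hypothetical, chosen for illustration.
  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  def failures(pid), do: GenServer.call(pid, :failures)

  @impl true
  def init(opts) do
    state = %{
      # 0-arity function returning :ok or {:error, reason}
      check: Keyword.fetch!(opts, :check),
      # 0-arity function invoked when the threshold is reached, e.g. to recycle
      on_unhealthy: Keyword.fetch!(opts, :on_unhealthy),
      interval: Keyword.get(opts, :interval, 30_000),
      threshold: Keyword.get(opts, :threshold, 5),
      failures: 0
    }

    schedule(state.interval)
    {:ok, state}
  end

  @impl true
  def handle_info(:check, state) do
    # Assumption: a successful check resets the consecutive-failure count.
    failures =
      case state.check.() do
        :ok -> 0
        {:error, _reason} -> state.failures + 1
      end

    failures =
      if failures >= state.threshold do
        state.on_unhealthy.()
        0
      else
        failures
      end

    schedule(state.interval)
    {:noreply, %{state | failures: failures}}
  end

  @impl true
  def handle_call(:failures, _from, state), do: {:reply, state.failures, state}

  defp schedule(interval), do: Process.send_after(self(), :check, interval)
end
```

Passing the check and the recovery action in as functions keeps the monitor testable without a real Python worker.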
test/pool_worker_lifecycle_test.exs
- Purpose: Comprehensive tests for worker lifecycle transitions
- Coverage: State transitions, health checks, recycling policies
Existing Files to Modify
lib/dspex/python_bridge/pool_worker_v2.ex
- Expected changes:
  - Integrate state machine for proper lifecycle management
  - Add health check callbacks
  - Implement graceful shutdown with session draining
  - Add worker recycling based on age/usage
lib/dspex/python_bridge/session_pool_v2.ex
- Expected changes:
  - Track worker states in pool metadata
  - Implement worker recycling policies
  - Add health monitoring integration
  - Handle worker state transitions
lib/dspex/python_bridge/pool_supervisor_v2.ex
- Expected changes:
  - Add health monitor to supervision tree
  - Configure restart strategies for different failure modes
  - Implement progressive backoff for worker restarts
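Progressive backoff could be computed as a capped exponential delay. The sketch below is illustrative only: the module name, the 1-second base, and the 30-second cap are assumptions. OTP supervisors restart children without a delay, so a value like this would have to be applied by the worker or a wrapper process before it reconnects.

```elixir
defmodule RestartBackoffSketch do
  # Assumed values; neither appears in the checklist.
  @base_ms 1_000
  @max_ms 30_000

  @doc "Delay before restart attempt `attempt` (0-based): doubles each time, capped at @max_ms."
  def delay_ms(attempt) when is_integer(attempt) and attempt >= 0 do
    min(@base_ms * Bitwise.bsl(1, attempt), @max_ms)
  end
end
```

For example, attempts 0, 1, and 2 yield 1_000, 2_000, and 4_000 ms, and every attempt from 5 onward yields the 30_000 ms cap.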
Tests to Write
Worker State Transition Tests
- Test all valid state transitions
- Verify invalid transitions are rejected
- Test concurrent state changes
Health Monitoring Tests
- Worker health check success/failure scenarios
- Automatic recycling triggers
- Health status reporting
Graceful Shutdown Tests
- Session draining during shutdown
- Timeout handling for long-running operations
- Complete resource cleanup (ports and processes released)
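Session draining with a timeout can be sketched as a polling loop against a deadline. The names here are hypothetical: `DrainSketch`, `drain/3`, and the `active_sessions` function (a 0-arity function returning the number of in-flight sessions) are assumptions, and a real implementation might react to checkin messages instead of polling.

```elixir
defmodule DrainSketch do
  @doc """
  Waits until `active_sessions.()` returns 0 or `timeout_ms` elapses.
  Returns :drained or {:timeout, remaining_sessions}.
  """
  def drain(active_sessions, timeout_ms, poll_ms \\ 50) do
    deadline = System.monotonic_time(:millisecond) + timeout_ms
    wait(active_sessions, deadline, poll_ms)
  end

  defp wait(active_sessions, deadline, poll_ms) do
    remaining = active_sessions.()

    cond do
      remaining == 0 ->
        :drained

      System.monotonic_time(:millisecond) >= deadline ->
        # Caller decides whether to force-terminate the remaining sessions.
        {:timeout, remaining}

      true ->
        Process.sleep(poll_ms)
        wait(active_sessions, deadline, poll_ms)
    end
  end
end
```

A test for the timeout case can pass a function that never reaches zero and assert on the `{:timeout, n}` result.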
Worker Recycling Tests
- Age-based recycling
- Usage-based recycling
- Error threshold recycling
- Pool size maintenance during recycling
Dependencies on Phase 1
All Phase 1 fixes are complete and required for Phase 2:
- ✅ NimblePool return values fixed
- ✅ Port validation implemented
- ✅ Test assertions corrected
- ✅ Test guards added
- ✅ Service detection improved
Main Risk Areas
State Synchronization
- Challenge: Keeping worker state consistent between pool and worker process
- Mitigation: Use gen_statem or similar for strict state management
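As a rough illustration of the gen_statem option, the sketch below keeps the worker state inside a single process and rejects events that are not valid in the current state. The module name, the event names (`:init_complete`, `:checkout`, `:checkin`, `:drain`, `:drained`), and the event-to-state mapping are assumptions, not part of the existing code.

```elixir
defmodule WorkerStatemSketch do
  @behaviour :gen_statem

  def start_link, do: :gen_statem.start_link(__MODULE__, [], [])
  def current_state(pid), do: :gen_statem.call(pid, :current_state)
  def request(pid, event), do: :gen_statem.call(pid, {:event, event})

  @impl true
  def callback_mode, do: :handle_event_function

  @impl true
  def init(_args), do: {:ok, :initializing, %{}}

  @impl true
  def handle_event({:call, from}, :current_state, state, _data) do
    {:keep_state_and_data, [{:reply, from, state}]}
  end

  def handle_event({:call, from}, {:event, event}, state, data) do
    case next_state(state, event) do
      {:ok, new_state} ->
        {:next_state, new_state, data, [{:reply, from, {:ok, new_state}}]}

      :error ->
        # Invalid events leave the state untouched and are reported to the caller.
        {:keep_state_and_data, [{:reply, from, {:error, {:invalid_event, state, event}}}]}
    end
  end

  # Assumed event-to-state mapping.
  defp next_state(:initializing, :init_complete), do: {:ok, :ready}
  defp next_state(:ready, :checkout), do: {:ok, :busy}
  defp next_state(:busy, :checkin), do: {:ok, :ready}
  defp next_state(state, :drain) when state in [:ready, :busy], do: {:ok, :draining}
  defp next_state(:draining, :drained), do: {:ok, :terminated}
  defp next_state(_state, _event), do: :error
end
```

Because every event is handled by one process, two callers cannot observe or apply conflicting transitions at the same time.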
Race Conditions
- Challenge: Worker might change state during checkout/checkin
- Mitigation: Serialize state changes (for example, through the process that owns the worker state) so each transition is atomic
Resource Leaks
- Challenge: Ensuring ports and processes are cleaned up properly
- Mitigation: Comprehensive cleanup in terminate callbacks
Performance Impact
- Challenge: Health checks and state tracking add overhead
- Mitigation: Configurable check intervals, efficient state storage
Backward Compatibility
- Challenge: Maintaining compatibility with existing pool users
- Mitigation: Keep external API unchanged, internal refactoring only
Success Criteria
- Workers transition through states correctly
- Unhealthy workers are automatically recycled
- Graceful shutdown completes within timeout
- No resource leaks during normal operation
- Pool maintains target size during recycling
- All existing tests continue to pass
Implementation Order
- Implement worker state machine
- Add health monitoring
- Implement graceful shutdown
- Add worker recycling policies
- Write comprehensive tests
- Optimize performance
Configuration Options to Add
config :dspex, :pool_worker_lifecycle,
  health_check_interval: 30_000,   # 30 seconds
  max_worker_age: 3_600_000,       # 1 hour
  max_worker_requests: 1000,       # requests before recycling
  error_threshold: 5,              # errors before recycling
  shutdown_timeout: 30_000,        # 30 seconds for graceful shutdown
  enable_health_monitoring: true   # can be disabled for testing
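The recycling-related options above might be consumed roughly as follows. This is a sketch, not the planned implementation: the module name, the shape of the `stats` map (`started_at_ms`, `request_count`, `error_count`), and the use of `>=` for each threshold are assumptions. Only the config keys come from the checklist.

```elixir
defmodule RecyclePolicySketch do
  # Fallbacks mirror the values listed in the configuration above.
  @defaults [max_worker_age: 3_600_000, max_worker_requests: 1000, error_threshold: 5]

  @doc "Merges application config over the defaults."
  def lifecycle_config do
    Keyword.merge(@defaults, Application.get_env(:dspex, :pool_worker_lifecycle, []))
  end

  @doc """
  Returns the reason a worker should be recycled, or nil if it should stay.
  `stats` is a hypothetical map with :started_at_ms, :request_count, :error_count.
  """
  def recycle_reason(stats, now_ms, config \\ lifecycle_config()) do
    cond do
      now_ms - stats.started_at_ms >= config[:max_worker_age] -> :max_age
      stats.request_count >= config[:max_worker_requests] -> :max_requests
      stats.error_count >= config[:error_threshold] -> :error_threshold
      true -> nil
    end
  end
end
```

Taking `now_ms` and `config` as arguments keeps the policy a pure function, which makes the age-based, usage-based, and error-threshold tests deterministic.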