PHASE2 CHECKLIST

Documentation for PHASE2_CHECKLIST from the Dspex repository.

Phase 2: Worker Lifecycle Management - Implementation Checklist

Overview

Phase 2 implements worker lifecycle management: a worker state machine, health monitoring, worker recycling, and graceful shutdown.

New Files to Create

lib/dspex/python_bridge/pool_worker_state.ex
- Purpose: Define worker state machine with proper state transitions
- States: :initializing, :ready, :busy, :draining, :terminated
lib/dspex/python_bridge/pool_health_monitor.ex
- Purpose: Monitor worker health and trigger recycling when needed
- Features: Periodic health checks, error tracking, automatic recovery
test/pool_worker_lifecycle_test.exs
- Purpose: Comprehensive tests for worker lifecycle transitions
- Coverage: State transitions, health checks, recycling policies
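
The state list above can be sketched as a small pure-function module. This is a minimal illustration, not the contents of pool_worker_state.ex: the module name `PoolWorkerStateSketch` and the transition table are assumptions, since the checklist names the states but does not say which transitions are allowed.

```elixir
defmodule PoolWorkerStateSketch do
  @moduledoc "Hypothetical sketch of the worker state machine; not the actual DSPex module."

  # States taken from the checklist.
  @states [:initializing, :ready, :busy, :draining, :terminated]

  # ASSUMPTION: this transition table is illustrative only.
  # Every state may move to :terminated except :terminated itself.
  @transitions %{
    initializing: [:ready, :terminated],
    ready: [:busy, :draining, :terminated],
    busy: [:ready, :draining, :terminated],
    draining: [:terminated],
    terminated: []
  }

  def states, do: @states

  # Returns true when `to` is listed as a successor of `from`.
  def valid_transition?(from, to) do
    to in Map.get(@transitions, from, [])
  end

  # Returns {:ok, new_state} or a tagged error for rejected transitions.
  def transition(from, to) do
    if valid_transition?(from, to) do
      {:ok, to}
    else
      {:error, {:invalid_transition, from, to}}
    end
  end
end
```

Keeping the table as data makes the "all valid transitions" and "invalid transitions are rejected" tests a matter of iterating over state pairs.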

Existing Files to Modify

lib/dspex/python_bridge/pool_worker_v2.ex
- Expected changes:
  - Integrate state machine for proper lifecycle management
  - Add health check callbacks
  - Implement graceful shutdown with session draining
  - Add worker recycling based on age/usage
lib/dspex/python_bridge/session_pool_v2.ex
- Expected changes:
  - Track worker states in pool metadata
  - Implement worker recycling policies
  - Add health monitoring integration
  - Handle worker state transitions
lib/dspex/python_bridge/pool_supervisor_v2.ex
- Expected changes:
  - Add health monitor to supervision tree
  - Configure restart strategies for different failure modes
  - Implement progressive backoff for worker restarts
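
One way to read "progressive backoff for worker restarts" is a capped exponential delay. The sketch below assumes that interpretation. The module name `RestartBackoffSketch`, the 1-second base, and the 30-second cap are hypothetical values, not taken from the repository.

```elixir
defmodule RestartBackoffSketch do
  @moduledoc "Hypothetical capped exponential backoff for worker restarts."

  # ASSUMPTION: base delay and cap are illustrative values.
  @base_ms 1_000
  @max_ms 30_000
  # Cap the exponent so very large attempt counts do not build huge integers.
  @max_exponent 10

  # `attempt` is the zero-based count of consecutive restart attempts.
  def delay_ms(attempt) when is_integer(attempt) and attempt >= 0 do
    exponent = min(attempt, @max_exponent)
    min(@base_ms * Integer.pow(2, exponent), @max_ms)
  end
end
```

A supervisor-side process could then schedule the restart with `Process.send_after(self(), :restart_worker, RestartBackoffSketch.delay_ms(attempt))`, where `:restart_worker` is a placeholder message name.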

Tests to Write

Worker State Transition Tests
- Test all valid state transitions
- Verify invalid transitions are rejected
- Test concurrent state changes
Health Monitoring Tests
- Worker health check success/failure scenarios
- Automatic recycling triggers
- Health status reporting
Graceful Shutdown Tests
- Session draining during shutdown
- Timeout handling for long-running operations
- Complete resource cleanup (ports and processes released)
Worker Recycling Tests
- Age-based recycling
- Usage-based recycling
- Error threshold recycling
- Pool size maintenance during recycling
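
The three recycling triggers listed above (age, usage, error threshold) can be expressed as a single pure decision function, which keeps the recycling tests free of process setup. This is a sketch under assumptions: the module name `RecyclePolicySketch`, the shape of the worker map, and the order in which the triggers are checked are all hypothetical. The default limits mirror the configuration values given later in this checklist.

```elixir
defmodule RecyclePolicySketch do
  @moduledoc "Hypothetical recycling decision based on age, usage, and error count."

  # Defaults mirror the proposed :pool_worker_lifecycle configuration.
  @defaults [max_worker_age: 3_600_000, max_worker_requests: 1_000, error_threshold: 5]

  # ASSUMPTION: `worker` is a map with :started_at_ms, :request_count, and
  # :error_count keys; the real worker struct may look different.
  def recycle_reason(worker, now_ms, opts \\ []) do
    opts = Keyword.merge(@defaults, opts)

    cond do
      worker.error_count >= opts[:error_threshold] ->
        {:recycle, :error_threshold}

      worker.request_count >= opts[:max_worker_requests] ->
        {:recycle, :max_requests}

      now_ms - worker.started_at_ms >= opts[:max_worker_age] ->
        {:recycle, :max_age}

      true ->
        :keep
    end
  end
end
```

Passing `now_ms` in explicitly, rather than reading the clock inside the function, lets the age-based test run without sleeping.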

Dependencies on Phase 1

All Phase 1 fixes are complete and required for Phase 2:

✅ NimblePool return values fixed
✅ Port validation implemented
✅ Test assertions corrected
✅ Test guards added
✅ Service detection improved

Main Risk Areas

State Synchronization
- Challenge: Keeping worker state consistent between pool and worker process
- Mitigation: Use gen_statem or similar for strict state management
Race Conditions
- Challenge: Worker might change state during checkout/checkin
- Mitigation: Make state transitions atomic, for example by routing every state change through a single owning process
Resource Leaks
- Challenge: Ensuring ports and processes are cleaned up properly
- Mitigation: Comprehensive cleanup in terminate callbacks
Performance Impact
- Challenge: Health checks and state tracking add overhead
- Mitigation: Configurable check intervals, efficient state storage
Backward Compatibility
- Challenge: Maintaining compatibility with existing pool users
- Mitigation: Keep external API unchanged, internal refactoring only
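
To illustrate the "configurable check intervals" mitigation, here is a minimal GenServer that runs a health check on a timer and counts consecutive failures. It is a sketch, not the planned pool_health_monitor.ex: the module name `HealthMonitorSketch`, the `:check_fun` option, and the `:failures` call are hypothetical.

```elixir
defmodule HealthMonitorSketch do
  @moduledoc "Hypothetical periodic health checker with a configurable interval."
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    state = %{
      # Interval in milliseconds; default mirrors health_check_interval.
      interval: Keyword.get(opts, :health_check_interval, 30_000),
      # ASSUMPTION: a zero-arity function returning :ok when the worker is healthy.
      check_fun: Keyword.fetch!(opts, :check_fun),
      failures: 0
    }

    schedule(state.interval)
    {:ok, state}
  end

  @impl true
  def handle_info(:health_check, state) do
    # Reset the counter on success, increment it on any other result.
    failures =
      case state.check_fun.() do
        :ok -> 0
        _other -> state.failures + 1
      end

    schedule(state.interval)
    {:noreply, %{state | failures: failures}}
  end

  @impl true
  def handle_call(:failures, _from, state) do
    {:reply, state.failures, state}
  end

  # Re-arm the timer after each check so checks never overlap.
  defp schedule(interval), do: Process.send_after(self(), :health_check, interval)
end
```

A caller could compare the consecutive-failure count against `error_threshold` to decide when to recycle a worker; that wiring is left out here.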

Success Criteria

Workers transition through states correctly
Unhealthy workers are automatically recycled
Graceful shutdown completes within timeout
No resource leaks during normal operation
Pool maintains target size during recycling
All existing tests continue to pass

Implementation Order

1. Implement worker state machine
2. Add health monitoring
3. Implement graceful shutdown
4. Add worker recycling policies
5. Comprehensive testing
6. Performance optimization

Configuration Options to Add

config :dspex, :pool_worker_lifecycle,
  health_check_interval: 30_000,      # 30 seconds
  max_worker_age: 3_600_000,          # 1 hour
  max_worker_requests: 1_000,         # requests before recycling
  error_threshold: 5,                 # errors before recycling
  shutdown_timeout: 30_000,           # 30 seconds for graceful shutdown
  enable_health_monitoring: true      # can be disabled for testing
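
The options above could be read at runtime with `Application.get_env/3`, falling back to defaults when the key is not configured. The module name `LifecycleConfigSketch` is hypothetical, and the assumption that the defaults equal the values shown above is mine; the checklist does not say how the options are loaded.

```elixir
defmodule LifecycleConfigSketch do
  @moduledoc "Hypothetical loader for the :pool_worker_lifecycle options."

  # ASSUMPTION: defaults equal the values proposed in the checklist.
  @defaults [
    health_check_interval: 30_000,
    max_worker_age: 3_600_000,
    max_worker_requests: 1_000,
    error_threshold: 5,
    shutdown_timeout: 30_000,
    enable_health_monitoring: true
  ]

  # Merges configured values over the defaults; configured keys win.
  def load do
    overrides = Application.get_env(:dspex, :pool_worker_lifecycle, [])
    Keyword.merge(@defaults, overrides)
  end
end
```

In a test environment, setting `enable_health_monitoring: false` under the same key would override only that option and leave the others at their defaults.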