OTP Cleanup Migration Runbook
This runbook provides step-by-step procedures for executing the OTP cleanup migration using the Feature Flag system. It enables safe, gradual transition from Process dictionary to OTP patterns with rollback capability.
Overview
The migration consists of 4 stages, each introducing incremental changes with reduced risk:
- Stage 1 - Registry Migration (Low Risk)
- Stage 2 - Error Context Migration (Medium Risk)
- Stage 3 - Telemetry Migration (High Risk)
- Stage 4 - Enforcement (Critical Risk)
Prerequisites
Before Starting Migration
System Health Check
# Verify system is healthy Foundation.MigrationControl.status() # Check readiness for migration Foundation.MigrationControl.validate_readiness_for_stage(1)
Backup Current State
# Backup database/persistent state if applicable pg_dump production_db > backup_$(date +%Y%m%d_%H%M).sql
Team Notification
- Notify development team
- Ensure rollback person is available
- Set up monitoring alerts
Testing Prerequisites
# Ensure all tests pass mix test # Run credo checks mix credo # Verify no compilation warnings mix compile --warnings-as-errors
Migration Stages
Stage 1: Registry Migration (Low Risk)
What it does: Switches from Process dictionary to ETS-based agent registry.
Flags enabled:
use_ets_agent_registry
enable_migration_monitoring
Execution:
# 1. Validate readiness
Foundation.MigrationControl.validate_readiness_for_stage(1)
# 2. Execute migration
{:ok, result} = Foundation.MigrationControl.migrate_to_stage(1)
# 3. Verify migration
status = Foundation.MigrationControl.status()
IO.inspect(status, label: "Stage 1 Status")
Monitoring Checklist:
- Error rate remains < 1%
- No registry lookup failures
- Agent registration working normally
- ETS memory usage within normal bounds
Success Criteria:
- All agent operations work normally
- No increase in error rates
- Registry lookups succeed consistently
Duration: Monitor for 30 minutes
Stage 2: Error Context Migration (Medium Risk)
What it does: Switches from Process dictionary to Logger metadata for error context.
Flags enabled:
use_logger_error_context
Execution:
# 1. Validate readiness (must be at stage 1)
Foundation.MigrationControl.validate_readiness_for_stage(2)
# 2. Execute migration
{:ok, result} = Foundation.MigrationControl.migrate_to_stage(2)
# 3. Verify error context works
Foundation.ErrorContext.set_context(%{test: "migration_test"})
context = Foundation.ErrorContext.get_context()
Logger.metadata() |> IO.inspect(label: "Logger Metadata")
Monitoring Checklist:
- Error context properly stored in Logger metadata
- Error enrichment working correctly
- No context loss during process transitions
- Structured logging includes context
Success Criteria:
- Error context visible in all log messages
- Context propagation works across processes
- No context-related errors
Duration: Monitor for 2 hours
Stage 3: Telemetry Migration (High Risk)
What it does: Switches from Process dictionary to GenServer/ETS for telemetry operations.
Flags enabled:
use_genserver_telemetry
use_genserver_span_management
use_ets_sampled_events
Execution:
# 1. Validate readiness (must be at stage 2)
Foundation.MigrationControl.validate_readiness_for_stage(3)
# 2. Execute migration
{:ok, result} = Foundation.MigrationControl.migrate_to_stage(3)
# 3. Verify telemetry works
Foundation.Telemetry.Span.with_span(:test_span, %{test: true}) do
Foundation.Telemetry.SampledEvents.emit_event([:test], %{count: 1}, %{source: "migration_test"})
:ok
end
Monitoring Checklist:
- Span stack management working correctly
- No span leaks or orphaned spans
- Event deduplication functioning
- Telemetry performance acceptable
- SpanManager GenServer stable
Success Criteria:
- All span operations work correctly
- No telemetry data loss
- Performance impact < 10%
- SpanManager process stable
Duration: Monitor for 4 hours
Stage 4: Enforcement (Critical Risk)
What it does: Enables strict checking to prevent Process dictionary usage.
Flags enabled:
enforce_no_process_dict
Execution:
# 1. Validate readiness (must be at stage 3)
Foundation.MigrationControl.validate_readiness_for_stage(4)
# 2. Execute migration
{:ok, result} = Foundation.MigrationControl.migrate_to_stage(4)
# 3. Run comprehensive tests
System.cmd("mix", ["test"])
System.cmd("mix", ["credo"])
Monitoring Checklist:
- No Process dictionary violations detected
- All tests passing
- Credo checks clean
- No runtime errors from enforcement
Success Criteria:
- Zero Process dictionary usage violations
- All systems functioning normally
- Clean code quality checks
Duration: Monitor for 24 hours
Rollback Procedures
Automatic Rollback
The system automatically rolls back if:
- Error rate > 5%
- Restart frequency > 10/minute
- Response time degradation > 2x baseline
# Check if automatic rollback triggered
{:rollback_triggered, violations} = Foundation.MigrationControl.check_health_and_rollback()
Manual Rollback
Quick Rollback to Previous Stage
# Rollback one stage
Foundation.MigrationControl.rollback_to_stage(current_stage - 1)
Emergency Rollback (All Stages)
# Complete rollback to stage 0
Foundation.MigrationControl.emergency_rollback("Reason for rollback")
Command Line Emergency Rollback
# If system is unresponsive
./bin/foundation remote_console <<EOF
Foundation.MigrationControl.emergency_rollback("System unresponsive")
EOF
Rollback Verification
After rollback, verify system health:
# Check migration status
status = Foundation.MigrationControl.status()
IO.inspect(status.current_stage, label: "Current Stage")
# Verify system health
health = Foundation.MigrationControl.check_health_and_rollback()
IO.inspect(health, label: "System Health")
# Check all services
Foundation.Services.Supervisor.which_services()
Monitoring and Alerts
Key Metrics to Monitor
Error Rates
# Get current error rate status = Foundation.MigrationControl.status() status.system_health.error_rate
Performance Impact
# Get performance metrics metrics = Foundation.MigrationControl.metrics() metrics.performance_impact
Feature Flag Usage
# Monitor flag state Foundation.FeatureFlags.list_all()
Telemetry Events
The system emits telemetry events for monitoring:
[:foundation, :migration, :stage_enabled]
[:foundation, :migration, :rollback]
[:foundation, :migration, :emergency_rollback]
[:foundation, :feature_flag, :changed]
Health Monitoring
Enable continuous health monitoring:
# Start health monitoring (checks every 30 seconds)
Foundation.MigrationControl.start_health_monitoring(30_000)
Troubleshooting
Common Issues
1. Agent Registry Issues (Stage 1)
Symptoms: Agent registration/lookup failures Solution:
# Check ETS table status
Foundation.Protocols.RegistryETS.status()
# Restart registry if needed
Supervisor.restart_child(Foundation.Services.Supervisor, Foundation.Protocols.RegistryETS)
2. Error Context Missing (Stage 2)
Symptoms: Error context not in logs Solution:
# Check Logger metadata
Logger.metadata() |> IO.inspect()
# Verify flag is enabled
Foundation.FeatureFlags.enabled?(:use_logger_error_context)
3. Span Stack Issues (Stage 3)
Symptoms: Span warnings, memory leaks Solution:
# Check SpanManager status
Foundation.Telemetry.SpanManager.get_stack()
# Restart SpanManager if needed
Supervisor.restart_child(Foundation.Services.Supervisor, Foundation.Telemetry.SpanManager)
4. Process Dictionary Violations (Stage 4)
Symptoms: Credo warnings, runtime errors Solution:
# Find violations
grep -r "Process\.put\|Process\.get" lib/ --include="*.ex"
# Run credo check
mix credo --strict
Emergency Procedures
System Unresponsive
- Connect to remote console
- Execute emergency rollback
- Restart application if needed
- Verify system recovery
Data Loss Concerns
- Immediate emergency rollback
- Check data integrity
- Restore from backup if needed
- Investigate root cause
Performance Degradation
- Check system health metrics
- Consider partial rollback
- Monitor resource usage
- Scale if necessary
Post-Migration
After Each Stage
Update Documentation
- Record lessons learned
- Update monitoring dashboards
- Document any issues encountered
Performance Analysis
# Generate performance report metrics = Foundation.MigrationControl.metrics() # Check for regressions Foundation.MigrationControl.status().system_health
Team Communication
- Notify team of successful migration
- Share any observations
- Plan next stage if applicable
After Complete Migration (Stage 4)
Remove Legacy Code (after 1 week stable)
- Delete Process dictionary fallback code
- Remove temporary compatibility layers
- Clean up unused modules
Update CI/CD
# Enable strict Process dictionary checking grep -r "Process\.put\|Process\.get" lib/ --include="*.ex" && exit 1
Documentation Updates
- Update architecture diagrams
- Revise API documentation
- Update operational runbooks
Celebrate Success! 🎉
- The system is now fully OTP-compliant
- Performance and reliability improved
- Technical debt eliminated
Configuration
Feature Flag Configuration
Set default flags in config/config.exs
:
config :foundation, :feature_flags, %{
use_ets_agent_registry: false,
use_logger_error_context: false,
use_genserver_telemetry: false,
use_genserver_span_management: false,
use_ets_sampled_events: false,
enforce_no_process_dict: false,
enable_migration_monitoring: false
}
Environment-Specific Settings
# config/prod.exs - Start with conservative settings
config :foundation, :feature_flags, %{
enable_migration_monitoring: true
}
# config/staging.exs - Can be more aggressive
config :foundation, :feature_flags, %{
use_ets_agent_registry: true,
use_logger_error_context: true,
enable_migration_monitoring: true
}
Support Contacts
- Development Team: [email protected]
- Operations Team: [email protected]
- On-call Engineer: +1-555-ON-CALL
Appendix
Useful Commands
# Quick status check
Foundation.MigrationControl.status() |> Map.take([:current_stage, :flags])
# Full system health
Foundation.MigrationControl.status().system_health
# Migration history
Foundation.MigrationControl.status().migration_history |> Enum.take(5)
# Flag status
Foundation.FeatureFlags.list_all()
# Emergency rollback
Foundation.MigrationControl.emergency_rollback("Reason")
Reference Links
Remember: This migration eliminates fundamental anti-patterns and aligns the system with 30+ years of battle-tested Erlang/OTP principles. Take your time, monitor carefully, and don’t hesitate to rollback if anything looks suspicious.