Process Dictionary Migration - Operational Runbook
Purpose: Step-by-step procedures for safely migrating from Process dictionary to OTP patterns in production.
Pre-Migration Checklist
System Requirements
- Elixir 1.14+ (for latest Logger metadata features)
- OTP 25+ (for ETS optimizations)
- Monitoring dashboard access
- Rollback procedures tested in staging
- Team notification sent
Baseline Metrics Collection
# Collect 24-hour baseline before migration
mix run scripts/collect_baseline_metrics.exs
# Key metrics to track:
# - Registry operation latency (p50, p95, p99)
# - Error context overhead
# - Memory usage (ETS tables)
# - Process count
# - Telemetry event throughput
Migration Sequence
Stage 1: Enable Monitoring (Day 1)
# In production console
Foundation.FeatureFlags.enable(:enable_migration_monitoring)
# Verify telemetry events
:telemetry.execute([:foundation, :migration, :status], %{stage: 1}, %{})
Monitor for 4 hours:
- Check dashboard for migration telemetry events
- Verify no performance degradation
- Confirm baseline metrics stable
Rollback if needed:
Foundation.FeatureFlags.disable(:enable_migration_monitoring)
Stage 2: Registry Migration (Day 2)
# Enable ETS-based registry
Foundation.FeatureFlags.enable(:use_ets_agent_registry)
# Verify operation
Foundation.Protocols.RegistryAny.list_agents()
# Should return agent list from ETS
Validation Steps:
Register test agent:
test_pid = spawn(fn -> Process.sleep(60_000) end) Foundation.Protocols.RegistryAny.register_agent(:test_migration, test_pid)
Verify retrieval:
{:ok, ^test_pid} = Foundation.Protocols.RegistryAny.get_agent(:test_migration)
Check ETS table:
:ets.info(:foundation_agent_registry) # Should show table statistics
Monitor for 24 hours:
- Registry operation latency < 5ms p99
- No “agent not found” errors increase
- Memory usage stable (ETS table < 10MB)
Rollback if needed:
Foundation.FeatureFlags.disable(:use_ets_agent_registry)
# Agents will revert to Process dictionary
Stage 3: Error Context Migration (Day 3)
# Enable Logger metadata for error context
Foundation.FeatureFlags.enable(:use_logger_error_context)
# Test context propagation
Foundation.ErrorContext.set_context(%{request_id: "test-123"})
Logger.info("Test message")
# Should include error_context in metadata
Validation Steps:
Create nested context:
Foundation.ErrorContext.with_context(%{operation: "validation"}, fn -> Foundation.ErrorContext.with_context(%{step: "parse"}, fn -> Logger.error("Test error with context") end) end)
Verify in logs:
grep "error_context" /var/log/app/current | tail -20
Monitor for 24 hours:
- Log entries include error context
- No context loss during error propagation
- Logger performance acceptable
Rollback if needed:
Foundation.FeatureFlags.disable(:use_logger_error_context)
Stage 4: Telemetry Migration (Days 4-5)
4A: Span Management (Day 4)
# Start span GenServer
{:ok, _} = Foundation.Telemetry.Span.start_link()
# Enable GenServer-based spans
Foundation.FeatureFlags.enable(:use_genserver_span_management)
# Test span creation
span_id = Foundation.Telemetry.Span.start_span("test_operation", %{user_id: 123})
Process.sleep(100)
Foundation.Telemetry.Span.end_span(span_id)
Validation:
- Telemetry events fired for span start/end
- Span duration metrics accurate
- No span stack corruption
4B: Event Deduplication (Day 5)
# Start sampled events GenServer
{:ok, _} = Foundation.Telemetry.SampledEvents.start_link()
# Enable ETS-based deduplication
Foundation.FeatureFlags.enable(:use_ets_sampled_events)
# Test deduplication
Foundation.Telemetry.SampledEvents.emit_event(
[:test, :event],
%{count: 1},
%{dedup_key: "test-123"}
)
Monitor both telemetry components:
- Event throughput maintained
- Deduplication working (no duplicate events)
- Batch processing on schedule
- Memory usage bounded
Rollback if needed:
Foundation.FeatureFlags.disable(:use_genserver_span_management)
Foundation.FeatureFlags.disable(:use_ets_sampled_events)
Stage 5: Full Enforcement (Day 8)
# Enable all OTP patterns
Foundation.FeatureFlags.set(:use_ets_agent_registry, true)
Foundation.FeatureFlags.set(:use_logger_error_context, true)
Foundation.FeatureFlags.set(:use_genserver_telemetry, true)
Foundation.FeatureFlags.set(:use_genserver_span_management, true)
Foundation.FeatureFlags.set(:use_ets_sampled_events, true)
# Enable strict enforcement
Foundation.FeatureFlags.enable(:enforce_no_process_dict)
Final Validation:
Run integration test suite:
mix test test/foundation/otp_cleanup_integration_test.exs
Check for Process dictionary usage:
mix credo --strict # Should show 0 Process dictionary warnings
Performance validation:
Foundation.OTPCleanup.Benchmarks.run_all() # Compare with baseline metrics
Monitoring During Migration
Key Metrics to Watch
System Health
%{ registry_size: length(Foundation.Protocols.RegistryAny.list_agents()), error_rate: ErrorTracker.rate_per_minute(), memory_usage: :erlang.memory(:ets), process_count: length(Process.list()) }
Performance Metrics
- Registry lookup latency (target: < 5ms p99)
- Error context overhead (target: < 100μs)
- Telemetry throughput (target: > 10k events/sec)
- Memory growth rate (target: < 1MB/hour)
Error Patterns
- Watch for “undefined function” errors
- Monitor for ETS table not found errors
- Check for GenServer timeout errors
Alert Thresholds
Configure alerts for:
- Registry lookup latency > 10ms (p99)
- Error rate increase > 20%
- Memory usage > 500MB for ETS
- Process count increase > 50%
Rollback Procedures
Emergency Rollback (All Stages)
# Disable all OTP features immediately
Foundation.OTPCleanup.emergency_rollback!()
# Verify rollback
Foundation.FeatureFlags.all()
|> Enum.filter(fn {k, v} -> String.contains?(to_string(k), "otp") end)
# All should be false
Selective Rollback
For specific component issues:
# Registry issues
Foundation.FeatureFlags.disable(:use_ets_agent_registry)
# Error context issues
Foundation.FeatureFlags.disable(:use_logger_error_context)
# Telemetry issues
Foundation.FeatureFlags.disable(:use_genserver_telemetry)
Foundation.FeatureFlags.disable(:use_ets_sampled_events)
Data Recovery
If ETS tables are corrupted:
# Backup current state
Foundation.OTPCleanup.backup_ets_tables("/tmp/ets_backup")
# Clear and rebuild
Foundation.OTPCleanup.rebuild_ets_tables()
# Restore if needed
Foundation.OTPCleanup.restore_ets_tables("/tmp/ets_backup")
Troubleshooting Guide
Common Issues and Solutions
1. “ETS table not found” errors
# Check if table exists
:ets.whereis(:foundation_agent_registry)
# Recreate if missing
Foundation.Protocols.RegistryAny.ensure_table_exists()
2. High memory usage from ETS
# Check table sizes
:ets.all()
|> Enum.map(fn table -> {:ets.info(table, :name), :ets.info(table, :memory)} end)
|> Enum.sort_by(&elem(&1, 1), :desc)
# Clear stale entries
Foundation.OTPCleanup.cleanup_ets_tables()
3. Logger metadata not propagating
# Verify Logger configuration
Application.get_env(:logger, :metadata)
# Should include :error_context
# Check current metadata
Logger.metadata()
4. GenServer timeouts
# Increase timeout for heavy load
Application.put_env(:foundation, :span_genserver_timeout, 10_000)
# Check GenServer state
:sys.get_state(Foundation.Telemetry.Span)
Debug Commands
# Full system diagnostic
Foundation.OTPCleanup.Diagnostics.run_all()
# Component-specific checks
Foundation.OTPCleanup.Diagnostics.check_registry()
Foundation.OTPCleanup.Diagnostics.check_error_context()
Foundation.OTPCleanup.Diagnostics.check_telemetry()
# Generate diagnostic report
Foundation.OTPCleanup.Diagnostics.generate_report("/tmp/otp_diagnostic.html")
Post-Migration Tasks
1. Remove Legacy Code (Day 10)
Once running stable for 48 hours:
# Remove Process dictionary code paths
mix otp_cleanup.remove_legacy
# Run tests to verify
mix test
2. Update Documentation
- Update API docs to reflect new patterns
- Remove Process dictionary examples
- Add OTP pattern examples
3. Team Training
- Conduct code review of new patterns
- Share performance impact analysis
- Document best practices
4. Metrics Analysis
# Generate migration report
Foundation.OTPCleanup.Reports.generate_migration_summary(
start_date: ~D[2025-07-01],
end_date: ~D[2025-07-10]
)
Contact Information
On-Call Engineer: Check PagerDuty
Migration Lead: [Your Name]
Escalation Path:
- On-call engineer
- Platform team lead
- CTO for emergency rollback approval
Monitoring Dashboards:
- System Health: https://monitoring/dashboards/foundation
- Migration Progress: https://monitoring/dashboards/otp-migration
- ETS Metrics: https://monitoring/dashboards/ets-usage
Documentation:
- Architecture Diagrams:
docsProcessRegistry/ARCHITECTURE.md
- Developer Guide:
docsProcessRegistry/DEVELOPER_GUIDE.md
- Migration Plan:
JULY_1_2025_OTP_CLEANUP_2121.md