Foundation.ProcessRegistry Formal Specification
Phase 0: Complete System Specification (Design-Only)
Executive Summary
The Foundation.ProcessRegistry is a critical infrastructure component providing agent-aware process discovery and coordination for multi-agent BEAM systems. This specification defines the formal mathematical model, invariants, safety properties, and performance guarantees that the implementation must satisfy.
Core Responsibility: Maintain a consistent, high-performance mapping from process identifiers to live processes with rich metadata, enabling sophisticated agent coordination patterns.
1. Domain Modeling and Formal Specifications
1.1 State Invariants (Must Always Be True)
ProcessRegistry State Invariants:
I1. Registry Consistency:
â process_id â registry_table:
process_id â (pid, metadata) â§ Process.alive?(pid) = true
I2. Index Consistency:
â capability â capability_index:
capability â process_id â
â (pid, metadata) : registry_table[process_id] = (pid, metadata) â§
capability â metadata.capabilities
I3. Namespace Isolation:
â p1, p2 â registry_table:
p1.namespace â p2.namespace â
p1.local_id = p2.local_id â§ p1.node = p2.node is valid
I4. Metadata Schema Compliance:
â (pid, metadata) â registry_table:
metadata conforms to agent_metadata() typespec
I5. Monitor Consistency:
â monitor_ref â monitors:
monitor_ref â process_id â
â (pid, metadata) : registry_table[process_id] = (pid, metadata)
I6. Dead Process Cleanup:
â process_id â registry_table:
Process.alive?(registry_table[process_id].pid) = false â
process_id will be removed within cleanup_window_ms
1.2 Safety Properties (Must Never Happen)
ProcessRegistry Safety Properties:
S1. No Dual Registration:
â process_id : |{(pid, metadata) : registry_table[process_id] = (pid, metadata)}| > 1
S2. No Ghost Processes:
â process_id â registry_table : Process.alive?(registry_table[process_id].pid) = false
(except during cleanup_window_ms)
S3. No Index Corruption:
â inconsistent state where index entries point to non-existent registry entries
S4. No Memory Leaks:
registry_memory_usage âĪ live_process_count à average_metadata_size + bounded_overhead
S5. No Race Condition Corruption:
Concurrent operations cannot leave registry in inconsistent state
S6. No Silent Failures:
All registration/lookup failures must return explicit error tuples
S7. No Namespace Pollution:
Process in namespace N cannot be discovered via queries for namespace M where N â M
1.3 Liveness Properties (Must Eventually Happen)
ProcessRegistry Liveness Properties:
L1. Registration Completion:
â register(process_id, pid, metadata) request:
request completes with {:ok} or {:error, reason} within registration_timeout_ms
L2. Lookup Response:
â lookup(process_id) request:
request returns result within lookup_timeout_ms
L3. Dead Process Cleanup:
â process termination event:
cleanup completes within cleanup_window_ms
L4. System Responsiveness:
Registry remains responsive to new requests even under maximum designed load
L5. Index Convergence:
After metadata updates, index queries reflect changes within index_consistency_window_ms
L6. Monitor Processing:
Process death notifications are processed within monitor_processing_window_ms
1.4 Performance Guarantees
ProcessRegistry Performance Specifications:
P1. Lookup Performance:
lookup(process_id) completes in O(1) time regardless of registry size
Target: < 1ms for registries up to 100,000 processes
P2. Registration Performance:
register(process_id, pid, metadata) completes in O(k) time
where k = number of capabilities in metadata
Target: < 5ms for typical agent metadata
P3. Query Performance:
find_by_capability(capability) completes in O(n) time
where n = number of processes with that capability
Target: < 10ms for up to 1,000 matching processes
P4. Memory Efficiency:
Total memory usage âĪ process_count à (base_entry_size + metadata_size) à memory_overhead_factor
where memory_overhead_factor âĪ 2.0
P5. Concurrent Access:
Registry supports unlimited concurrent readers with no blocking
Registry supports up to concurrent_write_limit writers with bounded blocking
P6. Cleanup Performance:
Dead process cleanup completes in O(k) time where k = number of capabilities
Target: < 2ms per cleanup operation
2. Data Structure Specifications
2.1 Process Registry Entry
ProcessEntry ::= {
process_id: {namespace(), node(), local_id()},
pid: pid(),
metadata: AgentMetadata,
registered_at: MonotonicTimestamp,
last_updated: MonotonicTimestamp
}
Mathematical Properties:
- process_id must be unique within registry
- pid must be alive when entry exists
- registered_at âĪ last_updated
- namespace cannot be nil
- metadata must satisfy agent_metadata() typespec
2.2 Agent Metadata Structure
AgentMetadata ::= %{
type: AgentType, // Required for type queries
capabilities: [Capability], // Required for capability queries
resources: ResourceMap, // Optional resource tracking
coordination_variables: [VariableName], // Optional coordination context
health_status: HealthStatus, // Required for health queries
node_affinity: [Node], // Optional distribution hint
created_at: DateTime, // Required audit trail
last_health_check: DateTime, // Required health tracking
custom: Map // Optional extensibility
}
Type Constraints:
- type â {:agent, :service, :coordinator, :resource_manager}
- capabilities â valid_capability_atoms()
- health_status â {:healthy, :degraded, :unhealthy}
- created_at âĪ last_health_check âĪ current_time()
2.3 Index Structure Design
Index Architecture:
Main Registry Table:
Key: process_id = {namespace, node, local_id}
Value: {pid, metadata}
Properties: unique keys, read_concurrency, write_concurrency
Capability Index:
Key: capability (atom)
Value: process_id
Properties: duplicate_bag (one capability -> many processes)
Type Index:
Key: agent_type
Value: process_id
Properties: duplicate_bag (one type -> many processes)
Health Index:
Key: health_status
Value: process_id
Properties: duplicate_bag (one status -> many processes)
Monitor Map:
Key: monitor_reference
Value: process_id
Properties: unique keys, in-memory GenServer state
3. Concurrency Model Verification
3.1 Read Operations (Lockfree)
Read Operation Concurrency Model:
Lookup Operations:
âââ lookup(process_id)
â - Lock-free ETS :lookup operation
â - Process.alive/1 validation (lockfree)
â - Atomic read of consistent state
â - No writer blocking
âââ find_by_capability(capability)
â - Lock-free ETS :lookup on capability_index
â - Lock-free main table lookup for each result
â - Process.alive/1 validation for each
â - Consistent snapshot semantics
âââ Statistical queries (list_all, find_by_type, find_by_health)
- Lock-free ETS pattern matching
- Eventual consistency guarantees
- No blocking on concurrent writes
Consistency Guarantee: All read operations return point-in-time consistent snapshots
3.2 Write Operations (Synchronized)
Write Operation Concurrency Model:
Registration Flow:
1. GenServer.call ensures serialized access to state
2. ETS :lookup for existence check (atomic)
3. ETS :insert operations (atomic per table)
4. Process.monitor (atomic)
5. State update (atomic within GenServer)
Ordering Guarantee: namespace â process_id â metadata â indexes
Atomicity: Either all ETS updates succeed or none (via GenServer serialization)
Cleanup Flow (Triggered by process death):
1. Process DOWN message received
2. Lookup process_id from monitor_ref
3. Atomic removal from all ETS tables
4. Monitor cleanup
5. State update
Deadlock Prevention: Single lock ordering, no nested GenServer calls
3.3 Process Monitoring Strategy
Process Lifecycle Management:
Monitor Creation:
- Process.monitor(pid) called immediately after registration
- monitor_ref stored in GenServer state
- Automatic cleanup on process termination
Death Detection:
- {:DOWN, monitor_ref, :process, pid, reason} message
- O(1) lookup of process_id via monitor_ref
- Immediate cleanup of all indexes
Race Condition Handling:
- Registration of dead process: prevented by Process.alive/1 check
- Death during registration: cleanup handles orphaned entries
- Concurrent death and unregistration: idempotent cleanup operations
4. Failure Mode Analysis with Recovery Guarantees
4.1 Process Registry Crash Recovery
Failure Scenario: ProcessRegistry GenServer crashes
Impact Analysis:
- All ETS tables are lost (linked to GenServer process)
- All process monitors are lost
- Registry becomes unavailable
Recovery Strategy:
1. ProcessRegistry restarts via OTP supervision
2. ETS tables recreated with same configuration
3. No automatic state reconstruction (by design)
4. Processes must re-register themselves
Recovery Time: < 100ms for restart + re-registration
Data Loss: Acceptable (processes should re-register on startup)
Availability: Brief unavailability during restart
Prevention: Supervisor with :permanent restart, exponential backoff
4.2 Memory Exhaustion
Failure Scenario: Registry memory usage exceeds system limits
Early Warning Indicators:
- ETS table size monitoring
- Memory usage growth rate tracking
- Process registration rate alerts
Response Strategy:
1. Immediate: Stop accepting new registrations
2. Short-term: Force cleanup of stale entries
3. Medium-term: Implement registration limits per namespace
4. Long-term: Scale to clustered deployment
Graceful Degradation:
- Core functionality (lookup) remains available
- New registrations fail with {:error, :registry_full}
- Cleanup prioritizes by process age and health status
Recovery: Automatic when memory usage drops below threshold
4.3 Network Partition (Future)
Failure Scenario: Network partition in distributed deployment
Current Behavior: Single-node deployment unaffected
Future Distributed Behavior:
- Each partition maintains local consistency
- No split-brain registry corruption
- Partition healing triggers reconciliation
Reconciliation Strategy:
- Process liveness verification across nodes
- Conflict resolution: prefer local registrations
- Metadata merge with last-writer-wins semantics
Design Consideration: Process IDs include node() for partition tolerance
4.4 Concurrent Write Storm
Failure Scenario: Massive concurrent registration requests
Protection Mechanisms:
1. GenServer serialization prevents corruption
2. ETS write_concurrency enables parallel index updates
3. Process.monitor rate limiting (system-level)
4. Registration timeout prevents indefinite queuing
Backpressure Strategy:
- Queue depth monitoring in GenServer
- Adaptive timeout adjustment
- Circuit breaker pattern for protection
- Graceful rejection with {:error, :system_busy}
Performance Degradation:
- Linear degradation with request rate
- No corruption or data loss
- Consistent response to overload
5. API Contract Specifications
5.1 Registration Contracts
@contract register(process_id(), pid(), agent_metadata()) :: :ok | {:error, reason()}
where:
requires:
- Process.alive?(pid) = true
- valid_process_id?(process_id) = true
- valid_metadata?(metadata) = true
ensures:
- lookup(process_id) returns {:ok, pid, metadata}
- process appears in relevant indexes
- process death triggers automatic cleanup
exceptional:
- {:error, :already_registered} when process_id already exists
- {:error, :invalid_metadata} when metadata invalid
- {:error, :dead_process} when Process.alive?(pid) = false
timing:
- completes within 10ms under normal load
- completes within 50ms under high load
concurrency:
- safe for unlimited concurrent calls
- serialized within single registry instance
5.2 Lookup Contracts
@contract lookup(process_id()) :: {:ok, pid(), metadata()} | :error
where:
requires:
- valid_process_id?(process_id) = true
ensures:
- if {:ok, pid, metadata} then Process.alive?(pid) = true
- if :error then process_id not in registry or process dead
exceptional:
- :error for non-existent or dead processes
timing:
- completes within 1ms under normal load
- completes within 5ms under extreme load
concurrency:
- completely lock-free
- unlimited concurrent readers
- never blocks on writers
5.3 Query Contracts
@contract find_by_capability(capability()) :: {:ok, [registry_entry()]}
where:
requires:
- is_atom(capability) = true
ensures:
- all returned processes have capability in metadata.capabilities
- all returned pids are alive
- results represent consistent point-in-time snapshot
timing:
- O(n) where n = number of processes with capability
- completes within 10ms for up to 1000 matching processes
concurrency:
- lock-free operation
- eventually consistent with concurrent metadata updates
6. Integration Point Mappings
6.1 Foundation Layer Abstractions
ProcessRegistry Integration Architecture:
Foundation.ProcessRegistry â MABEAM.AgentRegistry
âââ Mapping: process metadata â agent capabilities
âââ Preserves: process lifecycle semantics + OTP supervision
âââ Guarantees: agent discovery consistency
âââ Extensions: coordination variable tracking
Foundation.ProcessRegistry â Foundation.ServiceRegistry
âââ Mapping: generic processes â service discovery
âââ Preserves: namespace isolation + health status
âââ Guarantees: service availability tracking
âââ Extensions: load balancing hints
Foundation.ProcessRegistry â Foundation.TelemetryService
âââ Mapping: registration events â telemetry metrics
âââ Preserves: process lifecycle events + metadata changes
âââ Guarantees: metric accuracy and timeliness
âââ Extensions: performance monitoring integration
6.2 Performance Integration Requirements
Integration Performance Specifications:
MABEAM Agent Coordination:
- Agent capability queries: < 5ms for coordination decisions
- Agent health updates: < 2ms for real-time monitoring
- Multi-agent discovery: < 10ms for teams of up to 100 agents
Service Discovery:
- Service availability checks: < 1ms for critical path
- Service metadata retrieval: < 2ms for configuration
- Service health monitoring: < 500ms polling interval
Telemetry Integration:
- Registration event emission: < 100Ξs overhead
- Metadata change events: < 200Ξs overhead
- Query performance metrics: < 50Ξs measurement overhead
7. Success Criteria and Verification
7.1 Formal Verification Requirements
Verification Checklist:
⥠All invariants I1-I6 verified via property-based testing
⥠All safety properties S1-S7 verified via chaos engineering
⥠All liveness properties L1-L6 verified via load testing
⥠All performance guarantees P1-P6 empirically validated
⥠Concurrency model verified via concurrent property testing
⥠Failure recovery verified via fault injection testing
⥠API contracts verified via contract testing framework
7.2 Implementation Compliance Gates
Quality Gates for Implementation:
Phase 1 Gate: Contract Implementation
- Mock implementation satisfies all API contracts
- Property-based tests validate all invariants
- Integration tests verify MABEAM compatibility
Phase 2 Gate: Production Implementation
- Performance benchmarks meet all specifications
- Chaos testing validates failure recovery
- Load testing confirms concurrency guarantees
- Memory stability verified over 24-hour continuous operation
Phase 3 Gate: Production Readiness
- Security audit passes
- Documentation complete and accurate
- Monitoring and alerting configured
- Runbook procedures verified
8. Conclusion
This specification provides the mathematical foundation for a production-grade ProcessRegistry that serves as the cornerstone of Foundation’s multi-agent coordination infrastructure. The formal invariants, safety properties, and performance guarantees ensure the implementation will be reliable, efficient, and scalable.
Next Steps: Proceed to Phase 1 (Contract Implementation with Verification) only after this specification is reviewed and approved.
Critical Success Factor: Implementation must provably satisfy every specification in this document, not just pass existing tests.