
029 DISTRIBUTED REVIEW

Documentation for 029_DISTRIBUTED_REVIEW from the Foundation repository.

Codebase Review: Distributed Computing Readiness

Executive Summary

This document assesses the Foundation codebase for its readiness to be extended into a distributed system using libcluster and Horde. The current architecture is well positioned for distribution but is not yet distributed: it was designed with distribution in mind, with clear abstractions and a separation of concerns that will greatly simplify the transition, but several key components are implemented with a single-node focus and will require modification to work in a clustered environment.

Core Strengths for Distribution

The codebase has a solid foundation for evolving into a distributed system:

  1. Pluggable Backends (ProcessRegistry.Backend): The decision to define a Backend behavior for the ProcessRegistry is a critical architectural strength: it allows the underlying storage mechanism to be swapped out, and the existence of a (placeholder) Horde backend demonstrates foresight. To make the registry distributed, one would only need to implement the Backend behavior on top of Horde.Registry and change the application configuration (see the configuration sketch following this list). This is a clean, low-friction path to a distributed service registry.

  2. Service-Oriented Architecture: The codebase is broken down into distinct services (ConfigServer, EventStore, AgentRegistry, etc.), each with a well-defined responsibility. This modularity is essential for distribution, as it allows different services to potentially run on different nodes in the cluster.

  3. Stateless Modules: Many modules, such as Foundation.MABEAM.Agent, are stateless facades that delegate their work to stateful services. This is a good pattern for distribution, as it decouples the business logic from the state, making it easier to reason about where state lives in a cluster.

  4. Abstraction of Communication (Foundation.MABEAM.Comms): The Comms module centralizes inter-agent communication. While the current implementation uses GenServer.call, this module can be extended to send messages transparently to agents on remote nodes, either through processes registered in a distributed registry such as Horde.Registry or over standard BEAM distribution (a messaging sketch also follows this list).
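
As a concrete illustration of how low-friction the backend swap could be, the following configuration sketch shows the general shape of the change. The :foundation application name and the :backend key are assumptions about how Foundation.ProcessRegistry reads its configuration; the real keys may differ.

```elixir
# config/config.exs — a minimal sketch of swapping the registry backend.
# The :foundation app name and the :backend key are assumed configuration
# conventions, not confirmed keys from the Foundation codebase.
config :foundation, Foundation.ProcessRegistry,
  backend: Foundation.ProcessRegistry.Backend.Horde
```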
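
Because BEAM distribution makes message passing location-transparent, extending Comms largely comes down to resolving agents through a cluster-wide registry. The sketch below is illustrative only: the module, function, and registry names are placeholders, not the actual Foundation.MABEAM.Comms API.

```elixir
# Illustrative only — not the real Foundation.MABEAM.Comms API.
defmodule DistributedComms do
  # Assumed Horde.Registry process name.
  @registry MyApp.HordeRegistry

  # GenServer.call/3 behaves the same whether the resolved pid is local or on a
  # remote node; the via tuple lets Horde.Registry find it cluster-wide.
  def request(agent_id, message, timeout \\ 5_000) do
    GenServer.call({:via, Horde.Registry, {@registry, agent_id}}, message, timeout)
  end
end
```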

Areas Requiring Modification for Distribution

While the foundation is strong, significant work is needed to make the system fully distributed.

  1. Single-Point-of-Failure GenServers: Several core components are single GenServer processes, which would become bottlenecks or single points of failure in a distributed system. These include:

    • Foundation.MABEAM.Coordination
    • Foundation.MABEAM.Economics
    • Foundation.MABEAM.AgentRegistry
    • Foundation.MABEAM.LoadBalancer
    • Foundation.MABEAM.PerformanceMonitor

    To be truly distributed, these services would need to be re-architected. The most natural approach would be to use Horde.DynamicSupervisor to distribute the agents/sessions they manage across the cluster and Horde.Registry to locate them (a sketch of this supervision pattern follows this list).

  2. Local ETS Tables: The ProcessRegistry.Backend.ETS and Coordination.Primitives modules use local ETS tables for storage. In a distributed environment, this data would not be shared across nodes. The Horde backend for the process registry is the intended solution here. For the coordination primitives, a distributed consensus protocol such as Raft (via a library that implements it) would be needed to manage distributed state.

  3. Hardcoded Local PIDs: The code frequently looks up a service in the registry and then uses its PID for communication. This works on a single node, but in a distributed system, the process may be on a remote node. While Elixir’s distribution makes this transparent, the logic for service discovery needs to be robust to handle remote PIDs. The use of ServiceRegistry and ProcessRegistry abstracts this well, so the primary change would be in the backend implementation.

  4. Node.self() and Node.list(): The Coordination.Primitives module uses Node.self() and Node.list() to determine cluster membership. This is the correct approach, but it assumes that libcluster is configured and running to manage the node list dynamically. The code is ready for libcluster, but the dependency and configuration are not yet present.

  5. RPC Usage: The Coordination.Primitives module uses :rpc.call for inter-node communication. While functional, this is generally less robust than sending messages to registered processes. A better long-term solution would be to have distributed processes (managed by Horde) that communicate via GenServer.call or GenServer.cast (see the call-style comparison after this list).
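
The supervision side of that re-architecture tends to follow a standard Horde recipe: start a Horde.Registry and a Horde.DynamicSupervisor on every node, then start workers through the distributed supervisor. The sketch below uses placeholder module names (MyApp.*, MyApp.SessionWorker) rather than Foundation modules, and it assumes Horde's members: :auto option (available in recent Horde releases) together with a running libcluster topology.

```elixir
# Placeholder modules; not part of the Foundation codebase.
defmodule MyApp.DistributedSupervision do
  # Add these children to the application supervision tree on every node.
  # With members: :auto, each Horde component discovers its peers from the
  # connected-node list that libcluster maintains.
  def child_specs do
    [
      {Horde.Registry, [name: MyApp.HordeRegistry, keys: :unique, members: :auto]},
      {Horde.DynamicSupervisor,
       [name: MyApp.HordeSupervisor, strategy: :one_for_one, members: :auto]}
    ]
  end

  # Start a coordination session (or agent) somewhere in the cluster; Horde
  # picks the node and restarts the process elsewhere if that node goes down.
  def start_session(session_id, args) do
    spec = %{
      id: session_id,
      start: {MyApp.SessionWorker, :start_link, [session_id, args]},
      restart: :transient
    }

    Horde.DynamicSupervisor.start_child(MyApp.HordeSupervisor, spec)
  end
end
```

The worker itself would typically register under {:via, Horde.Registry, {MyApp.HordeRegistry, session_id}} in its start_link, so callers on any node can reach it by id.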
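
The difference between the two call styles is small in code but meaningful operationally: the caller stops naming nodes and starts naming processes. Module and registry names below are placeholders.

```elixir
defmodule RemoteCallStyles do
  # Current style: the caller must know which node runs the code.
  # SomeCoordinationModule stands in for whatever module :rpc targets today.
  def status_via_rpc(node) do
    :rpc.call(node, SomeCoordinationModule, :status, [])
  end

  # Distributed-process style: Horde.Registry resolves the worker wherever it
  # lives, and restarts are handled by Horde rather than by the caller.
  def status_via_registry(worker_id) do
    GenServer.call({:via, Horde.Registry, {MyApp.HordeRegistry, worker_id}}, :status)
  end
end
```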

Path to Distribution with libcluster and Horde

  1. Add Dependencies: Add libcluster and horde to the mix.exs file.
  2. Configure libcluster: Configure a clustering strategy (e.g., Gossip) in config/config.exs so that nodes discover each other automatically (a dependency and topology sketch follows this list).
  3. Implement Horde Backend: Complete the Foundation.ProcessRegistry.Backend.Horde module, using Horde.Registry to back each of the Backend callbacks (a callback sketch follows this list). This will provide a distributed, fault-tolerant service registry.
  4. Refactor Core Services: Re-architect the single-process services (Coordination, Economics, etc.) to be distributed. This is the most significant part of the work. A good pattern would be:
    • Use Horde.DynamicSupervisor to start and supervise agents/sessions across the cluster.
    • Use Horde.Registry to register and discover these distributed processes.
    • The main service module (Foundation.MABEAM.Coordination) would then become a stateless facade that uses the distributed registry to find and communicate with the appropriate worker process, wherever it runs in the cluster (see the facade sketch after this list).
  5. Update Comms Module: Modify the Comms module to use the distributed ProcessRegistry to find agents (which could be on any node) and send messages to them.
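
Steps 1 and 2 are mostly mechanical. The sketch below shows the usual shape of the change; the version constraints and topology name are illustrative, and the Gossip strategy assumes nodes can reach each other via UDP multicast (cloud deployments typically need a different libcluster strategy, such as Kubernetes- or EPMD-based discovery).

```elixir
# In mix.exs (version constraints are illustrative):
defp deps do
  [
    {:libcluster, "~> 3.3"},
    {:horde, "~> 0.8"}
  ]
end

# In config/config.exs:
config :libcluster,
  topologies: [
    foundation_gossip: [
      strategy: Cluster.Strategy.Gossip
    ]
  ]

# In the application supervision tree:
children = [
  {Cluster.Supervisor,
   [Application.get_env(:libcluster, :topologies, []), [name: Foundation.ClusterSupervisor]]}
  # ... existing Foundation children ...
]
```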
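
For step 3, each callback body is typically a thin wrapper around Horde.Registry. The Backend behavior's actual callback names and signatures are not shown in this document, so register/2, lookup/1, and unregister/1 below are assumed shapes, and the registry process name is a placeholder.

```elixir
defmodule Foundation.ProcessRegistry.Backend.Horde do
  # Assumed callback shapes; align these with the real Backend behavior.
  # The registry process name below is also an assumption.
  @registry Foundation.ProcessRegistry.HordeRegistry

  # Horde.Registry.register/3 registers the calling process, mirroring the
  # built-in Elixir Registry API.
  def register(key, value) do
    case Horde.Registry.register(@registry, key, value) do
      {:ok, _owner} -> :ok
      {:error, {:already_registered, pid}} -> {:error, {:already_registered, pid}}
    end
  end

  def lookup(key) do
    case Horde.Registry.lookup(@registry, key) do
      [{pid, value}] -> {:ok, {pid, value}}
      [] -> {:error, :not_found}
    end
  end

  def unregister(key) do
    Horde.Registry.unregister(@registry, key)
  end
end
```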
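
Steps 4 and 5 then share one mechanism: the facade (or the Comms module) never holds a worker's pid itself; it asks the distributed registry each time. The sketch below is a placeholder illustration of that pattern, not the actual Foundation.MABEAM.Coordination API, and it reuses the MyApp.* names from the supervision sketch above.

```elixir
# Placeholder worker and facade; names and messages are illustrative.
defmodule MyApp.SessionWorker do
  use GenServer

  def start_link(session_id, args) do
    GenServer.start_link(__MODULE__, args, name: via(session_id))
  end

  # Cluster-wide name: any node can resolve the session through Horde.Registry.
  def via(session_id),
    do: {:via, Horde.Registry, {MyApp.HordeRegistry, {:session, session_id}}}

  @impl true
  def init(_args), do: {:ok, []}

  @impl true
  def handle_call({:vote, ballot}, _from, state) do
    {:reply, :ok, [ballot | state]}
  end
end

defmodule MyApp.CoordinationFacade do
  # Stateless facade: finds the session wherever Horde placed it and calls it.
  def vote(session_id, ballot) do
    GenServer.call(MyApp.SessionWorker.via(session_id), {:vote, ballot})
  end
end
```

The Comms update in step 5 is essentially the same lookup-then-call pattern applied to agent messaging.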

Conclusion

The Foundation codebase is architecturally prepared for distributed computing, but it is not distributed yet. The architects have clearly planned for a distributed future by creating the right abstractions (pluggable backends, service-oriented design). The path to a fully distributed system is clear, but it requires implementing the Horde backend and refactoring the core stateful services to use Horde for distributed supervision and registration. The codebase is in an excellent position to make this transition smoothly.