030 GEMINI REVIEW 20250625 1630 02 SINGLETON TO DISTRO

Documentation for 030_GEMINI_REVIEW_20250625_1630_02_SINGLETON_TO_DISTRO from the Foundation repository.

Excellent question. This gets to the core of the “pay now vs. pay later” debate in software architecture. You’ve correctly identified that the Coordination and Economics services are the primary hurdles for distribution.

Here is a detailed case study of what migrating these singleton services would involve, directly addressing your questions about how difficult the work would be and whether the path to distribution is reasonable.

Case Study: Migrating Singleton Services to a Distributed Architecture

Executive Summary

The current architecture provides a sufficiently reasonable path to distribution. While the refactoring effort is non-trivial, it is not significantly harder than building it distributed from the start. The existing abstractions, particularly the ProcessRegistry backend, are a crucial down payment on future distribution that localizes the required changes. Migrating later is a sound strategy that avoids premature complexity.


1. The “As-Is” Architecture: Singleton Manager

Let’s analyze how a service like Foundation.MABEAM.Coordination works now.

  • Structure: It’s a single, globally registered GenServer.

  • Process:

    1. A client calls Coordination.start_byzantine_consensus(proposal, agents).
    2. This is a GenServer.call to the single Coordination process.
    3. The Coordination process receives the call, generates a new session ID, and starts a new session process (e.g., a “ConsensusSession” process) under its own local supervision.
    4. It stores the session_id -> session_pid mapping in its own state.
    5. The Coordination GenServer returns the session_id to the client.
    6. The session process then manages the actual consensus algorithm, communicating with the specified agents.
  • Limitations (in a distributed context):

    • Single Point of Failure: If the node hosting the Coordination GenServer fails, the entire coordination system is down. No new sessions can be started.
    • Resource Hotspot: All session processes are spawned on the same node as the Coordination GenServer, concentrating CPU and memory load.
    • State is Centralized: The list of all active sessions lives in the state of one process on one node.
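
Condensed into code, the as-is design looks roughly like the sketch below. It is illustrative and simplified (local supervision of the session processes is elided), but it makes the centralization concrete: one globally named process holding one map of sessions.

    defmodule Foundation.MABEAM.Coordination do
      use GenServer

      # A single, globally registered process
      def start_link(opts \\ []) do
        GenServer.start_link(__MODULE__, opts, name: __MODULE__)
      end

      def start_byzantine_consensus(proposal, agents) do
        GenServer.call(__MODULE__, {:start_byzantine_consensus, proposal, agents})
      end

      @impl true
      def init(_opts), do: {:ok, %{active_sessions: %{}}}

      @impl true
      def handle_call({:start_byzantine_consensus, proposal, agents}, _from, state) do
        session_id = generate_session_id()

        # The session is started locally (in the real module, under a local supervisor)
        {:ok, pid} = ConsensusSession.start_link(session_id, proposal, agents)

        # The session_id -> pid mapping lives only in this one process's state
        new_state = put_in(state.active_sessions[session_id], pid)
        {:reply, {:ok, session_id}, new_state}
      end

      defp generate_session_id, do: "session-#{System.unique_integer([:positive])}"
    end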

2. The Migration Path: Deconstructing the Singleton with Horde

The migration path involves transforming the singleton GenServer into a stateless facade and distributing the work it used to manage. We will use Horde as the target technology, as the codebase anticipates.

Step 0: Foundational Setup (The “Easy” Part)

This is the groundwork, and the current architecture makes this straightforward.

  1. Add Dependencies: Add {:horde, "~> 0.8"} and {:libcluster, "~> 3.3"} to mix.exs.
  2. Configure Clustering: Configure libcluster in config/runtime.exs with a topology strategy (e.g., Cluster.Strategy.Gossip) so nodes can form a cluster automatically.
  3. Implement the ProcessRegistry Horde Backend: This is the most important prerequisite. The ProcessRegistry.Backend.Horde placeholder needs to be implemented using Horde.Registry. This single change will make agent discovery (Agent.find_by_capability/1, etc.) work across the entire cluster.
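
For orientation, a minimal sketch of this wiring is shown below. The dependency versions come from the list above; the topology name, the Gossip defaults, and the Foundation.ClusterSupervisor name are illustrative assumptions rather than code from the repository.

    # mix.exs - add the two new dependencies
    defp deps do
      [
        {:horde, "~> 0.8"},
        {:libcluster, "~> 3.3"}
      ]
    end

    # config/runtime.exs - let libcluster form the cluster automatically
    import Config

    config :libcluster,
      topologies: [
        foundation: [strategy: Cluster.Strategy.Gossip]
      ]

    # application.ex - run the clustering supervisor on every node
    children = [
      {Cluster.Supervisor,
       [Application.get_env(:libcluster, :topologies, []), [name: Foundation.ClusterSupervisor]]}
      # ... existing Foundation children ...
    ]

The ProcessRegistry.Backend.Horde implementation itself is deliberately not sketched here: its callbacks are dictated by the existing backend behaviour in the repository, which is exactly why this prerequisite stays so well contained.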

Step 1: The Core Refactor (The “Hard” Part)

This is the main effort, focused on refactoring Foundation.MABEAM.Coordination.

  1. Transform the GenServer into a Facade:

    • The Foundation.MABEAM.Coordination module will cease to be a GenServer. It becomes a standard, stateless module.
    • The use GenServer line is removed. init/1, handle_call/3, etc., are removed.
  2. Introduce Distributed Supervision:

    • The application’s main supervisor will now start Horde.DynamicSupervisor instead of the old singleton Coordination GenServer.
    • The start_byzantine_consensus/2 function in the new Coordination facade will change drastically.

    Before (Singleton):

    def handle_call({:start_byzantine_consensus, ...}, _from, state) do
      # ... logic to start a local process ...
      {:reply, {:ok, session_id}, new_state}
    end
    

    After (Distributed Facade):

    def start_byzantine_consensus(proposal, agents) do
      session_id = generate_session_id()
      child_spec = %{
        id: {:session, session_id},
        start: {ConsensusSession, :start_link, [session_id, proposal, agents]}
      }
    
      # Tell Horde to start the session process on the best available node;
      # match on the result so a failed placement is not silently swallowed
      {:ok, _pid} =
        Horde.DynamicSupervisor.start_child(MyCoordinationSupervisor, child_spec)
    
      {:ok, session_id}
    end
    

    Horde.DynamicSupervisor automatically handles distributing the ConsensusSession processes across the cluster, providing immediate load balancing and fault tolerance.
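
    For this to work, the application's supervision tree must start the Horde processes on every node. A minimal sketch follows, reusing the MyCoordinationSupervisor and MySessionRegistry names from these snippets; the members: :auto option and the Foundation.Supervisor name are assumptions, not code from the repository.

    children = [
      # Cluster-wide registry in which session processes register themselves
      {Horde.Registry, name: MySessionRegistry, keys: :unique, members: :auto},
      # Cluster-wide dynamic supervisor that places sessions on any node
      {Horde.DynamicSupervisor,
       name: MyCoordinationSupervisor,
       strategy: :one_for_one,
       members: :auto}
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: Foundation.Supervisor)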

  3. Distribute the State:

    • The singleton’s state was a map of active sessions: %{session_id => session_pid}. This map no longer exists in one place.
    • Each ConsensusSession process, when it starts, is now responsible for registering itself in a distributed registry.

    Inside ConsensusSession.init/1:

    def init({session_id, proposal, agents}) do
      # Register this session so it can be found from any node. Horde records
      # the calling pid automatically; the third argument is optional metadata.
      Horde.Registry.register(MySessionRegistry, session_id, %{})
      # ... rest of the init logic ...
      {:ok, %{session_id: session_id, proposal: proposal, agents: agents}}
    end
    
  4. Update State-Dependent Logic:

    • Functions like get_session_status/1 or list_active_sessions/0 need to be rewritten.
    • Instead of looking in the GenServer’s state, they will now query the Horde.Registry.

    Before (Singleton):

    def handle_call(:list_active_sessions, _from, state) do
      active_ids = Map.keys(state.active_sessions)
      {:reply, {:ok, active_ids}, state}
    end
    

    After (Distributed Facade):

    def list_active_sessions() do
      # Query the distributed registry for every registered session id.
      # Registry entries are {key, pid, value}; the spec selects just the key.
      sessions = Horde.Registry.select(MySessionRegistry, [{{:"$1", :_, :_}, [], [:"$1"]}])
      {:ok, sessions}
    end
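
    Point lookups follow the same pattern. Below is a sketch of how get_session_status/1 could be reimplemented, assuming each ConsensusSession process answers a :get_status call (that message name is an illustrative assumption):

    def get_session_status(session_id) do
      # Find the session process wherever it is currently running in the cluster
      case Horde.Registry.lookup(MySessionRegistry, session_id) do
        [{pid, _metadata}] -> GenServer.call(pid, :get_status)
        [] -> {:error, :not_found}
      end
    end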
    

3. Verdict: Is the Path Reasonable?

Yes, the path is eminently reasonable. The architects made the correct trade-off.

  • Why it’s not significantly harder later: The existing abstractions create clear “seams” for the refactor.

    1. Agent Discovery is Solved: The ProcessRegistry abstraction means you only need to implement the Horde backend. None of the code that finds agents (such as Comms) needs to change. This is a massive win.
    2. The Work is Localized: The refactoring effort is almost entirely contained within the singleton modules (Coordination, Economics) and the main application supervisor. The core business logic inside the session processes remains largely the same.
    3. Avoided Premature Complexity: By not building distributed from day one, the team was able to develop the core coordination and economic logic in a much simpler, easier-to-debug environment. They could focus on the “what” before the “where”.
  • What makes it “non-trivial”?

    1. Shift in Mindset: The developer must shift from thinking about a single, authoritative state to a distributed, queryable state.
    2. New Failure Modes: You now have to contend with network partitions, node failures, and the other realities behind the classic fallacies of distributed computing.
    3. Implementation Effort: The core logic inside the singleton services must be rewritten as described above.

Summary of Changes: Singleton vs. Distributed

  • Process Supervision
    • Singleton (Current): A single Coordination GenServer starts all session processes locally.
    • Distributed (Future): Horde.DynamicSupervisor starts session processes on any node in the cluster.
  • State Management
    • Singleton (Current): A map in the Coordination GenServer’s state holds all active sessions.
    • Distributed (Future): No central state. Each session registers itself in Horde.Registry. State is discovered by querying the registry.
  • Process Discovery
    • Singleton (Current): Map.get(state.active_sessions, session_id)
    • Distributed (Future): Horde.Registry.lookup(MySessionRegistry, session_id)
  • Public API
    • Singleton (Current): Foundation.MABEAM.Coordination is a GenServer that clients call.
    • Distributed (Future): Foundation.MABEAM.Coordination is a stateless module (facade) that clients call.
  • Fault Tolerance
    • Singleton (Current): If the Coordination node fails, the entire system stops.
    • Distributed (Future): If a node with a session process fails, Horde can restart it on another node. New sessions can be started on any healthy node.
  • Complexity
    • Singleton (Current): Low. Easy to reason about and debug.
    • Distributed (Future): High. Requires understanding distributed systems concepts.

The current design is a testament to good API abstraction and planning for the future without over-engineering the present. It’s a solid foundation to build upon.