Excellent question. This gets to the core of the “pay now vs. pay later” debate in software architecture. You’ve correctly identified that the Coordination
and Economics
services are the primary hurdles for distribution.
Here is a detailed case study on what migrating these singleton services would involve, directly addressing your questions about difficulty and path reasonableness.
Case Study: Migrating Singleton Services to a Distributed Architecture
Executive Summary
The current architecture provides a sufficiently reasonable path to distribution. While the refactoring effort is non-trivial, it is not significantly harder than building it distributed from the start. The existing abstractions, particularly the ProcessRegistry
backend, are a crucial down payment on future distribution that localizes the required changes. Migrating later is a sound strategy that avoids premature complexity.
1. The “As-Is” Architecture: Singleton Manager
Let’s analyze how a service like Foundation.MABEAM.Coordination
works now.
Structure: It’s a single, globally registered
GenServer
.Process:
- A client calls
Coordination.start_byzantine_consensus(proposal, agents)
. - This is a
GenServer.call
to the singleCoordination
process. - The
Coordination
process receives the call, generates a new session ID, and starts a new session process (e.g., a “ConsensusSession” process) under its own local supervision. - It stores the
session_id -> session_pid
mapping in its own state. - The
Coordination
GenServer
returns thesession_id
to the client. - The session process then manages the actual consensus algorithm, communicating with the specified agents.
- A client calls
Limitations (in a distributed context):
- Single Point of Failure: If the node hosting the
Coordination
GenServer
fails, the entire coordination system is down. No new sessions can be started. - Resource Hotspot: All session processes are spawned on the same node as the
Coordination
GenServer, concentrating CPU and memory load. - State is Centralized: The list of all active sessions lives in the state of one process on one node.
- Single Point of Failure: If the node hosting the
2. The Migration Path: Deconstructing the Singleton with Horde
The migration path involves transforming the singleton GenServer
into a stateless facade and distributing the work it used to manage. We will use Horde
as the target technology, as the codebase anticipates.
Step 0: Foundational Setup (The “Easy” Part)
This is the groundwork, and the current architecture makes this straightforward.
- Add Dependencies: Add
{:horde, "~> 0.8"}
and{:libcluster, "~> 3.3"}
tomix.exs
. - Configure Clustering: Configure
libcluster
inconfig/runtime.exs
with a topology strategy (e.g.,Cluster.Strategy.Gossip
) so nodes can form a cluster automatically. - Implement the
ProcessRegistry
Horde Backend: This is the most important prerequisite. TheProcessRegistry.Backend.Horde
placeholder needs to be implemented usingHorde.Registry
. This single change will make agent discovery (Agent.find_by_capability/1
, etc.) work across the entire cluster.
Step 1: The Core Refactor (The “Hard” Part)
This is the main effort, focused on refactoring Foundation.MABEAM.Coordination
.
Transform the GenServer into a Facade:
- The
Foundation.MABEAM.Coordination
module will cease to be aGenServer
. It becomes a standard, stateless module. - The
use GenServer
line is removed.init/1
,handle_call/3
, etc., are removed.
- The
Introduce Distributed Supervision:
- The application’s main supervisor will now start
Horde.DynamicSupervisor
instead of the old singletonCoordination
GenServer. - The
start_byzantine_consensus/2
function in the newCoordination
facade will change drastically.
Before (Singleton):
def handle_call({:start_byzantine_consensus, ...}, _from, state) do # ... logic to start a local process ... {:reply, {:ok, session_id}, new_state} end
After (Distributed Facade):
def start_byzantine_consensus(proposal, agents) do session_id = generate_session_id() child_spec = %{ id: {:session, session_id}, start: {ConsensusSession, :start_link, [session_id, proposal, agents]} } # Tell Horde to start the session process on the best available node Horde.DynamicSupervisor.start_child(MyCoordinationSupervisor, child_spec) {:ok, session_id} end
Horde.DynamicSupervisor
automatically handles distributing theConsensusSession
processes across the cluster, providing immediate load balancing and fault tolerance.- The application’s main supervisor will now start
Distribute the State:
- The singleton’s state was a map of active sessions:
%{session_id => session_pid}
. This map no longer exists in one place. - Each
ConsensusSession
process, when it starts, is now responsible for registering itself in a distributed registry.
Inside
ConsensusSession.init/1
:def init(session_id, proposal, agents) do # Register this session process so it can be found from any node Horde.Registry.register(MySessionRegistry, session_id, self()) # ... rest of the init logic ... end
- The singleton’s state was a map of active sessions:
Update State-Dependent Logic:
- Functions like
get_session_status/1
orlist_active_sessions/0
need to be rewritten. - Instead of looking in the
GenServer
’s state, they will now query theHorde.Registry
.
Before (Singleton):
def handle_call(:list_active_sessions, _from, state) do active_ids = Map.keys(state.active_sessions) {:reply, {:ok, active_ids}, state} end
After (Distributed Facade):
def list_active_sessions() do # Query the distributed registry to get all registered sessions sessions = Horde.Registry.select(MySessionRegistry, [{{:"$1", :_}, [], [:"$1"]}]) {:ok, sessions} end
- Functions like
3. Verdict: Is the Path Reasonable?
Yes, the path is eminently reasonable. The architects made the correct trade-off.
Why it’s not significantly harder later: The existing abstractions create clear “seams” for the refactor.
- Agent Discovery is Solved: The
ProcessRegistry
abstraction means you only need to implement the Horde backend. All code that finds agents (likeComms
) doesn’t need to change. This is a massive win. - The Work is Localized: The refactoring effort is almost entirely contained within the singleton modules (
Coordination
,Economics
) and the main application supervisor. The core business logic inside the session processes remains largely the same. - Avoided Premature Complexity: By not building distributed from day one, the team was able to develop the core coordination and economic logic in a much simpler, easier-to-debug environment. They could focus on the “what” before the “where”.
- Agent Discovery is Solved: The
What makes it “non-trivial”?
- Shift in Mindset: The developer must shift from thinking about a single, authoritative state to a distributed, queryable state.
- New Failure Modes: You now have to consider network partitions and other distributed fallacies.
- Implementation Effort: The core logic inside the singleton services must be re-written as described above.
Summary of Changes: Singleton vs. Distributed
Aspect | Singleton Architecture (Current) | Distributed Architecture (Future) |
---|---|---|
Process Supervision | A single Coordination GenServer starts all session processes locally. | Horde.DynamicSupervisor starts session processes on any node in the cluster. |
State Management | A map in the Coordination GenServer’s state holds all active sessions. | No central state. Each session registers itself in Horde.Registry . State is discovered by querying the registry. |
Process Discovery | Map.get(state.active_sessions, session_id) | Horde.Registry.lookup(MySessionRegistry, session_id) |
Public API | Foundation.MABEAM.Coordination is a GenServer that clients call . | Foundation.MABEAM.Coordination is a stateless module (facade) that clients call. |
Fault Tolerance | If the Coordination node fails, the entire system stops. | If a node with a session process fails, Horde can restart it on another node. New sessions can be started on any healthy node. |
Complexity | Low. Easy to reason about and debug. | High. Requires understanding distributed systems concepts. |
The current design is a testament to good API abstraction and planning for the future without over-engineering the present. It’s a solid foundation to build upon.