Codebase Design: Intelligent Service Discovery
1. Overview
The purpose of intelligent service discovery within Foundation 2.0 is to provide a robust, dynamic, and cluster-aware mechanism for locating and connecting to services. This goes beyond simple node discovery by incorporating service capabilities, health status, versioning, and potentially network topology or load considerations. The key module responsible for this functionality is `Foundation.Distributed.Discovery`, which will leverage Partisan for inter-node communication.
2. The `Foundation.Distributed.Discovery` Module
Purpose: This module acts as a distributed registry and query engine for services running within the Foundation cluster. It manages the lifecycle of service registrations, provides a rich query interface for discovering services, and incorporates health and capability awareness.
State Management (GenServer `defstruct` from `FOUNDATION2_04_PARTISAN_DISTRO_REVO.md`): The GenServer will manage several pieces of state, likely backed by ETS tables for concurrent access and efficiency, especially in a busy cluster.

```elixir
defmodule Foundation.Distributed.Discovery do
  use GenServer

  defstruct [
    :discovery_strategies, # List of atoms like [:local, :cluster, :external], defining search order/scope.
    :service_cache,        # Name of ETS table: {service_id_tuple, node_registered_on} -> service_info_map.
                           # service_id_tuple = {service_name, service_instance_id_or_node}
    :health_monitors,      # ETS table or map: service_id_tuple -> health_check_pid_or_ref (for active checks)
                           # or last_heartbeat_timestamp.
    :capability_index,     # Name of ETS table: capability_atom -> list_of_service_id_tuples.
    :subscribers           # ETS table or map: subscription_criteria_hash -> list_of_subscriber_pids.
  ]

  # ... GenServer callbacks ...
end
```
Note on ETS tables: Using named ETS tables owned by the GenServer process allows other processes (e.g., health checkers) to read concurrently, while the GenServer remains the primary writer and coordinator. If helper processes need to write directly, the tables could be created as `:public` instead of `:protected`.
Proposed Core Functions (Elixir syntax, based on `FOUNDATION2_04_PARTISAN_DISTRO_REVO.md`):

```elixir
defmodule Foundation.Distributed.Discovery do
  use GenServer

  @doc """
  Starts the Service Discovery manager GenServer.
  Initializes ETS tables and loads configured discovery strategies.
  """
  def start_link(opts \\ []) do
    # Default strategies: [:local, :cluster]
    # Opts could override strategies or provide ETS table names/options.
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(opts) do
    # Initialize ETS tables
    service_cache_tid =
      :ets.new(:service_cache, [:set, :protected, :named_table, read_concurrency: true])

    # :bag allows multiple services per capability
    capability_index_tid =
      :ets.new(:capability_index, [:bag, :protected, :named_table, read_concurrency: true])

    health_monitors_tid =
      :ets.new(:health_monitors, [:set, :protected, :named_table, read_concurrency: true])

    subscribers_tid =
      :ets.new(:discovery_subscribers, [:set, :protected, :named_table, read_concurrency: true])

    state = %{
      discovery_strategies: Keyword.get(opts, :discovery_strategies, [:local, :cluster]),
      service_cache: service_cache_tid,
      health_monitors: health_monitors_tid,
      capability_index: capability_index_tid,
      subscribers: subscribers_tid
    }

    {:ok, state}
  end

  @doc """
  Registers a service with its name, PID, capabilities, and other metadata.
  The service is registered on the local node and this registration is broadcast
  to other discovery managers in the cluster.
  """
  @spec register_service(atom(), pid(), [atom()], map()) :: :ok | {:error, any()}
  def register_service(service_name, pid, capabilities, metadata \\ %{}) do
    node = Node.self()
    # Assuming one instance of a named service per node for simplicity here. Could be more complex.
    service_id = {service_name, node}

    service_info = %{
      name: service_name,
      node: node,
      pid: pid,
      capabilities: Enum.uniq(capabilities),
      version: Map.get(metadata, :version),
      metadata: metadata,
      health_status: :unknown, # Initial status, to be updated by health monitoring
      registered_at: System.monotonic_time(:millisecond)
    }

    GenServer.call(__MODULE__, {:register_service, service_id, service_info})
  end

  @doc """
  Deregisters a service instance previously registered with `service_name` on the given `node`.
  Typically called when a service shuts down gracefully.
  """
  @spec deregister_service(atom(), atom()) :: :ok | {:error, any()}
  def deregister_service(service_name, node \\ Node.self()) do
    service_id = {service_name, node}
    GenServer.call(__MODULE__, {:deregister_service, service_id})
  end

  @doc """
  Discovers services based on a set of criteria. Criteria can include :service_name,
  :capabilities (list, all must match), :health_status, :version_match (string or regex),
  :node, etc.
  """
  @spec discover_services(Keyword.t() | map()) :: {:ok, [map()]} | {:error, any()}
  def discover_services(criteria) do
    # This call is handled by the GenServer, which implements the discover_with_criteria/2 logic.
    GenServer.call(__MODULE__, {:discover_services, criteria})
  end

  @doc """
  Gets the current health status of a specific service instance.
  """
  @spec get_service_health(atom(), atom()) :: {:ok, atom()} | {:error, any()}
  def get_service_health(service_name, node) do
    service_id = {service_name, node}
    GenServer.call(__MODULE__, {:get_service_health, service_id})
  end

  @doc """
  Subscribes the calling process to receive messages when services matching the given
  criteria are registered or deregistered, or their health status changes.
  """
  @spec subscribe(Keyword.t() | map()) :: :ok
  def subscribe(criteria) do
    # The GenServer will store {hashed_criteria, self()} in the :subscribers table/map.
    GenServer.cast(__MODULE__, {:subscribe, criteria, self()})
  end

  # ... GenServer handle_call/handle_cast implementations ...
  # handle_call for :register_service would write to ETS, update the capability index, and broadcast.
  # handle_call for :discover_services would query the ETS tables based on criteria.
end
```
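For illustration, a minimal usage sketch of this API, assuming the module above is started under the application's supervision tree (the service name, capabilities, and metadata values are hypothetical):

```elixir
# Register the local :my_db_server process with a couple of capability tags.
:ok =
  Foundation.Distributed.Discovery.register_service(
    :my_db_server,
    self(),
    [:read_replica, :timeseries_optimized],
    %{version: "1.2.3", region: "us-east-1"}
  )

# Later, a client looks up healthy read replicas anywhere in the cluster.
{:ok, candidates} =
  Foundation.Distributed.Discovery.discover_services(
    capabilities: [:read_replica],
    health_status: :healthy
  )
```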
3. Data Structures
Service Information (`service_info_map` stored in the `:service_cache` ETS table): The value associated with a `service_id` key in the `:service_cache`.

```elixir
%{
  name: :my_db_server,              # Atom: The logical name of the service type
  instance_id: "worker_73a",        # Optional: A unique ID for this instance if multiple of the same name exist on a node
  node: :"node1@example.com",       # Atom: The node where the service is running
  pid: #PID<0.1234.0>,              # PID: The process identifier of the service
  capabilities: [:read_replica, :ha, :timeseries_optimized], # List of atoms: Features this service instance offers
  version: "1.2.3",                 # String: Version of the service software
  metadata: %{                      # Map: Application-specific metadata
    region: "us-east-1",
    load_factor: 0.3,
    shards_managed: [1, 5, 7]
  },
  health_status: :healthy,          # Atom: e.g., :healthy, :degraded, :unhealthy, :unknown
  registered_at: 1678886400123,     # Integer: System.monotonic_time(:millisecond) at registration
  last_health_update: 1678886460345 # Integer: Timestamp of last health status update
}
```
Capability Index (`:capability_index` ETS table, type `:bag`):

- Key: `capability_atom` (e.g., `:read_replica`)
- Value: `service_id` (e.g., `{{:my_db_server, "worker_73a"}, :"node1@example.com"}`)

A `:bag` table allows multiple services to be associated with the same capability.
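As a brief sketch of how the `:bag` index might be queried inside the Discovery GenServer (assuming the named ETS tables created in `init/1` above; `lookup_by_capabilities/1` is a hypothetical private helper, not part of the documented API):

```elixir
# Services offering ALL of the requested capabilities: intersect the per-capability lookups.
# Assumes a non-empty capability list.
defp lookup_by_capabilities(capabilities) do
  capabilities
  |> Enum.map(fn capability ->
    :capability_index
    |> :ets.lookup(capability)                          # [{capability, service_id}, ...]
    |> MapSet.new(fn {_capability, service_id} -> service_id end)
  end)
  |> Enum.reduce(&MapSet.intersection/2)
end
```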
4. Service Registration and Deregistration
Registration (`register_service/4`):
- The calling service invokes `Foundation.Distributed.Discovery.register_service(...)` on its local node.
- The local `Foundation.Distributed.Discovery` GenServer receives the request.
- It validates the data and inserts/updates the `service_info_map` in its local `:service_cache` ETS table.
- For each capability in `service_info.capabilities`, it adds an entry to the local `:capability_index` ETS table: `{capability, service_id}`.
- The local manager then broadcasts the complete `service_info_map` to all other known `Foundation.Distributed.Discovery` managers in the cluster (e.g., using Partisan broadcast on a dedicated `:discovery_events` channel).
- Remote `Foundation.Distributed.Discovery` instances receive this broadcast and update their local caches and indices accordingly.
- Relevant subscribers are notified (see the sketch after this list).
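A minimal sketch of the corresponding `handle_call/3` clause, assuming the ETS tables from `init/1` above; `broadcast_discovery_event/1` (wrapping the Partisan `:discovery_events` channel) and `notify_subscribers/2` are hypothetical helpers, as the exact broadcast and notification APIs are not specified here:

```elixir
@impl true
def handle_call({:register_service, service_id, service_info}, _from, state) do
  # Insert/overwrite the local cache entry for this service instance.
  true = :ets.insert(state.service_cache, {service_id, service_info})

  # Index every capability so discovery can start from the :bag table.
  for capability <- service_info.capabilities do
    :ets.insert(state.capability_index, {capability, service_id})
  end

  # Propagate the registration to the other discovery managers in the cluster
  # and notify local subscribers whose criteria match.
  broadcast_discovery_event({:service_registered, service_id, service_info})
  notify_subscribers(state.subscribers, {:service_registered, service_info})

  {:reply, :ok, state}
end
```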
Deregistration (`deregister_service/2`):
- Follows a similar pattern: local ETS removal, then broadcast of the `service_id` to be deregistered.
- Remote instances remove the service from their caches and capability index.
- Relevant subscribers are notified.
Handling Node Failures (Partisan Integration):
- `Foundation.Distributed.Discovery` (or a dedicated process it supervises) should subscribe to Partisan's node membership events (e.g., `node_down` from `Foundation.BEAM.Distribution`).
- Upon receiving a `node_down` event for a particular node, the local Discovery manager iterates through its `:service_cache` and removes all services registered from that downed node.
- These local removals also update the `:capability_index`.
- This cleanup is local, based on the Partisan event; no separate broadcast is needed, since all nodes receive the same Partisan `node_down` event and perform the cleanup independently. Subscribers are notified of the effective deregistration (a sketch follows this list).
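A sketch of that cleanup as a `handle_info/2` clause. The `{:node_down, node}` message shape is an assumption about how `Foundation.BEAM.Distribution`/Partisan delivers membership events, and `notify_subscribers/2` is the same hypothetical helper used above:

```elixir
@impl true
def handle_info({:node_down, down_node}, state) do
  # Find every service instance that was registered from the downed node.
  # Cache key shape: {service_name, node}; match_object returns the full {key, info} pairs.
  stale_entries = :ets.match_object(state.service_cache, {{:_, down_node}, :_})

  for {service_id, service_info} <- stale_entries do
    :ets.delete(state.service_cache, service_id)

    # Remove the corresponding capability index entries.
    for capability <- service_info.capabilities do
      :ets.delete_object(state.capability_index, {capability, service_id})
    end

    notify_subscribers(state.subscribers, {:service_deregistered, service_info})
  end

  {:noreply, state}
end
```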
5. Discovery Logic (`discover_with_criteria/2`, implemented in the GenServer)
The `discover_services` call triggers logic that queries the local ETS tables:
Initial Set from Name/Capability:
- If `criteria[:service_name]` is provided, fetch services matching that name.
- If `criteria[:capabilities]` are provided, use the `:capability_index` to get a list of services for each capability. Find the intersection of these lists (services that have ALL specified capabilities).
- If both are provided, use the intersection of results from the name lookup and the capability index.
- If neither is provided, start with a broader list (e.g., all services, though this should be discouraged for performance).
Filtering by Other Criteria:
- Iterate through the initial set of `service_info_map`s.
- Health Status: Filter out services whose `:health_status` does not match `criteria[:health_status]` (defaulting to `:healthy` if not specified).
- Version: If `criteria[:version_match]` is given, filter by string equality or regex match against `service_info.version`.
- Node: If `criteria[:node]` is given, filter for services on that specific node.
- Metadata: If `criteria[:metadata_match]` (e.g., `%{region: "us-east-1"}`) is provided, perform sub-map matching against `service_info.metadata`.
Topology/Proximity Awareness (Advanced):
- After basic filtering, if multiple candidates remain, apply proximity rules:
  - Prefer services on `Node.self()`.
  - If network topology information is available (e.g., from `Foundation.Distributed.Topology` or external config like AWS AZ), prefer services in the same logical grouping (rack, AZ, region).
  - This requires `service_info.node` and potentially `service_info.metadata` (e.g., `region`) to be queryable.
Load Balancing Integration (Optional):
- If multiple suitable, healthy, and proximate services remain, a simple strategy is random selection.
- For more advanced load balancing:
  - The `service_info.metadata` might include a `load_factor` or `queue_depth`. `Foundation.Distributed.Discovery` could select the instance with the lowest reported load.
  - Alternatively, it could return a list of suitable candidates to a client-side load balancer or a dedicated load balancing module. The mention of `:load_balancer` in `Foundation.Distributed.Channels` suggests that routing/load balancing might be a separate concern that `Discovery` could feed into.
Return: A list of `service_info_map`s for the selected services. A sketch of this filtering pipeline follows.
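A minimal sketch of `discover_with_criteria/2`, assuming the ETS tables from `init/1` and the hypothetical `lookup_by_capabilities/1` helper sketched earlier; version and metadata matching are simplified to equality checks, and topology/load-balancing steps are omitted:

```elixir
defp discover_with_criteria(criteria, state) do
  criteria = Map.new(criteria)

  # 1. Initial set of service_ids from name and/or capability index.
  initial_ids =
    case {criteria[:service_name], criteria[:capabilities]} do
      {nil, nil} ->
        # Discouraged for performance: scan every cached service.
        state.service_cache |> :ets.tab2list() |> Enum.map(fn {service_id, _info} -> service_id end)

      {name, nil} ->
        state.service_cache
        |> :ets.match_object({{name, :_}, :_})
        |> Enum.map(fn {service_id, _info} -> service_id end)

      {nil, capabilities} ->
        MapSet.to_list(lookup_by_capabilities(capabilities))

      {name, capabilities} ->
        capabilities
        |> lookup_by_capabilities()
        |> Enum.filter(fn {service_name, _node} -> service_name == name end)
    end

  # 2. Filter by health, version, node, and metadata.
  wanted_health = Map.get(criteria, :health_status, :healthy)

  services =
    initial_ids
    |> Enum.flat_map(fn id -> :ets.lookup(state.service_cache, id) end)
    |> Enum.map(fn {_id, info} -> info end)
    |> Enum.filter(fn info ->
      (wanted_health == :any or info.health_status == wanted_health) and
        criteria[:version_match] in [nil, info.version] and
        criteria[:node] in [nil, info.node] and
        match_metadata?(info.metadata, criteria[:metadata_match])
    end)

  {:ok, services}
end

# Sub-map matching: every key/value in the requested metadata must be present.
defp match_metadata?(_metadata, nil), do: true
defp match_metadata?(metadata, required), do: Enum.all?(required, fn {k, v} -> Map.get(metadata, k) == v end)
```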
6. Health Monitoring
Mechanism Options:
- Active Polling:
  - `Foundation.Distributed.Discovery` (or worker processes it spawns, stored in `:health_monitors`) periodically calls a standardized health check function on each registered service's `pid`.
  - Services can implement a `Foundation.Contracts.HealthCheck` behaviour (e.g., `check_health() -> :healthy | :degraded | :unhealthy`).
- Passive Heartbeating:
  - Registered services periodically send a heartbeat message (e.g., via `GenServer.cast` or `send`) to their local `Foundation.Distributed.Discovery` manager.
  - The manager tracks the `last_heartbeat_timestamp` in `:health_monitors`. If a heartbeat is missed for a configurable timeout, the service is marked `:unhealthy` or `:unknown`.
- A combination might be used (e.g., heartbeats, with polling as a fallback).
Updates: The health status is updated in the `:service_cache` for the specific service instance.
Event Broadcast: Changes in health status (especially to `:unhealthy` or back to `:healthy`) should be broadcast to other nodes so their caches are updated. This ensures all nodes have a relatively consistent view of service health.
Discovery Impact: Unhealthy services are typically filtered out from `discover_services` results unless the criteria explicitly request them (e.g., `health_status: :any` for diagnostic purposes).
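A minimal sketch of the passive-heartbeat path inside the Discovery GenServer. The `{:heartbeat, service_id}` cast, the `:check_heartbeats` tick, `@heartbeat_timeout_ms`, and `mark_unhealthy/2` are assumptions for illustration, not a documented protocol:

```elixir
@heartbeat_timeout_ms 15_000

@impl true
def handle_cast({:heartbeat, service_id}, state) do
  # Record the time of the most recent heartbeat for this service instance.
  :ets.insert(state.health_monitors, {service_id, System.monotonic_time(:millisecond)})
  {:noreply, state}
end

# Assumes an initial :check_heartbeats tick is scheduled in init/1.
@impl true
def handle_info(:check_heartbeats, state) do
  now = System.monotonic_time(:millisecond)

  for {service_id, last_seen} <- :ets.tab2list(state.health_monitors),
      now - last_seen > @heartbeat_timeout_ms do
    # Missed heartbeat window: mark the cached entry unhealthy and broadcast the change.
    mark_unhealthy(service_id, state)
  end

  Process.send_after(self(), :check_heartbeats, @heartbeat_timeout_ms)
  {:noreply, state}
end
```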
7. Scalability for Thousands of Nodes
- Distributed Cache & Local Reads: Each node having a local ETS cache of (most) services means discovery queries are primarily local and fast, avoiding a centralized bottleneck.
- Efficient Indexing: The `:capability_index` (and potentially other indices on name, etc.) significantly speeds up the initial filtering phase of discovery.
- Partisan for Eventual Consistency: Registration, deregistration, and health update broadcasts use Partisan channels. This makes the system eventually consistent. The choice of Partisan overlay and channel configuration will impact propagation delay.
- Challenges:
  - Cache Consistency Latency: Updates take time to propagate across a large cluster. There might be brief periods where different nodes have slightly different views of available services. This is typical for eventually consistent systems.
  - Broadcast Storms: Mass registration/deregistration events (e.g., during a large deployment or a network partition healing) could lead to a surge in broadcast messages.
    - Mitigation: Batch updates, use efficient Partisan broadcast overlays (like Plumtree via the `:pub_sub` topology if suitable for these events), and potentially use delta updates instead of full `service_info_map`s for health changes (see the sketch after this list).
  - ETS Table Size: On nodes with many services or in very large clusters, the ETS tables could grow large. Monitor memory usage. Partisan's `Node.list()` itself has scalability considerations.
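One possible shape for the delta-update mitigation: instead of re-broadcasting the full `service_info_map` on every health change, only the changed fields are sent and merged on receipt. The `{:discovery_event, ...}` envelope and message payloads below are assumptions, not a defined wire format:

```elixir
# Registration still broadcasts the full service_info_map:
#   {:service_registered, service_id, service_info}
# Health changes broadcast only the changed fields:
#   {:service_health_changed, service_id, %{health_status: :degraded, last_health_update: ts}}

@impl true
def handle_info({:discovery_event, {:service_health_changed, service_id, delta}}, state) do
  case :ets.lookup(state.service_cache, service_id) do
    # Merge the delta into the cached copy instead of replacing the whole entry.
    [{^service_id, info}] -> :ets.insert(state.service_cache, {service_id, Map.merge(info, delta)})
    # Unknown service: ignore, or request a full sync from the sender.
    [] -> :ok
  end

  {:noreply, state}
end
```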
8. Integration
- `Foundation.ServiceRegistry`: If `Foundation.ServiceRegistry` exists as a higher-level, simpler API (as hinted in `BATTLE_PLAN_LIBCLUSTER_PARTISAN.md`), it would likely be a client of `Foundation.Distributed.Discovery`, translating its API calls into criteria-based searches (a thin-wrapper sketch follows this list).
- Client Applications: Directly use `Foundation.Distributed.Discovery.discover_services/1` to find necessary backend services.
- `Foundation.BEAM.Distribution` / Partisan: Used for the underlying inter-node communication of registration/health broadcasts. Node up/down events from this layer are crucial for cleaning up stale service entries.
- `Foundation.Distributed.Topology`: Proximity-based service discovery would need to query the current topology or node metadata from this module.
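For illustration, a sketch of how such a thin wrapper might look; `Foundation.ServiceRegistry.lookup/2` and its option names are hypothetical, not a documented API:

```elixir
defmodule Foundation.ServiceRegistry do
  @moduledoc "Simplified, name-oriented facade over Foundation.Distributed.Discovery (sketch)."

  # Look up a single healthy instance of a named service, preferring the local node.
  def lookup(service_name, opts \\ []) do
    criteria = [
      service_name: service_name,
      health_status: Keyword.get(opts, :health_status, :healthy)
    ]

    case Foundation.Distributed.Discovery.discover_services(criteria) do
      {:ok, []} -> {:error, :not_found}
      {:ok, candidates} -> {:ok, Enum.find(candidates, hd(candidates), &(&1.node == Node.self()))}
      {:error, reason} -> {:error, reason}
    end
  end
end
```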
9. Open Questions / Future Work
- Consistency Models: Explore options for stronger consistency if required by certain applications, though this typically impacts scalability (e.g., using Raft/Paxos for service registration, which Partisan might support via specific channels/modules).
- Advanced Query Language: For more complex discovery needs, a more expressive query language (beyond simple keyword lists) could be designed (e.g., supporting OR conditions, range queries on metadata).
- External Discovery System Integration:
  - Push/Pull: Mechanisms to synchronize service information with external systems like Consul, etcd, or Kubernetes service discovery. Foundation services could be published externally, or external services could be made discoverable within Foundation.
- Service Versioning & Compatibility: More sophisticated handling of service versions, allowing clients to request services compatible with a specific version range (e.g., `~> 1.2`).
- Weighted/Priority-Based Discovery: Allow services to register with a weight or priority, influencing their selection during discovery, especially for load balancing.
- Security: Secure service registration and discovery (e.g., ACLs on who can register/discover certain services). Authenticated broadcasts.