# Foundation 2.0: Technical Specifications & Design Documents

## Document Set Overview
This document set provides the technical foundation for implementing Foundation 2.0 based on our synthesis of “Smart Facades on a Pragmatic Core” with intentionally leaky abstractions.
## 1. API Contract Specification

### Layer 1: Pragmatic Core APIs

#### Configuration System

```elixir
# Mortal Mode (Zero Config)
config :foundation, cluster: true

# Apprentice Mode (Simple Config)
config :foundation, cluster: :kubernetes
config :foundation, cluster: :consul
config :foundation, cluster: :dns
config :foundation, cluster: [strategy: :gossip, secret: "my-secret"]

# Wizard Mode (Full libcluster passthrough)
config :libcluster, topologies: [...] # Foundation detects and defers
```
#### Core Supervision Contract

When clustering is enabled, Foundation.Supervisor must provide these child processes (a wiring sketch follows the list):

- {Cluster.Supervisor, resolved_topology}
- {Phoenix.PubSub, name: Foundation.PubSub, adapter: Phoenix.PubSub.PG2}
- {Horde.Registry, name: Foundation.ProcessRegistry, keys: :unique, members: :auto}
- {Horde.DynamicSupervisor, name: Foundation.DistributedSupervisor, members: :auto}
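A minimal sketch of that tree, assuming a `Foundation.Supervisor` module and a `resolved_topology/0` helper (both illustrative; the contract above only fixes the child processes themselves):

```elixir
defmodule Foundation.Supervisor do
  # Sketch only: wires the four contracted children under one supervisor.
  use Supervisor

  def start_link(opts), do: Supervisor.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    children = [
      # libcluster expects [topologies, opts] as its child argument
      {Cluster.Supervisor, [resolved_topology(), [name: Foundation.ClusterSupervisor]]},
      {Phoenix.PubSub, name: Foundation.PubSub},
      {Horde.Registry, name: Foundation.ProcessRegistry, keys: :unique, members: :auto},
      # Horde.DynamicSupervisor requires an explicit :strategy
      {Horde.DynamicSupervisor,
       name: Foundation.DistributedSupervisor, strategy: :one_for_one, members: :auto}
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end

  # Hypothetical helper: would resolve Mortal/Apprentice/Wizard config into a topology.
  defp resolved_topology, do: Application.get_env(:libcluster, :topologies, [])
end
```

With `members: :auto`, Horde tracks cluster membership itself, so no extra membership plumbing is needed here.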
### Layer 2: Smart Facade APIs

#### Process Management Facade

```elixir
# Primary API Contract
@spec start_singleton(module(), args :: list(), opts :: keyword()) ::
        {:ok, pid()} | {:error, term()}

@spec start_replicated(module(), args :: list(), opts :: keyword()) ::
        list({:ok, pid()} | {:error, term()})

@spec lookup_singleton(term()) :: {:ok, pid()} | :not_found

# Supported Options
opts :: [
  name: term(),                                   # Registry key (optional)
  restart: :permanent | :temporary | :transient,
  timeout: pos_integer() | :infinity
]
```
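A sketch of `start_singleton/3` as the thin, leaky wrapper ADR-001 calls for. Assumptions for illustration only: the registry key defaults to the module name, and the worker exposes `start_link/2` accepting `(args, opts)` so it can register via a `:via` tuple:

```elixir
defmodule Foundation.ProcessManager do
  # Sketch: a singleton is one Horde-supervised child registered under a unique key.
  def start_singleton(module, args, opts \\ []) do
    name = Keyword.get(opts, :name, module)
    via = {:via, Horde.Registry, {Foundation.ProcessRegistry, name}}

    child_spec = %{
      id: name,
      # Assumes the worker accepts a :name start option (e.g. a GenServer).
      start: {module, :start_link, [args, [name: via]]},
      restart: Keyword.get(opts, :restart, :permanent)
    }

    Horde.DynamicSupervisor.start_child(Foundation.DistributedSupervisor, child_spec)
  end

  def lookup_singleton(name) do
    case Horde.Registry.lookup(Foundation.ProcessRegistry, name) do
      [{pid, _value}] -> {:ok, pid}
      [] -> :not_found
    end
  end
end
```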
#### Service Discovery Facade

```elixir
# Primary API Contract
@spec register_service(name :: term(), pid(), capabilities :: list(), metadata :: map()) ::
        :ok | {:error, term()}

@spec discover_services(criteria :: keyword()) :: list(service_info())

@type service_info() :: %{
        name: term(),
        pid: pid(),
        node: atom(),
        capabilities: list(),
        metadata: map(),
        registered_at: integer()
      }

# Discovery Criteria
criteria :: [
  name: term(),
  capabilities: list(),
  node: atom(),
  health_check: (service_info() -> boolean())
]
```
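`discover_services/1` can likewise stay a thin layer over `Horde.Registry.select/2`. In this sketch, the match spec returns every `{key, pid, value}` entry, and the registered value is assumed to be a map carrying `capabilities`, `metadata`, and `registered_at`:

```elixir
# Sketch: pull every registry entry, then filter in plain Elixir.
def discover_services(criteria) do
  match_all = [{{:"$1", :"$2", :"$3"}, [], [{{:"$1", :"$2", :"$3"}}]}]

  Foundation.ProcessRegistry
  |> Horde.Registry.select(match_all)
  |> Enum.map(fn {name, pid, info} ->
    Map.merge(info, %{name: name, pid: pid, node: node(pid)})
  end)
  |> Enum.filter(fn service ->
    Enum.all?(criteria, fn
      {:name, name} -> service.name == name
      {:capabilities, caps} -> Enum.all?(caps, &(&1 in service.capabilities))
      {:node, node} -> service.node == node
      {:health_check, check} -> check.(service)
    end)
  end)
end
```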
#### Channel Communication Facade

```elixir
# Primary API Contract
@spec broadcast(channel :: atom(), message :: term(), opts :: keyword()) :: :ok
@spec subscribe(channel :: atom(), handler :: pid()) :: :ok
@spec route_message(message :: term(), opts :: keyword()) :: :ok

# Standard Channels
@type channel() :: :control | :events | :telemetry | :data

# Routing Options
opts :: [
  channel: channel(),
  priority: :low | :normal | :high,
  compression: boolean(),
  timeout: pos_integer()
]
```
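Per ADR-003 below, each channel is a namespaced Phoenix.PubSub topic. A minimal sketch of the facade (the topic names come from ADR-003; the rest is illustrative):

```elixir
defmodule Foundation.Channels do
  # Sketch: a channel is nothing more than a namespaced PubSub topic.
  @topics %{
    control: "foundation:control",
    events: "foundation:events",
    telemetry: "foundation:telemetry",
    data: "foundation:data"
  }

  def broadcast(channel, message, _opts \\ []) do
    Phoenix.PubSub.broadcast(Foundation.PubSub, Map.fetch!(@topics, channel), message)
  end

  def subscribe(channel, _handler \\ self()) do
    # Phoenix.PubSub.subscribe/2 always subscribes the calling process.
    Phoenix.PubSub.subscribe(Foundation.PubSub, Map.fetch!(@topics, channel))
  end
end
```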
### Layer 3: Direct Tool Access

Foundation MUST provide these registered processes for direct access (usage example below):

- Foundation.PubSub (Phoenix.PubSub process)
- Foundation.ProcessRegistry (Horde.Registry process)
- Foundation.DistributedSupervisor (Horde.DynamicSupervisor process)
- Foundation.ClusterSupervisor (Cluster.Supervisor process)
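For example, both calls below bypass the facades entirely, using the tools' own public APIs against Foundation's registered names:

```elixir
# Query the underlying Horde registry directly...
[{pid, _value}] = Horde.Registry.lookup(Foundation.ProcessRegistry, :my_service)

# ...or subscribe to a Foundation topic with plain Phoenix.PubSub.
:ok = Phoenix.PubSub.subscribe(Foundation.PubSub, "foundation:events")
```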
## 2. Architecture Decision Records (ADRs)

### ADR-001: Leaky Abstractions by Design

Status: Accepted
Date: 2024-12-XX
Deciders: Claude & Gemini Synthesis

#### Context

We need to decide whether Foundation's facades should hide or expose the underlying tool implementations.

#### Decision

We will implement intentionally leaky abstractions that celebrate rather than hide the underlying tools.

#### Rationale

- Debuggability: Developers can understand what's happening under the hood
- No Lock-in: Easy to drop to the tool level when needed
- Learning Path: Smooth progression from simple to advanced usage
- Community Leverage: Builds on existing knowledge rather than replacing it

#### Implementation

- All facades are thin, stateless wrappers
- Facade source code clearly shows the underlying tool calls
- Direct tool access is always available and documented
- Error messages reference both the facade and the underlying tool (illustrated below)
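As an illustration of the last point, a hypothetical error-translation helper (the wording and shape are not a committed API):

```elixir
# Sketch: surface both the facade and the Horde result in one message.
defp translate_error({:error, {:already_started, pid}}) do
  {:error,
   "Foundation.ProcessManager.start_singleton/3: underlying Horde.DynamicSupervisor " <>
     "returned {:already_started, #{inspect(pid)}}"}
end

defp translate_error(other), do: other
```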
### ADR-002: Three-Layer Configuration Model

Status: Accepted
Date: 2024-12-XX

#### Context

We need a configuration system that serves beginners and experts equally well.

#### Decision

Implement the Mortal/Apprentice/Wizard three-tier configuration system.

#### Rationale

- Progressive Disclosure: Start simple, add complexity as needed
- Migration Friendly: Wizard mode preserves existing libcluster configs
- Environment Appropriate: Different complexity levels for different deployment scenarios

#### Implementation

```elixir
# Mortal: Foundation controls everything
config :foundation, cluster: true

# Apprentice: Foundation translates high-level config
config :foundation, cluster: :kubernetes

# Wizard: Foundation defers to the existing libcluster config
config :libcluster, topologies: [...]
```
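Wizard-mode detection can be a single check for a user-supplied libcluster config. In this sketch, `translate/1` is a hypothetical helper covering the Mortal/Apprentice cases:

```elixir
# Sketch: prefer an explicit :libcluster config, otherwise translate Foundation's own setting.
defp resolve_topologies do
  case Application.get_env(:libcluster, :topologies) do
    nil -> translate(Application.get_env(:foundation, :cluster, false))
    topologies -> topologies # Wizard mode: defer entirely to the user's config
  end
end
```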
### ADR-003: Phoenix.PubSub for Application-Layer Channels

Status: Accepted
Date: 2024-12-XX

#### Context

Distributed Erlang multiplexes all inter-node traffic over a single connection per node pair, so large messages can cause head-of-line blocking. We need application-layer channel separation.

#### Decision

Use Phoenix.PubSub with standardized topic namespaces to create logical channels over Distributed Erlang.

#### Rationale

- Proven Solution: Phoenix.PubSub is battle-tested for distributed messaging
- Logical Separation: Different message types use different topics
- No Transport Changes: Works over standard Distributed Erlang
- Ecosystem Integration: Leverages existing Phoenix.PubSub knowledge

#### Implementation

```elixir
# Standard channels map to PubSub topics
:control   -> "foundation:control"
:events    -> "foundation:events"
:telemetry -> "foundation:telemetry"
:data      -> "foundation:data"
```
### ADR-004: Horde for Distributed Process Management

Status: Accepted
Date: 2024-12-XX

#### Context

We need distributed process registry and supervision capabilities.

#### Decision

Use Horde as the single source of truth for distributed processes, with facades providing common patterns.

#### Rationale

- CRDT Foundation: Horde's delta-CRDT approach handles network partitions well
- OTP Compatibility: Familiar APIs that match standard OTP patterns
- Active Maintenance: Well-maintained with good community support
- Proven Patterns: Existing successful deployments validate the approach

#### Implementation

- Foundation.ProcessRegistry = Horde.Registry instance
- Foundation.DistributedSupervisor = Horde.DynamicSupervisor instance
- Facades provide common patterns (singleton, replicated, partitioned; see the sketch below)
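A sketch of the replicated pattern (the `start_local/3` helper and the use of `:rpc` are assumptions, not a committed implementation):

```elixir
# Sketch: start one locally-supervised instance of the worker on every connected node.
def start_replicated(module, args, opts \\ []) do
  for node <- [Node.self() | Node.list()] do
    case :rpc.call(node, Foundation.ProcessManager, :start_local, [module, args, opts]) do
      {:badrpc, reason} -> {:error, {node, reason}}
      result -> result
    end
  end
end
```

This matches the contracted return type: one `{:ok, pid()} | {:error, term()}` per node.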
## 3. Testing Strategy & Requirements

### Testing Pyramid

#### Unit Tests (60% of coverage)

```elixir
# Test each facade function in isolation
test "ProcessManager.start_singleton creates child and registers" do
  assert {:ok, pid} = Foundation.ProcessManager.start_singleton(TestWorker, [], name: :test)
  assert {:ok, ^pid} = Foundation.ProcessManager.lookup_singleton(:test)
  assert [{^pid, _}] = Horde.Registry.lookup(Foundation.ProcessRegistry, :test)
end

# Test configuration translation
test "ClusterConfig translates :kubernetes to libcluster topology" do
  topology = Foundation.ClusterConfig.translate_kubernetes_config(app_name: "test-app")
  assert topology[:strategy] == Cluster.Strategy.Kubernetes
  assert topology[:config][:kubernetes_selector] == "app=test-app"
end
```
#### Integration Tests (30% of coverage)

```elixir
# Multi-node cluster formation
test "cluster forms correctly with mdns_lite strategy" do
  nodes = start_cluster([:"[email protected]", :"[email protected]"])

  # Verify cluster formation: each node sees the other
  Enum.each(nodes, fn node ->
    cluster_nodes = :rpc.call(node, Node, :list, [])
    assert length(cluster_nodes) == 1
  end)
end

# Service discovery across nodes
test "services registered on one node are discoverable from another" do
  {node1, node2} = start_two_node_cluster()

  # Register a service on node1
  :ok = :rpc.call(node1, Foundation.ServiceMesh, :register_service,
    [:test_service, spawn_service(), [:capability_a], %{}])

  # Discover it from node2 (the keyword list is a single argument)
  services = :rpc.call(node2, Foundation.ServiceMesh, :discover_services,
    [[name: :test_service]])

  assert length(services) == 1
end
```
#### End-to-End Tests (10% of coverage)

```elixir
# Complete application scenarios
test "elixir_scope distributed debugging workflow" do
  cluster = start_three_node_cluster()

  # Start ElixirScope on the cluster
  start_elixir_scope_on_cluster(cluster)

  # Execute a distributed operation with tracing
  trace_id = ElixirScope.start_distributed_trace()
  _result = execute_distributed_workflow(cluster, trace_id)

  # Verify trace data was collected from all nodes
  trace_data = ElixirScope.get_trace_data(trace_id)
  assert trace_data.nodes == cluster
  assert trace_data.complete == true
end
```
### Performance Benchmarks

#### Cluster Formation Speed

```elixir
@tag :benchmark
test "cluster formation completes within SLA" do
  start_time = System.monotonic_time(:millisecond)

  nodes = start_cluster_async(5)
  wait_for_full_cluster_formation(nodes)

  formation_time = System.monotonic_time(:millisecond) - start_time
  assert formation_time < 30_000 # 30-second SLA
end
```

#### Message Throughput

```elixir
@tag :benchmark
test "channel messaging meets throughput requirements" do
  cluster = start_cluster(3)

  # Measure messages per second on each channel
  throughput = measure_channel_throughput(cluster, duration: 10_000)

  assert throughput[:control] > 1_000 # 1k/sec control messages
  assert throughput[:events] > 5_000  # 5k/sec event messages
  assert throughput[:data] > 10_000   # 10k/sec data messages
end
```
### Chaos Testing Requirements

#### Network Partitions

```elixir
test "cluster heals after network partition" do
  cluster = start_cluster(5)

  # Create a partition: [node1, node2] vs [node3, node4, node5]
  create_network_partition(cluster, split: 2)

  # Verify each partition continues operating
  verify_partition_operations(cluster)

  # Heal the partition
  heal_network_partition(cluster)

  # Verify cluster convergence
  assert_eventually(fn -> cluster_fully_converged?(cluster) end, 30_000)
end
```

#### Node Failures

```elixir
test "services migrate when nodes fail" do
  cluster = start_cluster(3)

  # Start a singleton service
  {:ok, service_pid} = Foundation.ProcessManager.start_singleton(TestService, [])
  original_node = node(service_pid)

  # Kill the node hosting the service
  kill_node(original_node)

  # Verify the service restarts on another node
  assert_eventually(fn ->
    case Foundation.ProcessManager.lookup_singleton(TestService) do
      {:ok, new_pid} -> node(new_pid) != original_node
      _ -> false
    end
  end, 10_000)
end
```
## 4. Performance Characteristics & SLAs

### Cluster Formation SLAs

| Cluster Size | Formation Time | 99th Percentile |
|---|---|---|
| 2-5 nodes | < 10 seconds | < 15 seconds |
| 6-20 nodes | < 30 seconds | < 45 seconds |
| 21-100 nodes | < 2 minutes | < 3 minutes |

### Message Latency SLAs

| Message Type | Average Latency | 99th Percentile |
|---|---|---|
| Control | < 5ms | < 20ms |
| Events | < 10ms | < 50ms |
| Telemetry | < 50ms | < 200ms |
| Data | < 100ms | < 500ms |
### Resource Utilization Targets

#### Memory Overhead

- Foundation core: < 50MB per node
- Per-service overhead: < 1MB per registered service
- Channel overhead: < 10MB per active channel

#### CPU Utilization

- Idle cluster: < 1% CPU for Foundation processes
- Active messaging: < 5% CPU overhead vs direct Distributed Erlang
- Cluster formation: < 30% CPU spike, lasting < 60 seconds

#### Network Bandwidth

- Control messages: < 1KB/message average
- Heartbeat overhead: < 100 bytes/second per node pair
- Service discovery: < 10KB per discovery operation
## 5. Security & Operational Requirements

### Security Model

#### Cluster Authentication

```elixir
# Foundation must support secure cluster formation
config :foundation,
  cluster: :kubernetes,
  security: [
    erlang_cookie: {:system, "ERLANG_COOKIE"},
    tls_enabled: true,
    certificate_file: "/etc/ssl/certs/cluster.pem"
  ]
```

#### Service Authorization

```elixir
# Services can declare required capabilities
Foundation.ServiceMesh.register_service(
  :payment_service,
  self(),
  [:payment_processing],
  %{security_level: :high, audit_required: true}
)

# Discovery can filter by security requirements
Foundation.ServiceMesh.discover_services(
  capabilities: [:payment_processing],
  security_level: :high
)
```
### Operational Requirements

#### Health Monitoring

```elixir
# Foundation must expose health endpoints
Foundation.HealthMonitor.cluster_health()
# => %{
#      status: :healthy | :degraded | :critical,
#      nodes: [{node, status, metrics}],
#      services: [{service, status, instances}],
#      last_check: timestamp
#    }
```

#### Metrics Collection

```elixir
# Standard telemetry events Foundation must emit
[:foundation, :cluster, :node_joined]  => %{node: atom()}
[:foundation, :cluster, :node_left]    => %{node: atom(), reason: term()}
[:foundation, :service, :registered]   => %{name: term(), node: atom()}
[:foundation, :service, :deregistered] => %{name: term(), node: atom()}
[:foundation, :message, :sent]         => %{channel: atom(), size: integer()}
[:foundation, :message, :received]     => %{channel: atom(), latency: integer()}
```
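Consumers hook these with the standard `:telemetry` API; for example (the handler id and log line are illustrative):

```elixir
# Attach one handler to both node-membership events emitted by Foundation.
:telemetry.attach_many(
  "my-app-foundation-cluster",
  [[:foundation, :cluster, :node_joined], [:foundation, :cluster, :node_left]],
  fn event, _measurements, metadata, _config ->
    require Logger
    Logger.info("cluster event #{inspect(event)}: #{inspect(metadata)}")
  end,
  nil
)
```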
#### Logging Standards

```elixir
# Foundation must provide structured logging
Logger.info("Foundation cluster formed",
  cluster_id: cluster_id,
  node_count: 3,
  formation_time_ms: 15_432,
  strategy: :kubernetes
)

Logger.warning("Service health check failed",
  service: :user_service,
  node: :"[email protected]",
  consecutive_failures: 3,
  action: :marking_unhealthy
)
```
### Deployment Requirements

#### Container Support

```dockerfile
# Foundation must work in containerized environments
FROM elixir:1.15-alpine

# Required for clustering
RUN apk add --no-cache netcat-openbsd

# Foundation clustering just works
COPY . /app
WORKDIR /app
RUN mix deps.get && mix compile

# A single environment variable enables clustering
ENV FOUNDATION_CLUSTER=kubernetes
CMD ["mix", "run", "--no-halt"]
```
#### Kubernetes Integration

```yaml
# Foundation must integrate with Kubernetes service discovery
apiVersion: apps/v1
kind: Deployment
metadata:
  name: foundation-app
spec:
  selector:
    matchLabels:
      app: foundation-app
  template:
    metadata:
      labels:
        app: foundation-app
    spec:
      containers:
        - name: app
          image: foundation-app:latest
          env:
            - name: FOUNDATION_CLUSTER
              value: "kubernetes"
            - name: FOUNDATION_APP_NAME
              value: "foundation-app"
```
## 6. Migration & Compatibility Guide

### From Existing libcluster

#### Zero-Change Migration

```elixir
# Existing libcluster config continues to work
config :libcluster,
  topologies: [
    k8s: [
      strategy: Cluster.Strategy.Kubernetes,
      config: [...]
    ]
  ]

# Just add the Foundation dependency - no config changes needed
{:foundation, "~> 2.0"}
```

#### Gradual Enhancement Migration

```elixir
# Phase 1: Add Foundation, keep the libcluster config
{:foundation, "~> 2.0"}

# Phase 2: Start using Foundation services
Foundation.ServiceMesh.register_service(:my_service, self(), [], %{})

# Phase 3: Switch to Foundation clustering,
# then remove the old libcluster config
config :foundation, cluster: :kubernetes
```
### From Plain OTP/GenServer

#### Service Registry Migration

```elixir
# Before: manual global registration
GenServer.start_link(MyService, [], name: {:global, :my_service})

# After: Foundation service mesh
Foundation.ProcessManager.start_singleton(MyService, [], name: :my_service)
```

#### PubSub Migration

```elixir
# Before: manual message broadcasting
Enum.each(Node.list(), fn node ->
  send({:my_service, node}, message)
end)

# After: Foundation channels
Foundation.Channels.broadcast(:events, message)
```
### API Compatibility Matrix

| Foundation 1.x API | Foundation 2.0 Support | Notes |
|---|---|---|
| Foundation.Config.* | ✅ Full compatibility | All existing APIs unchanged |
| Foundation.Events.* | ✅ Full compatibility | Enhanced with distributed features |
| Foundation.Telemetry.* | ✅ Full compatibility | Cluster aggregation added |
| Foundation.ServiceRegistry.* | ✅ Full compatibility | Enhanced with Horde backend |
This technical specification provides the foundation for implementing Foundation 2.0 with confidence, ensuring we build exactly what we’ve envisioned in our synthesis discussions.