
F2 040 TECH SPECS DESIGN DOCS claude

Documentation for F2_040_TECH_SPECS_DESIGN_DOCS_claude from the Foundation repository.

Foundation 2.0: Technical Specifications & Design Documents

Document Set Overview

This document set provides the technical foundation for implementing Foundation 2.0 based on our synthesis of “Smart Facades on a Pragmatic Core” with intentionally leaky abstractions.


1. API Contract Specification

Layer 1: Pragmatic Core APIs

Configuration System

# Mortal Mode (Zero Config)
config :foundation, cluster: true

# Apprentice Mode (Simple Config)
config :foundation, cluster: :kubernetes
config :foundation, cluster: :consul
config :foundation, cluster: :dns
config :foundation, cluster: [strategy: :gossip, secret: "my-secret"]

# Wizard Mode (Full libcluster passthrough)
config :libcluster, topologies: [...]  # Foundation detects and defers
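
Mode detection is precedence-based: an explicit :libcluster config always wins. A minimal sketch of the resolution logic, assuming illustrative module and function names (resolve_mode/0 is not part of the published API):

defmodule Foundation.ClusterConfig do
  # Sketch only: Wizard (existing libcluster config) wins, then Mortal
  # (cluster: true), then Apprentice (named strategy or keyword list).
  def resolve_mode do
    cluster = Application.get_env(:foundation, :cluster)
    topologies = Application.get_env(:libcluster, :topologies)

    cond do
      topologies != nil -> {:wizard, topologies}
      cluster == true -> {:mortal, :auto_detect}
      is_atom(cluster) and cluster not in [nil, false] -> {:apprentice, cluster}
      is_list(cluster) -> {:apprentice, cluster}
      true -> :clustering_disabled
    end
  end
end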

Core Supervision Contract

# Foundation.Supervisor must provide these child processes when clustering enabled:
- {Cluster.Supervisor, resolved_topology}
- {Phoenix.PubSub, name: Foundation.PubSub, adapter: Phoenix.PubSub.PG2}
- {Horde.Registry, name: Foundation.ProcessRegistry, keys: :unique, members: :auto}
- {Horde.DynamicSupervisor, name: Foundation.DistributedSupervisor, members: :auto}
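
A sketch of how Foundation.Supervisor might assemble these children; the resolve_topology/0 helper and the :rest_for_one strategy are assumptions, not part of the contract:

defmodule Foundation.Supervisor do
  use Supervisor

  def start_link(opts), do: Supervisor.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    # Assumed helper that turns Mortal/Apprentice/Wizard config into a
    # libcluster topology list.
    topologies = Foundation.ClusterConfig.resolve_topology()

    children = [
      {Cluster.Supervisor, [topologies, [name: Foundation.ClusterSupervisor]]},
      {Phoenix.PubSub, name: Foundation.PubSub, adapter: Phoenix.PubSub.PG2},
      {Horde.Registry, name: Foundation.ProcessRegistry, keys: :unique, members: :auto},
      # Horde.DynamicSupervisor additionally requires strategy: :one_for_one
      {Horde.DynamicSupervisor,
       name: Foundation.DistributedSupervisor, strategy: :one_for_one, members: :auto}
    ]

    Supervisor.init(children, strategy: :rest_for_one)
  end
end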

Layer 2: Smart Facade APIs

Process Management Facade

# Primary API Contract
@spec start_singleton(module(), args :: list(), opts :: keyword()) ::
  {:ok, pid()} | {:error, term()}

@spec start_replicated(module(), args :: list(), opts :: keyword()) ::
  list({:ok, pid()} | {:error, term()})

@spec lookup_singleton(term()) :: {:ok, pid()} | :not_found

# Supported Options
opts :: [
  name: term(),           # Registry key (optional)
  restart: :permanent | :temporary | :transient,
  timeout: pos_integer() | :infinity
]
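
For illustration, a thin-facade sketch of start_singleton/3 and lookup_singleton/1 in the spirit of ADR-001; it assumes the target module's start_link/1 accepts a :name option, and error handling is abbreviated:

defmodule Foundation.ProcessManager do
  # Sketch: the Horde calls are deliberately visible (see ADR-001).
  def start_singleton(module, args, opts \\ []) do
    name = Keyword.get(opts, :name, module)
    via = {:via, Horde.Registry, {Foundation.ProcessRegistry, name}}

    spec = %{
      id: name,
      # Assumes module.start_link/1 takes a keyword list with a :name option.
      start: {module, :start_link, [args ++ [name: via]]},
      restart: Keyword.get(opts, :restart, :permanent)
    }

    case Horde.DynamicSupervisor.start_child(Foundation.DistributedSupervisor, spec) do
      {:ok, pid} -> {:ok, pid}
      {:error, {:already_started, pid}} -> {:ok, pid}
      error -> error
    end
  end

  def lookup_singleton(name) do
    case Horde.Registry.lookup(Foundation.ProcessRegistry, name) do
      [{pid, _value}] -> {:ok, pid}
      [] -> :not_found
    end
  end
end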

Service Discovery Facade

# Primary API Contract
@spec register_service(name :: term(), pid(), capabilities :: list(), metadata :: map()) ::
  :ok | {:error, term()}

@spec discover_services(criteria :: keyword()) :: list(service_info())

@type service_info() :: %{
  name: term(),
  pid: pid(),
  node: atom(),
  capabilities: list(),
  metadata: map(),
  registered_at: integer()
}

# Discovery Criteria
criteria :: [
  name: term(),
  capabilities: list(),
  node: atom(),
  health_check: (service_info() -> boolean())
]
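
A sketch of how discover_services/1 could filter registry entries, assuming full service_info() maps are stored as registry values (an implementation detail, not part of the contract):

defmodule Foundation.ServiceMesh do
  def discover_services(criteria \\ []) do
    # Select every registry value; each is assumed to be a service_info() map.
    Foundation.ProcessRegistry
    |> Horde.Registry.select([{{:"$1", :"$2", :"$3"}, [], [:"$3"]}])
    |> Enum.filter(&matches?(&1, criteria))
  end

  defp matches?(info, criteria) do
    Enum.all?(criteria, fn
      {:name, name} -> info.name == name
      {:capabilities, caps} -> Enum.all?(caps, &(&1 in info.capabilities))
      {:node, node} -> info.node == node
      {:health_check, check} -> check.(info)
    end)
  end
end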

Channel Communication Facade

# Primary API Contract  
@spec broadcast(channel :: atom(), message :: term(), opts :: keyword()) :: :ok
@spec subscribe(channel :: atom(), handler :: pid()) :: :ok
@spec route_message(message :: term(), opts :: keyword()) :: :ok

# Standard Channels
@type channel() :: :control | :events | :telemetry | :data

# Routing Options
opts :: [
  channel: channel(),
  priority: :low | :normal | :high,
  compression: boolean(),
  timeout: pos_integer()
]
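
A thin-wrapper sketch of this facade over Phoenix.PubSub; routing options are elided, and note that Phoenix.PubSub.subscribe/2 subscribes the calling process:

defmodule Foundation.Channels do
  @topics %{
    control: "foundation:control",
    events: "foundation:events",
    telemetry: "foundation:telemetry",
    data: "foundation:data"
  }

  def broadcast(channel, message, _opts \\ []) do
    Phoenix.PubSub.broadcast(Foundation.PubSub, Map.fetch!(@topics, channel), message)
  end

  # Phoenix.PubSub subscribes the caller, so this is expected to be invoked
  # from the handler process itself.
  def subscribe(channel, _handler \\ self()) do
    Phoenix.PubSub.subscribe(Foundation.PubSub, Map.fetch!(@topics, channel))
  end
end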

Layer 3: Direct Tool Access

Foundation MUST provide these registered processes for direct access:

  • Foundation.PubSub (Phoenix.PubSub process)
  • Foundation.ProcessRegistry (Horde.Registry process)
  • Foundation.DistributedSupervisor (Horde.DynamicSupervisor process)
  • Foundation.ClusterSupervisor (Cluster.Supervisor process)
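
Because the facades are thin, dropping to this layer is always safe. A few illustrative calls (the registry key and message are hypothetical):

# Bypassing the facades to talk to the tools directly
Phoenix.PubSub.broadcast(Foundation.PubSub, "foundation:events", {:cache_invalidated, :users})
Horde.Registry.lookup(Foundation.ProcessRegistry, :my_service)
Horde.DynamicSupervisor.which_children(Foundation.DistributedSupervisor)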

2. Architecture Decision Records (ADRs)

ADR-001: Leaky Abstractions by Design

Status: Accepted
Date: 2024-12-XX
Deciders: Claude & Gemini Synthesis

Context

We need to decide whether Foundation’s facades should hide or expose the underlying tool implementations.

Decision

We will implement intentionally leaky abstractions that celebrate rather than hide the underlying tools.

Rationale

  • Debuggability: Developers can understand what’s happening under the hood
  • No Lock-in: Easy to drop to tool level when needed
  • Learning Path: Smooth progression from simple to advanced usage
  • Community Leverage: Builds on existing knowledge rather than replacing it

Implementation

  • All facades are thin, stateless wrappers
  • Facade source code clearly shows underlying tool calls
  • Direct tool access is always available and documented
  • Error messages reference both facade and underlying tool

ADR-002: Three-Layer Configuration Model

Status: Accepted
Date: 2024-12-XX

Context

We need a configuration system that serves beginners and experts equally well.

Decision

Implement Mortal/Apprentice/Wizard three-tier configuration system.

Rationale

  • Progressive Disclosure: Start simple, add complexity as needed
  • Migration Friendly: Wizard mode preserves existing libcluster configs
  • Environment Appropriate: Different complexity levels for different deployment scenarios

Implementation

# Mortal: Foundation controls everything
config :foundation, cluster: true

# Apprentice: Foundation translates high-level config  
config :foundation, cluster: :kubernetes

# Wizard: Foundation defers to existing libcluster config
config :libcluster, topologies: [...]

ADR-003: Phoenix.PubSub for Application-Layer Channels

Status: Accepted
Date: 2024-12-XX

Context

Distributed Erlang multiplexes all traffic between two nodes over a single connection, so one large message can head-of-line block everything queued behind it. We need application-layer channel separation.

Decision

Use Phoenix.PubSub with standardized topic namespaces to create logical channels over Distributed Erlang.

Rationale

  • Proven Solution: Phoenix.PubSub is battle-tested for distributed messaging
  • Logical Separation: Different message types use different topics
  • No Transport Changes: Works over standard Distributed Erlang
  • Ecosystem Integration: Leverages existing Phoenix.PubSub knowledge

Implementation

# Standard channels map to PubSub topics
:control -> "foundation:control"
:events -> "foundation:events"  
:telemetry -> "foundation:telemetry"
:data -> "foundation:data"
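
Because the mapping is only a topic convention, a facade subscription and a raw Phoenix.PubSub subscription are interchangeable:

# Equivalent subscriptions (facade vs. raw topic)
Foundation.Channels.subscribe(:events)
Phoenix.PubSub.subscribe(Foundation.PubSub, "foundation:events")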

ADR-004: Horde for Distributed Process Management

Status: Accepted
Date: 2024-12-XX

Context

We need distributed process registry and supervision capabilities.

Decision

Use Horde as the single source of truth for distributed processes, with facades providing common patterns.

Rationale

  • CRDT Foundation: Horde’s delta-CRDT approach handles network partitions well
  • OTP Compatibility: Familiar APIs that match standard OTP patterns
  • Active Maintenance: Well-maintained with good community support
  • Proven Patterns: Existing successful deployments validate the approach

Implementation

  • Foundation.ProcessRegistry = Horde.Registry instance
  • Foundation.DistributedSupervisor = Horde.DynamicSupervisor instance
  • Facades provide common patterns (singleton, replicated, partitioned)

3. Testing Strategy & Requirements

Testing Pyramid

Unit Tests (60% of coverage)

# Test each facade function in isolation
test "ProcessManager.start_singleton creates child and registers" do
  assert {:ok, pid} = Foundation.ProcessManager.start_singleton(TestWorker, [], name: :test)
  assert {:ok, ^pid} = Foundation.ProcessManager.lookup_singleton(:test)
  assert [{^pid, _}] = Horde.Registry.lookup(Foundation.ProcessRegistry, :test)
end

# Test configuration translation
test "ClusterConfig translates :kubernetes to libcluster topology" do
  topology = Foundation.ClusterConfig.translate_kubernetes_config(app_name: "test-app")
  assert topology[:strategy] == Cluster.Strategy.Kubernetes
  assert topology[:config][:kubernetes_selector] == "app=test-app"
end

Integration Tests (30% of coverage)

# Multi-node cluster formation
test "cluster forms correctly with mdns_lite strategy" do
  nodes = start_cluster([:"node1@127.0.0.1", :"node2@127.0.0.1"])
  
  # Verify cluster formation
  Enum.each(nodes, fn node ->
    cluster_nodes = :rpc.call(node, Node, :list, [])
    assert length(cluster_nodes) == 1  # Other node connected
  end)
end

# Service discovery across nodes
test "services registered on one node are discoverable from another" do
  {node1, node2} = start_two_node_cluster()
  
  # Register service on node1
  :ok = :rpc.call(node1, Foundation.ServiceMesh, :register_service, 
    [:test_service, spawn_service(), [:capability_a]])
  
  # Discover from node2
  services = :rpc.call(node2, Foundation.ServiceMesh, :discover_services, 
    [name: :test_service])
  assert length(services) == 1
end

End-to-End Tests (10% of coverage)

# Complete application scenarios
test "elixir_scope distributed debugging workflow" do
  cluster = start_three_node_cluster()
  
  # Start ElixirScope on cluster
  start_elixir_scope_on_cluster(cluster)
  
  # Execute distributed operation with tracing
  trace_id = ElixirScope.start_distributed_trace()
  result = execute_distributed_workflow(cluster, trace_id)
  
  # Verify trace data collected from all nodes
  trace_data = ElixirScope.get_trace_data(trace_id)
  assert trace_data.nodes == cluster
  assert trace_data.complete == true
end

Performance Benchmarks

Cluster Formation Speed

@tag :benchmark
test "cluster formation completes within SLA" do
  start_time = System.monotonic_time(:millisecond)
  
  nodes = start_cluster_async(5)
  wait_for_full_cluster_formation(nodes)
  
  formation_time = System.monotonic_time(:millisecond) - start_time
  assert formation_time < 30_000  # 30 seconds SLA
end

Message Throughput

@tag :benchmark  
test "channel messaging meets throughput requirements" do
  cluster = start_cluster(3)
  
  # Measure messages per second across channels
  throughput = measure_channel_throughput(cluster, duration: 10_000)
  
  assert throughput[:control] > 1_000    # 1k/sec control messages
  assert throughput[:events] > 5_000     # 5k/sec event messages  
  assert throughput[:data] > 10_000      # 10k/sec data messages
end

Chaos Testing Requirements

Network Partitions

test "cluster heals after network partition" do
  cluster = start_cluster(5)
  
  # Create partition: [node1, node2] vs [node3, node4, node5]
  create_network_partition(cluster, split: 2)
  
  # Verify each partition continues operating
  verify_partition_operations(cluster)
  
  # Heal partition
  heal_network_partition(cluster)
  
  # Verify cluster convergence
  assert_eventually(fn -> cluster_fully_converged?(cluster) end, 30_000)
end

Node Failures

test "services migrate when nodes fail" do
  cluster = start_cluster(3)
  
  # Start singleton service
  {:ok, service_pid} = Foundation.ProcessManager.start_singleton(TestService, [])
  original_node = node(service_pid)
  
  # Kill the node hosting the service
  kill_node(original_node)
  
  # Verify service restarts on another node
  assert_eventually(fn ->
    case Foundation.ProcessManager.lookup_singleton(TestService) do
      {:ok, new_pid} -> node(new_pid) != original_node
      _ -> false
    end
  end, 10_000)
end
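
The chaos tests above rely on an assert_eventually/2 helper; a minimal sketch for a test support module (the 100ms polling interval is an arbitrary choice):

# flunk/1 comes from ExUnit.Assertions, available inside ExUnit.Case modules
defp assert_eventually(check, timeout_ms) do
  deadline = System.monotonic_time(:millisecond) + timeout_ms
  poll_until(check, deadline)
end

defp poll_until(check, deadline) do
  cond do
    check.() ->
      :ok

    System.monotonic_time(:millisecond) >= deadline ->
      flunk("condition not met before deadline")

    true ->
      Process.sleep(100)
      poll_until(check, deadline)
  end
end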

4. Performance Characteristics & SLAs

Cluster Formation SLAs

Cluster Size      Formation Time    99th Percentile
2-5 nodes         < 10 seconds      < 15 seconds
6-20 nodes        < 30 seconds      < 45 seconds
21-100 nodes      < 2 minutes       < 3 minutes

Message Latency SLAs

Message Type    Average Latency    99th Percentile
Control         < 5ms              < 20ms
Events          < 10ms             < 50ms
Telemetry       < 50ms             < 200ms
Data            < 100ms            < 500ms

Resource Utilization Targets

Memory Overhead

  • Foundation core: < 50MB per node
  • Per-service overhead: < 1MB per registered service
  • Channel overhead: < 10MB per active channel

CPU Utilization

  • Idle cluster: < 1% CPU for Foundation processes
  • Active messaging: < 5% CPU overhead vs direct Distributed Erlang
  • Cluster formation: < 30% CPU spike, duration < 60 seconds

Network Bandwidth

  • Control messages: < 1KB/message average
  • Heartbeat overhead: < 100 bytes/second per node pair
  • Service discovery: < 10KB per discovery operation

5. Security & Operational Requirements

Security Model

Cluster Authentication

# Foundation must support secure cluster formation
config :foundation,
  cluster: :kubernetes,
  security: [
    erlang_cookie: {:system, "ERLANG_COOKIE"},
    tls_enabled: true,
    certificate_file: "/etc/ssl/certs/cluster.pem"
  ]

Service Authorization

# Services can declare required capabilities
Foundation.ServiceMesh.register_service(
  :payment_service,
  self(),
  [:payment_processing],
  %{security_level: :high, audit_required: true}
)

# Discovery can filter by security requirements, e.g. via a health_check predicate
Foundation.ServiceMesh.discover_services(
  capabilities: [:payment_processing],
  health_check: fn info -> info.metadata[:security_level] == :high end
)

Operational Requirements

Health Monitoring

# Foundation must expose health endpoints
Foundation.HealthMonitor.cluster_health()
# => %{
#   status: :healthy | :degraded | :critical,
#   nodes: [{node, status, metrics}],
#   services: [{service, status, instances}],
#   last_check: timestamp
# }
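
A sketch of how cluster_health/0 might assemble that shape; the expected_node_count/0 helper and the :up statuses are placeholders:

def cluster_health do
  nodes = [Node.self() | Node.list()]

  %{
    status: if(length(nodes) >= expected_node_count(), do: :healthy, else: :degraded),
    nodes: Enum.map(nodes, fn node -> {node, :up, %{}} end),
    services:
      Foundation.ServiceMesh.discover_services()
      |> Enum.group_by(& &1.name)
      |> Enum.map(fn {name, infos} -> {name, :up, length(infos)} end),
    last_check: System.system_time(:second)
  }
end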

Metrics Collection

# Standard telemetry events Foundation must emit
[:foundation, :cluster, :node_joined] => %{node: atom()}
[:foundation, :cluster, :node_left] => %{node: atom(), reason: term()}
[:foundation, :service, :registered] => %{name: term(), node: atom()}
[:foundation, :service, :deregistered] => %{name: term(), node: atom()}
[:foundation, :message, :sent] => %{channel: atom(), size: integer()}
[:foundation, :message, :received] => %{channel: atom(), latency: integer()}
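
Consumers can attach to these events with the standard :telemetry API, for example:

require Logger

:telemetry.attach_many(
  "foundation-cluster-logger",
  [
    [:foundation, :cluster, :node_joined],
    [:foundation, :cluster, :node_left]
  ],
  fn event, _measurements, metadata, _config ->
    Logger.info("foundation cluster event", event: inspect(event), node: metadata[:node])
  end,
  nil
)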

Logging Standards

# Foundation must provide structured logging
Logger.info("Foundation cluster formed", 
  cluster_id: cluster_id,
  node_count: 3,
  formation_time_ms: 15_432,
  strategy: :kubernetes
)

Logger.warning("Service health check failed",
  service: :user_service,
  node: :"[email protected]",
  consecutive_failures: 3,
  action: :marking_unhealthy
)

Deployment Requirements

Container Support

# Foundation must work in containerized environments
FROM elixir:1.15-alpine

# Required for clustering
RUN apk add --no-cache netcat-openbsd

# Foundation clustering just works
COPY . /app
WORKDIR /app
RUN mix deps.get && mix compile

# Single environment variable enables clustering
ENV FOUNDATION_CLUSTER=kubernetes

CMD ["mix", "run", "--no-halt"]

Kubernetes Integration

# Foundation must integrate with Kubernetes service discovery
apiVersion: apps/v1
kind: Deployment
metadata:
  name: foundation-app
spec:
  selector:
    matchLabels:
      app: foundation-app
  template:
    metadata:
      labels:
        app: foundation-app
    spec:
      containers:
      - name: app
        image: foundation-app:latest
        env:
        - name: FOUNDATION_CLUSTER
          value: "kubernetes"
        - name: FOUNDATION_APP_NAME
          value: "foundation-app"

6. Migration & Compatibility Guide

From Existing libcluster

Zero-Change Migration

# Existing libcluster config continues to work
config :libcluster,
  topologies: [
    k8s: [
      strategy: Cluster.Strategy.Kubernetes,
      config: [...]
    ]
  ]

# Just add Foundation dependency - no config changes needed
{:foundation, "~> 2.0"}

Gradual Enhancement Migration

# Phase 1: Add Foundation, keep libcluster config
{:foundation, "~> 2.0"}

# Phase 2: Start using Foundation services
Foundation.ServiceMesh.register_service(:my_service, self())

# Phase 3: Switch to Foundation clustering
config :foundation, cluster: :kubernetes
# Remove old libcluster config

From Plain OTP/GenServer

Service Registry Migration

# Before: Manual process registry
GenServer.start_link(MyService, [], name: {:global, :my_service})

# After: Foundation service mesh
Foundation.ProcessManager.start_singleton(MyService, [], name: :my_service)

PubSub Migration

# Before: Manual message broadcasting
Enum.each(Node.list(), fn node ->
  send({:my_service, node}, message)
end)

# After: Foundation channels
Foundation.Channels.broadcast(:events, message)

API Compatibility Matrix

Foundation 1.x API            Foundation 2.0 Support    Notes
Foundation.Config.*           ✅ Full compatibility      All existing APIs unchanged
Foundation.Events.*           ✅ Full compatibility      Enhanced with distributed features
Foundation.Telemetry.*        ✅ Full compatibility      Cluster aggregation added
Foundation.ServiceRegistry.*  ✅ Full compatibility      Enhanced with Horde backend

This technical specification provides the foundation for implementing Foundation 2.0 with confidence, ensuring we build exactly what we’ve envisioned in our synthesis discussions.