← Back to Docs20250627

031 GEMINI REVIEW 20250625 1630 03 GRACEFUL SMELL

Documentation for 031_GEMINI_REVIEW_20250625_1630_03_GRACEFUL_SMELL from the Foundation repository.

You are absolutely right to elevate this from a “minor” issue to a significant one. While the system works, this architectural choice represents a fundamental design flaw that will cause escalating problems related to maintainability, cognitive load, and brittleness as the codebase grows. It’s a classic example of a “code smell” that indicates a deeper structural problem.

Let’s do a full, detailed breakdown of the issues and then lay out a concrete, step-by-step plan to address it properly.

Part 1: A Deep Dive into the Flaw

The core problem is that the resilience logic (what to do when ConfigServer is down) is completely divorced from the service itself. This creates several major issues:

1. Violation of the Single Responsibility Principle (SRP) & Cohesion

A service should be responsible for its entire lifecycle and behavior, including its failure modes. The ConfigServer’s responsibility is “to provide configuration.” This implicitly includes “what to do when configuration cannot be provided normally.”

  • Current State (Low Cohesion):
    • Foundation.Services.ConfigServer: Knows how to provide config when it’s healthy.
    • Foundation.GracefulDegradation: Knows how to provide config when ConfigServer is unhealthy.
    • Foundation.Config: The public API that has to be aware of both of the above modules to function correctly.

This is a fractured design. The logic for a single conceptual unit (configuration) is spread across three unrelated modules.

2. Tight Coupling and Brittleness

The GracefulDegradation module is dangerously coupled to the implementation details and failure modes of ConfigServer. It’s a “white-box” dependency.

  • The Problem: The get_with_fallback function knows that Config.get() will return {:error, _} when the server is down. What if a future developer changes ConfigServer to raise an exception on failure instead? The GracefulDegradation module would instantly break because its case statement wouldn’t match. It’s relying on a specific failure signature, not a public contract. This makes the system brittle and resistant to refactoring.

3. Increased Cognitive Load & Poor Discoverability

Imagine you are a new developer tasked with understanding how configuration is handled.

  1. You start at the public API: Foundation.Config.get/1.
  2. You see it delegates to Foundation.Services.ConfigServer.get/1.
  3. You read ConfigServer and see it’s a GenServer that handles a :get_config_path call. You think you understand the whole picture.

You would have no way of knowing that a completely separate module, GracefulDegradation, exists and contains critical fallback logic. You might make changes assuming there is no resilience, or you might spend hours debugging why a config value is still being returned even after you killed the server. The current design is not intuitive or self-documenting.

4. Incomplete Resilience Strategy

The fallback logic is applied inconsistently. GracefulDegradation provides fallbacks for get and update, but what about other functions on the Configurable behaviour, like reset/0 or updatable_paths/0? They don’t have a fallback mechanism. This leads to an unpredictable API where some functions are resilient and others are not, violating the Principle of Least Surprise.


Part 2: The Refactoring Plan: The Resilient Service Proxy

The solution is to make each service responsible for its own resilience. The GracefulDegradation module must be eliminated, and its logic moved into the service layer itself. The best way to do this without cluttering the GenServer is by using a Resilient Proxy pattern.

The Goal

The Foundation.Services.ConfigServer module will become the single source of truth for all configuration logic, both for the happy path and for failure scenarios. The client should never have to know whether the GenServer is up or if it’s reading from a cache.

Step 1: Separate the GenServer from its API

The first step is to break the ConfigServer module in two.

  1. Create Foundation.Services.ConfigServer.Server: Rename the existing ConfigServer GenServer implementation to this new, internal module name. This module’s only job is to be the stateful GenServer.
  2. Make Foundation.Services.ConfigServer the Proxy: The original module will no longer be a GenServer. It will become a stateless proxy module that contains the public-facing functions (get/1, update/2, etc.) and the resilience logic.

Step 2: Implement the Resilient Proxy Logic

Now, move the logic from GracefulDegradation into the new ConfigServer proxy module.

File: lib/foundation/services/config_server.ex (The New Proxy)

defmodule Foundation.Services.ConfigServer do
  @moduledoc """
  Resilient proxy for the configuration service.

  This module is the single public entrypoint for configuration. It handles
  the primary logic of calling the ConfigServer.Server GenServer and implements
  all fallback and caching logic if the primary service is unavailable.
  """

  # Behave like the contract
  @behaviour Foundation.Contracts.Configurable

  alias Foundation.Services.ConfigServer.Server # The actual GenServer
  alias Foundation.Types.Error

  @fallback_table :config_fallback_cache
  @cache_ttl 300 # 5 minutes

  # This is no longer a GenServer, so no start_link here.
  # The application supervisor will now start ConfigServer.Server directly.

  @impl Configurable
  def get(path) when is_list(path) do
    # Try the primary service first
    case GenServer.call(Server, {:get_config_path, path}) do
      {:ok, value} ->
        # Success! Cache the value and return it.
        cache_value(path, value)
        {:ok, value}

      {:error, _reason} ->
        # Primary failed. Log it and try the fallback.
        Logger.warning("ConfigServer is unavailable. Attempting to read from fallback cache.")
        get_from_cache(path)
    end
  rescue
    # GenServer.call will raise if the process is not alive
    _ ->
      Logger.warning("ConfigServer process is not alive. Reading from fallback cache.")
      get_from_cache(path)
  end

  # ... other functions like update/2, reset/0 would follow the same pattern ...
  # e.g., try GenServer, on failure/exception, use fallback logic.

  # --- Fallback Logic (moved from GracefulDegradation) ---

  defp get_from_cache(path) do
    cache_key = {:config_cache, path}
    case :ets.lookup(@fallback_table, cache_key) do
      [{^cache_key, value, timestamp}] ->
        if (System.system_time(:second) - timestamp) <= @cache_ttl do
          {:ok, value}
        else
          :ets.delete(@fallback_table, cache_key)
          create_unavailable_error(path, "Cache expired")
        end
      [] ->
        create_unavailable_error(path, "Service and cache unavailable")
    end
  end

  defp cache_value(path, value) do
    :ets.insert(@fallback_table, {{:config_cache, path}, value, System.system_time(:second)})
  end

  defp create_unavailable_error(path, message) do
    # ... logic to create a proper Error struct ...
  end
end

Step 3: Update the Call Chain

The application’s call chain becomes much cleaner and more logical.

  • Old, Confusing Path:

    1. Caller uses GracefulDegradation.get_with_fallback/1.
    2. GracefulDegradation calls Foundation.Config.get/1.
    3. Foundation.Config calls Foundation.Services.ConfigServer.get/1.
    4. ConfigServer (GenServer) handles the call.
    5. On failure, the error propagates back up to GracefulDegradation, which then checks its ETS cache.
  • New, Cohesive Path:

    1. Caller uses the public API: Foundation.Config.get/1.
    2. Foundation.Config delegates to Foundation.Services.ConfigServer.get/1 (the Proxy).
    3. The Proxy attempts to GenServer.call the ConfigServer.Server.
    4. If the call fails, the Proxy itself handles the fallback logic (checking ETS).

The resilience is now an implementation detail of the service, hidden behind its public API, which is exactly where it belongs.

Step 4: Delete Foundation.GracefulDegradation