← Back to Docs20250627

037 GEMINI GRACEFUL FIX PLAN

Documentation for 037_GEMINI_GRACEFUL_FIX_PLAN from the Foundation repository.

Of course. This is an excellent analysis of the architectural flaw. The proposed solution to refactor the ConfigServer into a resilient proxy is the correct approach to improve cohesion, reduce coupling, and create a more maintainable system.

Here is a detailed, step-by-step plan to execute this fix.

Fix Plan: Refactoring ConfigServer to a Resilient Proxy

The goal is to eliminate the Foundation.GracefulDegradation module by integrating its resilience logic directly into the configuration service layer. This will be achieved by splitting the existing ConfigServer into a GenServer process (Server) and a stateless public-facing API (Proxy), which handles fallback logic.


Phase 1: Preparation & Scaffolding

This phase prepares the file structure and new modules without changing existing logic.

  1. Create New Directory: Create a new directory to house the split modules:

    mkdir -p foundation/services/config_server
    
  2. Move and Rename the GenServer: Move the existing ConfigServer GenServer implementation to the new directory and rename it to server.ex.

    mv foundation/services/config_server.ex foundation/services/config_server/server.ex
    
  3. Update the GenServer Module: In the newly moved file, foundation/services/config_server/server.ex, change the module definition from Foundation.Services.ConfigServer to Foundation.Services.ConfigServer.Server.

    File: foundation/services/config_server/server.ex

    # Change this line
    defmodule Foundation.Services.ConfigServer.Server do
      # ... rest of the file remains the same for now ...
      use GenServer
      # ...
    end
    
  4. Create the New Proxy Module File: Create a new, empty file that will become our resilient proxy.

    touch foundation/services/config_server.ex
    

Phase 2: Implement the Resilient Proxy

This is the core of the refactoring. We will populate the new proxy module with the combined logic of the old ConfigServer API and the GracefulDegradation fallback mechanism.

  1. Delete foundation/graceful_degradation.ex: This module’s logic will be moved, so it is no longer needed.

    rm foundation/graceful_degradation.ex
    
  2. Implement the New ConfigServer Proxy: Populate foundation/services/config_server.ex with the following code. This code implements the Configurable behaviour and contains the fallback logic.

    File: foundation/services/config_server.ex

    defmodule Foundation.Services.ConfigServer do
      @moduledoc """
      Resilient proxy for the configuration service.
    
      This module is the single public entrypoint for configuration. It handles
      the primary logic of calling the ConfigServer.Server GenServer and implements
      all fallback and caching logic if the primary service is unavailable.
      """
    
      @behaviour Foundation.Contracts.Configurable
    
      alias Foundation.Services.ConfigServer.Server
      alias Foundation.Types.{Config, Error}
      require Logger
    
      @fallback_table :config_fallback_cache
      @cache_ttl 300 # 5 minutes
    
      # --- Lifecycle and Fallback System Management ---
    
      @impl Foundation.Contracts.Configurable
      def initialize(opts \\ []) do
        # This function starts the actual GenServer. The proxy itself is stateless.
        case Server.start_link(opts) do
          {:ok, _pid} ->
            initialize_fallback_system()
            :ok
          {:error, {:already_started, _pid}} ->
            initialize_fallback_system()
            :ok
          {:error, reason} ->
            {:error, Error.new(:initialization_failed, "Failed to start ConfigServer: #{inspect(reason)}")}
        end
      end
    
      def initialize_fallback_system do
        case :ets.whereis(@fallback_table) do
          :undefined ->
            :ets.new(@fallback_table, [:named_table, :public, :set, read_concurrency: true])
            Logger.info("Config fallback cache initialized.")
          _ ->
            :ok
        end
      end
    
      def stop(), do: Server.stop()
    
      # --- Resilient API Implementation ---
    
      @impl Foundation.Contracts.Configurable
      def get() do
        GenServer.call(Server, :get_config)
      rescue
        _ -> {:error, create_unavailable_error("get/0", "Service unavailable")}
      end
    
      @impl Foundation.Contracts.Configurable
      def get(path) do
        case GenServer.call(Server, {:get_config_path, path}) do
          {:ok, value} ->
            cache_value(path, value)
            {:ok, value}
          {:error, _} = error ->
            Logger.warning("ConfigServer primary lookup failed for path #{inspect(path)}. Attempting fallback.")
            get_from_cache(path, error)
        end
      rescue
        _ ->
          Logger.warning("ConfigServer process is not available for path #{inspect(path)}. Attempting fallback.")
          get_from_cache(path, create_unavailable_error(path, "Process not available"))
      end
    
      @impl Foundation.Contracts.Configurable
      def update(path, value) do
        case GenServer.call(Server, {:update_config, path, value}) do
          :ok ->
            # Update was successful, clear any pending updates for this path
            :ets.delete(@fallback_table, {:pending_update, path})
            :ok
          {:error, _} = error ->
            # Primary update failed, cache as a pending update
            cache_pending_update(path, value)
            error
        end
      rescue
        _ ->
          cache_pending_update(path, value)
          {:error, create_unavailable_error(path, "Update failed, service unavailable")}
      end
    
      @impl Foundation.Contracts.Configurable
      def reset() do
        # This operation is critical and should probably not have a fallback.
        # If the server is down, a reset cannot be guaranteed.
        GenServer.call(Server, :reset_config)
      rescue
        _ -> {:error, create_unavailable_error("reset/0", "Cannot reset, service unavailable")}
      end
    
      # --- Delegated and Passthrough Functions ---
    
      defdelegate validate(config), to: Server
      defdelegate updatable_paths(), to: Server
      defdelegate available?(), to: Server
      defdelegate status(), to: Server
      defdelegate subscribe(pid \\ self()), to: Server
      defdelegate unsubscribe(pid \\ self()), to: Server
      defdelegate reset_state(), to: Server # For testing
    
      # --- Private Fallback Logic ---
    
      defp get_from_cache(path, original_error) do
        cache_key = {:config_cache, path}
        case :ets.lookup(@fallback_table, cache_key) do
          [{^cache_key, value, timestamp}] when (System.system_time(:second) - timestamp) <= @cache_ttl ->
            Logger.debug("Using cached config value for path: #{inspect(path)}")
            {:ok, value}
          [{^cache_key, _, _}] ->
            :ets.delete(@fallback_table, cache_key)
            Logger.warning("Cache expired for path: #{inspect(path)}")
            {:error, create_unavailable_error(path, "Service unavailable and cache expired", original_error)}
          [] ->
            Logger.warning("No cached value available for path: #{inspect(path)}")
            {:error, create_unavailable_error(path, "Service and cache unavailable", original_error)}
        end
      end
    
      defp cache_value(path, value) do
        :ets.insert(@fallback_table, {{:config_cache, path}, value, System.system_time(:second)})
      end
    
      defp cache_pending_update(path, value) do
        Logger.warning("Caching pending config update for path: #{inspect(path)}")
        :ets.insert(@fallback_table, {{:pending_update, path}, value, System.system_time(:second)})
      end
    
      defp create_unavailable_error(context_info, message, original_error \\ nil) do
        Error.new(
          code: 5000,
          error_type: :service_unavailable,
          message: "Configuration service not available. #{message}.",
          severity: :high,
          context: %{path: context_info, original_error: original_error}
        )
      end
    end
    

Phase 3: Update System Integration

Now, we update the rest of the system to use the new proxy and server structure.

  1. Update Foundation.Application: Modify foundation/application.ex to start the new Foundation.Services.ConfigServer.Server and initialize the fallback cache.

    File: foundation/application.ex

    # In the @service_definitions map:
    # ...
    config_server: %{
      # Change the module to point to the GenServer implementation
      module: Foundation.Services.ConfigServer.Server,
      args: [namespace: :production],
      dependencies: [:process_registry],
      startup_phase: :foundation_services,
      health_check_interval: 30_000,
      restart_strategy: :permanent
    },
    # ...
    
    # In the start/2 function, after setup_application_config/0:
    def start(_type, _args) do
      # ...
      setup_application_config()
    
      # Initialize the fallback system for ConfigServer
      Foundation.Services.ConfigServer.initialize_fallback_system()
    
      # Initialize infrastructure components
      # ...
    end
    
  2. Update Foundation.Config: This is the highest-level public API. It should continue to delegate to Foundation.Services.ConfigServer, which is now our resilient proxy. This change is seamless because the module name Foundation.Services.ConfigServer remains the public entry point. No changes should be needed in foundation/config.ex, but it’s crucial to verify this.

    File: foundation/config.ex

    defmodule Foundation.Config do
      # ...
      alias Foundation.Services.ConfigServer # This now correctly aliases our proxy
    
      # ...
      defdelegate get(), to: ConfigServer # This now calls the proxy
      # ... all other defdelegates are correct.
    end
    
  3. Global Code Search: Perform a global search for GracefulDegradation in the codebase. The only call should have been in Foundation.Config, which we’ve now conceptually replaced. Remove any lingering references. The get_with_fallback function in Foundation.Config can now be safely removed as get/1 is now inherently resilient.

    File: foundation/config.ex

    # This function is now redundant and should be REMOVED.
    # The new `get/1` implementation handles the fallback.
    @spec get_with_default(config_path(), config_value()) :: config_value()
    @dialyzer {:nowarn_function, get_with_default: 2}
    def get_with_default(path, default) do
      case get(path) do
        {:ok, value} -> value
        {:error, _} -> default
      end
    end
    

Phase 4: Verification and Testing

  1. Update Existing Tests:

    • Any tests that were directly testing Foundation.Services.ConfigServer as a GenServer should be updated to test Foundation.Services.ConfigServer.Server.
    • Tests for Foundation.Config should continue to pass without modification.
  2. Create New Tests for the Proxy: Create a new test file: test/foundation/services/config_server_test.exs.

    • Happy Path: Write a test where ConfigServer.Server is started, and verify that ConfigServer.get/1 successfully retrieves values from it.
    • Failure/Fallback Path: Write a test where ConfigServer.Server is not started.
      • Assert that ConfigServer.get/1 returns an :service_unavailable error initially.
      • Manually insert a value into the :config_fallback_cache ETS table.
      • Assert that a subsequent call to ConfigServer.get/1 now returns {:ok, value} from the cache.
    • Cache Expiry Test:
      • Manually insert a value into the cache with a timestamp older than the TTL.
      • Assert that ConfigServer.get/1 returns an error and that the entry has been deleted from ETS.
    • Pending Update Test:
      • Ensure the Server is down.
      • Call ConfigServer.update/2. Assert that it returns an error but that an entry for {:pending_update, path} now exists in the ETS table.

By following this plan, you will successfully refactor the configuration system to be more robust, cohesive, and maintainable, fully addressing the architectural flaw.