← Back to Docs20250627

038 GEMINI GRACEFUL FIX PLAN 02

Documentation for 038_GEMINI_GRACEFUL_FIX_PLAN_02 from the Foundation repository.

Of course. This is an excellent request, as it gets to the heart of why this architectural change is so important. Arguing for a refactor is much more effective when you can clearly articulate the future pain points the team will avoid.

Here is a detailed analysis of the maintainability issues that would arise from the current architecture, contrasted with the benefits of the proposed resilient proxy pattern, with explicit references to the code.


Analysis of Maintainability Issues

The proposed refactoring isn’t just a style preference; it’s a crucial step to prevent a class of escalating problems that make a system difficult and expensive to maintain.

1. The “Hidden Logic” Problem (Low Discoverability)

Current Issue: Critical business logic (how to handle configuration when the system is down) is hidden in a module that is not obviously related to configuration.

  • Scenario (Without the Fix): A new developer, Alice, is tasked with adding a new configuration value. She correctly traces the public API: Foundation.Config.get/1 -> Foundation.Services.ConfigServer.get/1. She reads the ConfigServer module and understands how it retrieves data from its state.

    Later, a bug report comes in: “Configuration is being served even when the ConfigServer process has crashed.” Alice spends hours debugging. She kills the ConfigServer process, but her test code still receives a config value. She sets traces on ConfigServer and Config, but they don’t fire. She is completely stumped because she has no reason to look inside foundation/graceful_degradation.ex. The logic is non-obvious and disconnected from the primary module responsible for the feature.

  • How the Fix Improves This (High Discoverability): With the resilient proxy pattern, the call chain is Foundation.Config.get/1 -> Foundation.Services.ConfigServer.get/1. When Alice opens foundation/services/config_server.ex, she immediately sees the entire story in one place:

    # foundation/services/config_server.ex (The New Proxy)
    def get(path) do
      # --- HAPPY PATH ---
      case GenServer.call(Server, {:get_config_path, path}) do
        {:ok, value} ->
          cache_value(path, value) # Caching logic is right here
          {:ok, value}
        # --- FAILURE PATH ---
        {:error, _} = error ->
          # The fallback logic is explicit and discoverable within the same function.
          Logger.warning(...)
          get_from_cache(path, error)
      end
    rescue
      # --- CATASTROPHIC FAILURE PATH ---
      # The logic for when the process doesn't even exist is also here.
      _ ->
        Logger.warning(...)
        get_from_cache(path, ...)
    end
    

    The entire behavior of get/1, both success and failure, is self-contained and immediately understandable. The cognitive load is drastically reduced.

2. The “Brittle Contract” Problem (Tight Coupling)

Current Issue: The GracefulDegradation module is tightly coupled to the specific error signature of ConfigServer, not to a stable contract.

  • Scenario (Without the Fix): A senior developer, Bob, decides to refactor the ConfigServer for better performance. He changes the handle_call for :get_config_path to raise a KeyError exception if the path doesn’t exist, believing that nil paths are an exceptional circumstance.

    # In foundation/services/config_server.ex (The OLD GenServer)
    def handle_call({:get_config_path, path}, _from, state) do
      # Bob's "improvement"
      value = get_in(state.config, path) || raise KeyError, "path not found: #{inspect(path)}"
      {:reply, {:ok, value}, state}
    end
    

    The ConfigServer’s own tests all pass. However, the system now has a critical, hidden bug. The GracefulDegradation module’s case statement will never be triggered by this exception:

    # In foundation/graceful_degradation.ex
    def get_with_fallback(path) do
      case Config.get(path) do
        # This {:error, _} branch will NEVER be hit now for a missing path.
        # The code will instead crash with an unhandled KeyError.
        {:ok, value} -> #...
        {:error, _reason} -> get_from_cache(path)
      end
    end
    

    The resilience logic, which was supposed to make the system more robust, has made it more brittle. A seemingly unrelated change in one module silently breaks another.

  • How the Fix Improves This (Loose Coupling & Encapsulation): With the proxy pattern, the ConfigServer.Server (the GenServer) is a private implementation detail. Only the ConfigServer (the proxy) interacts with it.

    If Bob makes the same change to ConfigServer.Server, the crash is immediately caught and handled inside the proxy module itself.

    # foundation/services/config_server.ex (The New Proxy)
    def get(path) do
      GenServer.call(Server, {:get_config_path, path})
    rescue
      # Bob's KeyError is caught HERE, in the correct module.
      # The rest of the system is completely shielded from this implementation change.
      KeyError ->
        Logger.warning("Path not found in ConfigServer. Reading from cache.")
        get_from_cache(...)
      # Catches other failures, like the process being dead.
      _ ->
        get_from_cache(...)
    end
    

    The public-facing module now enforces the contract, allowing the internal GenServer to be refactored freely without fear of breaking distant, unrelated modules.

3. The “Inconsistent Resilience” Problem (Partial Implementation)

Current Issue: The fallback logic is applied inconsistently across the Configurable API.

  • Scenario (Without the Fix): A developer needs to dynamically check which configuration paths are allowed to be updated at runtime. They correctly use the public API: Foundation.Config.updatable_paths/0.

    # This call path has NO resilience.
    paths = Foundation.Config.updatable_paths()
    

    This delegates directly to ConfigServer, which delegates to ConfigLogic. If the ConfigServer process is down, this call will crash. There is no updatable_paths_with_fallback function in GracefulDegradation. The developer has to know which functions are resilient and which are not. This is a violation of the Principle of Least Surprise and makes the API untrustworthy.

  • How the Fix Improves This (Cohesive Implementation): With the proxy pattern, it becomes natural and easy to provide a resilience strategy for every single function in the Configurable behaviour, because they all live in the same proxy module.

    # foundation/services/config_server.ex (The New Proxy)
    
    # get/1 has fallback logic.
    @impl Configurable
    def get(path), do: # ... logic with fallback
    
    # update/2 has fallback logic.
    @impl Configurable
    def update(path, value), do: # ... logic with fallback
    
    # updatable_paths/0 can now have its own, appropriate resilience strategy.
    # In this case, it might be safe to return a cached or even a default empty list.
    @impl Configurable
    def updatable_paths() do
      case GenServer.call(Server, :updatable_paths) do
        {:ok, paths} ->
          cache_value(:updatable_paths, paths)
          paths
        _ ->
          # It's better to return an empty list than to crash.
          Logger.warning("ConfigServer unavailable. Returning empty list for updatable_paths.")
          get_from_cache(:updatable_paths, []) |> elem(1) # get value or default
      end
    rescue
      _ ->
        Logger.warning("ConfigServer process is dead. Returning empty list for updatable_paths.")
        get_from_cache(:updatable_paths, []) |> elem(1)
    end
    

    The API becomes consistent and predictable. The developer doesn’t need to guess which functions are safe to call during an outage; the service itself makes a sensible decision for each one.