You are absolutely right to elevate this from a “minor” issue to a significant one. While the system works, this architectural choice represents a fundamental design flaw that will cause escalating problems related to maintainability, cognitive load, and brittleness as the codebase grows. It’s a classic example of a “code smell” that indicates a deeper structural problem.
Let’s do a full, detailed breakdown of the issues and then lay out a concrete, step-by-step plan to address it properly.
Part 1: A Deep Dive into the Flaw
The core problem is that the resilience logic (what to do when `ConfigServer` is down) is completely divorced from the service itself. This creates several major issues:
1. Violation of the Single Responsibility Principle (SRP) & Cohesion
A service should be responsible for its entire lifecycle and behavior, including its failure modes. The `ConfigServer`’s responsibility is “to provide configuration.” This implicitly includes “what to do when configuration cannot be provided normally.”
- Current State (Low Cohesion):
  - `Foundation.Services.ConfigServer`: Knows how to provide config when it’s healthy.
  - `Foundation.GracefulDegradation`: Knows how to provide config when `ConfigServer` is unhealthy.
  - `Foundation.Config`: The public API that has to be aware of both of the above modules to function correctly.
This is a fractured design: the logic for a single conceptual unit (configuration) is spread across three separate modules.
2. Tight Coupling and Brittleness
The `GracefulDegradation` module is dangerously coupled to the implementation details and failure modes of `ConfigServer`. It’s a “white-box” dependency.
- The Problem: The `get_with_fallback` function knows that `Config.get()` will return `{:error, _}` when the server is down. What if a future developer changes `ConfigServer` to raise an exception on failure instead? The `GracefulDegradation` module would instantly break because its `case` statement wouldn’t match. It’s relying on a specific failure signature, not a public contract. This makes the system brittle and resistant to refactoring.
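To make the coupling concrete, here is a hypothetical sketch of the kind of clause `get_with_fallback/1` relies on. The module body and the `read_from_fallback_cache/1` helper are illustrative assumptions, not the project’s actual code:

```elixir
# Hypothetical sketch of the coupling, not the real module: the fallback only
# runs when Config.get/1 returns an {:error, _} tuple. If ConfigServer were
# changed to raise on failure instead, the exception would propagate past this
# case and the ETS fallback would never run.
defmodule Foundation.GracefulDegradation do
  def get_with_fallback(path) do
    case Foundation.Config.get(path) do
      {:ok, value} -> {:ok, value}
      {:error, _reason} -> read_from_fallback_cache(path)
    end
  end

  # Stub standing in for the real ETS lookup, for illustration only.
  defp read_from_fallback_cache(_path), do: {:error, :config_unavailable}
end
```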
3. Increased Cognitive Load & Poor Discoverability
Imagine you are a new developer tasked with understanding how configuration is handled.
- You start at the public API: `Foundation.Config.get/1`.
- You see it delegates to `Foundation.Services.ConfigServer.get/1`.
- You read `ConfigServer` and see it’s a `GenServer` that handles a `:get_config_path` call. You think you understand the whole picture.
You would have no way of knowing that a completely separate module, `GracefulDegradation`, exists and contains critical fallback logic. You might make changes assuming there is no resilience, or you might spend hours debugging why a config value is still being returned even after you killed the server. The current design is not intuitive or self-documenting.
4. Incomplete Resilience Strategy
The fallback logic is applied inconsistently. `GracefulDegradation` provides fallbacks for `get` and `update`, but what about other functions on the `Configurable` behaviour, like `reset/0` or `updatable_paths/0`? They don’t have a fallback mechanism. This leads to an unpredictable API where some functions are resilient and others are not, violating the Principle of Least Surprise.
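The inconsistency is easiest to see against the behaviour itself. A sketch of its shape, using only the callbacks named above (the real `Foundation.Contracts.Configurable` may declare more callbacks and richer typespecs):

```elixir
defmodule Foundation.Contracts.Configurable do
  # Only get/1 and update/2 currently get a fallback path via
  # GracefulDegradation; the remaining callbacks fail outright when the
  # server is down.
  @callback get(path :: [atom()]) :: {:ok, term()} | {:error, term()}
  @callback update(path :: [atom()], value :: term()) :: :ok | {:error, term()}
  @callback reset() :: :ok | {:error, term()}
  @callback updatable_paths() :: [[atom()]]
end
```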
Part 2: The Refactoring Plan: The Resilient Service Proxy
The solution is to make each service responsible for its own resilience. The `GracefulDegradation` module must be eliminated, and its logic moved into the service layer itself. The best way to do this without cluttering the `GenServer` is by using a Resilient Proxy pattern.
The Goal
The `Foundation.Services.ConfigServer` module will become the single source of truth for all configuration logic, both for the happy path and for failure scenarios. The client should never have to know whether the `GenServer` is up or if it’s reading from a cache.
Step 1: Separate the GenServer from its API
The first step is to break the `ConfigServer` module in two.
- Create `Foundation.Services.ConfigServer.Server`: Rename the existing `ConfigServer` `GenServer` implementation to this new, internal module name. This module’s only job is to be the stateful `GenServer`.
- Make `Foundation.Services.ConfigServer` the Proxy: The original module will no longer be a `GenServer`. It will become a stateless proxy module that contains the public-facing functions (`get/1`, `update/2`, etc.) and the resilience logic.
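As a reference point for Step 1, here is a minimal sketch of what the extracted `ConfigServer.Server` could look like. It assumes the server keeps the configuration in its state as a nested map and answers the `{:get_config_path, path}` call used by the proxy below; the `init/1` and `handle_call/3` details are illustrative, not the existing implementation:

```elixir
defmodule Foundation.Services.ConfigServer.Server do
  @moduledoc false
  # Internal, stateful GenServer. The supervisor starts this module directly;
  # all callers go through the Foundation.Services.ConfigServer proxy.
  use GenServer

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(opts) do
    # Illustrative: assume the configuration is loaded into a nested map.
    {:ok, %{config: Keyword.get(opts, :config, %{})}}
  end

  @impl true
  def handle_call({:get_config_path, path}, _from, state) do
    # Walk the nested map with the path; reply in the {:ok, _} / {:error, _}
    # shape the proxy expects.
    case get_in(state.config, path) do
      nil -> {:reply, {:error, :not_found}, state}
      value -> {:reply, {:ok, value}, state}
    end
  end
end
```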
Step 2: Implement the Resilient Proxy Logic
Now, move the logic from `GracefulDegradation` into the new `ConfigServer` proxy module.
File: `lib/foundation/services/config_server.ex` (The New Proxy)
```elixir
defmodule Foundation.Services.ConfigServer do
  @moduledoc """
  Resilient proxy for the configuration service.

  This module is the single public entrypoint for configuration. It handles
  the primary logic of calling the ConfigServer.Server GenServer and implements
  all fallback and caching logic if the primary service is unavailable.
  """

  # Behave like the contract
  @behaviour Foundation.Contracts.Configurable

  require Logger

  alias Foundation.Contracts.Configurable
  alias Foundation.Services.ConfigServer.Server # The actual GenServer
  alias Foundation.Types.Error

  # NOTE: this named ETS table must be created once at startup (e.g., by the
  # application supervisor or by Server.init/1) as a :public :named_table.
  @fallback_table :config_fallback_cache
  @cache_ttl 300 # seconds (5 minutes)

  # This is no longer a GenServer, so no start_link here.
  # The application supervisor will now start ConfigServer.Server directly.

  @impl Configurable
  def get(path) when is_list(path) do
    # Try the primary service first.
    case GenServer.call(Server, {:get_config_path, path}) do
      {:ok, value} ->
        # Success! Cache the value and return it.
        cache_value(path, value)
        {:ok, value}

      {:error, _reason} ->
        # Primary failed. Log it and try the fallback.
        Logger.warning("ConfigServer is unavailable. Attempting to read from fallback cache.")
        get_from_cache(path)
    end
  catch
    # GenServer.call exits (it does not raise) if the server process is not
    # alive or does not reply within the timeout.
    :exit, _reason ->
      Logger.warning("ConfigServer process is not alive. Reading from fallback cache.")
      get_from_cache(path)
  end

  # ... other functions like update/2, reset/0 would follow the same pattern ...
  # e.g., try GenServer, on failure/exit, use fallback logic.

  # --- Fallback Logic (moved from GracefulDegradation) ---

  defp get_from_cache(path) do
    cache_key = {:config_cache, path}

    case :ets.lookup(@fallback_table, cache_key) do
      [{^cache_key, value, timestamp}] ->
        if System.system_time(:second) - timestamp <= @cache_ttl do
          {:ok, value}
        else
          :ets.delete(@fallback_table, cache_key)
          create_unavailable_error(path, "Cache expired")
        end

      [] ->
        create_unavailable_error(path, "Service and cache unavailable")
    end
  end

  defp cache_value(path, value) do
    :ets.insert(@fallback_table, {{:config_cache, path}, value, System.system_time(:second)})
  end

  defp create_unavailable_error(path, message) do
    # ... logic to create a proper Error struct ...
  end
end
```
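One consequence noted in the proxy’s comments: the application supervisor must now start the internal `Server` rather than the old `ConfigServer`. A sketch of that change, placed inside the application’s `start/2` callback (the surrounding supervisor module is assumed):

```elixir
# In the application's supervision tree:
children = [
  # Before: Foundation.Services.ConfigServer (when it was itself a GenServer)
  # After: start the internal server; the proxy module is stateless.
  Foundation.Services.ConfigServer.Server
]

Supervisor.start_link(children, strategy: :one_for_one)
```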
Step 3: Update the Call Chain
The application’s call chain becomes much cleaner and more logical.
Old, Confusing Path:
1. Caller uses `GracefulDegradation.get_with_fallback/1`.
2. `GracefulDegradation` calls `Foundation.Config.get/1`.
3. `Foundation.Config` calls `Foundation.Services.ConfigServer.get/1`.
4. `ConfigServer` (GenServer) handles the call.
5. On failure, the error propagates back up to `GracefulDegradation`, which then checks its ETS cache.
New, Cohesive Path:
1. Caller uses the public API: `Foundation.Config.get/1`.
2. `Foundation.Config` delegates to `Foundation.Services.ConfigServer.get/1` (the Proxy).
3. The Proxy attempts to `GenServer.call` the `ConfigServer.Server`.
4. If the call fails, the Proxy itself handles the fallback logic (checking ETS).
The resilience is now an implementation detail of the service, hidden behind its public API, which is exactly where it belongs.
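If `Foundation.Config` is a thin facade, step 2 of the new path can be plain delegation. A minimal sketch, assuming it exposes the same function names as the `Configurable` contract:

```elixir
defmodule Foundation.Config do
  # Public facade: every call goes through the resilient proxy, so callers
  # get the fallback behaviour without ever knowing it exists.
  defdelegate get(path), to: Foundation.Services.ConfigServer
  defdelegate update(path, value), to: Foundation.Services.ConfigServer
end
```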