Claude’s Graceful Degradation Architecture Fix Plan
Executive Summary
After analyzing the codebase and reviewing the Gemini assessments, I’ve identified a significant architectural flaw in the Foundation’s graceful degradation system. The issue is a separation-of-concerns violation: resilience logic is divorced from the services themselves, producing a brittle, non-cohesive design.
Current Problem: `Foundation.Config.GracefulDegradation` exists as a separate module that duplicates and wraps the primary `ConfigServer` API, leading to:
- Poor discoverability of fallback behavior
- Tight coupling to implementation details
- Inconsistent resilience across the API surface
- Cognitive overhead for developers
Solution: Integrate resilience directly into the service layer using the Resilient Proxy Pattern.
Detailed Problem Analysis
Current Architecture (Problematic)
```
Foundation.Config                       <- Public API
      ↓ delegates to
Foundation.Services.ConfigServer        <- Primary GenServer
      ↓ fallback managed by
Foundation.Config.GracefulDegradation   <- Separate resilience module
```
Problems Identified:
1. Split Personality Disorder: Configuration logic lives in THREE places:
   - `Foundation.Config` (public API)
   - `Foundation.Services.ConfigServer` (primary implementation)
   - `Foundation.Config.GracefulDegradation` (fallback logic)
2. Hidden Dependencies: Tests show developers must manually call `GracefulDegradation.get_with_fallback/1` to get resilient behavior, meaning the primary API (`Config.get/1`) is NOT actually resilient (illustrated in the sketch below).
3. White-Box Coupling: `GracefulDegradation` knows the exact error signatures returned by `ConfigServer`, making it brittle to internal changes.
4. Inconsistent API: Only `get/1` and `update/2` have graceful degradation, via separate functions. Other operations like `validate/1`, `available?/0`, and `status/0` have no fallback mechanisms.
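To make the hidden-dependency problem concrete, here is an illustrative sketch of the two entry points callers juggle today. The config path is hypothetical; the function names are the ones cited above:

```elixir
# Path 1: the official API -- NOT resilient today. If ConfigServer is
# down, this call fails outright.
{:ok, provider} = Foundation.Config.get([:ai, :provider])

# Path 2: the hidden fallback API -- resilient, but only if the developer
# knows this module exists and remembers to call it instead.
{:ok, provider} =
  Foundation.Config.GracefulDegradation.get_with_fallback([:ai, :provider])
```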
Root Cause Analysis
The fundamental flaw is separation of resilience from responsibility. The ConfigServer should own its entire lifecycle, including how it behaves when unhealthy. By externalizing this concern, we’ve created:
- Discoverability issues: New developers can’t find the fallback logic
- Maintenance burdens: Changes to ConfigServer can break GracefulDegradation
- API confusion: Two different entry points for the same functionality
- Testing complexity: Must test both paths independently
Proposed Solution: Service-Owned Resilience
New Architecture (Correct)
```
Foundation.Config                            <- Public API (unchanged)
      ↓ delegates to
Foundation.Services.ConfigServer             <- Resilient Proxy
      ↓ manages
Foundation.Services.ConfigServer.GenServer   <- Internal GenServer
```
Key Principle: Each service owns its complete behavior contract, including failure modes.
Implementation Strategy
Phase 1: Create Internal GenServer Module
Extract GenServer Implementation
```bash
mkdir -p lib/foundation/services/config_server/
mv lib/foundation/services/config_server.ex lib/foundation/services/config_server/gen_server.ex
```
Update Module Name
```elixir
# lib/foundation/services/config_server/gen_server.ex
defmodule Foundation.Services.ConfigServer.GenServer do
  use GenServer

  # ... all existing GenServer logic remains the same
end
```
Phase 2: Create Resilient Proxy
Implement New ConfigServer Module
```elixir
# lib/foundation/services/config_server.ex
defmodule Foundation.Services.ConfigServer do
  @moduledoc """
  Resilient configuration service that handles both normal operation
  and graceful degradation in a single, cohesive interface.
  """

  @behaviour Foundation.Contracts.Configurable

  alias Foundation.Contracts.Configurable
  alias Foundation.Services.ConfigServer.GenServer, as: ConfigGenServer
  alias Foundation.Types.Error

  require Logger

  @fallback_table :config_fallback_cache
  @cache_ttl 300  # seconds (5 minutes)

  # --- Public API with Built-in Resilience ---

  @impl Configurable
  def get do
    try_with_fallback(
      fn -> GenServer.call(ConfigGenServer, :get_config) end,
      fn -> {:error, create_service_unavailable_error("get/0")} end
    )
  end

  @impl Configurable
  def get(path) when is_list(path) do
    try_with_fallback(
      fn ->
        case GenServer.call(ConfigGenServer, {:get_config_path, path}) do
          {:ok, value} = result ->
            cache_successful_read(path, value)
            result

          error ->
            error
        end
      end,
      fn -> get_from_cache(path) end
    )
  end

  @impl Configurable
  def update(path, value) when is_list(path) do
    try_with_fallback(
      fn ->
        case GenServer.call(ConfigGenServer, {:update_config, path, value}) do
          :ok ->
            clear_pending_update(path)
            :ok

          error ->
            error
        end
      end,
      fn -> cache_pending_update(path, value) end
    )
  end

  # ... other Configurable functions with consistent resilience patterns

  # --- Lifecycle Management ---

  def start_link(opts \\ []) do
    with {:ok, pid} <- ConfigGenServer.start_link(opts),
         :ok <- initialize_fallback_cache() do
      {:ok, pid}
    end
  end

  def stop, do: ConfigGenServer.stop()

  # --- Private Resilience Implementation ---

  defp try_with_fallback(primary_fn, fallback_fn) do
    primary_fn.()
  rescue
    _ ->
      Logger.warning("ConfigServer unavailable, using fallback")
      fallback_fn.()
  catch
    :exit, _ ->
      Logger.warning("ConfigServer process dead, using fallback")
      fallback_fn.()
  end

  defp get_from_cache(path) do
    # Move existing cache logic from GracefulDegradation here
  end

  defp cache_successful_read(path, value) do
    # Cache successful reads for fallback
  end

  defp cache_pending_update(path, value) do
    # Cache failed updates for retry
  end

  # ... move all private functions from GracefulDegradation module
end
```
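The proxy references a `create_service_unavailable_error/1` helper that isn’t shown above. A minimal sketch, assuming `Foundation.Types.Error` is a struct with `error_type`, `message`, and `severity` fields (align with the real struct definition):

```elixir
# Hypothetical helper; the field names on Foundation.Types.Error are
# assumptions and should be adjusted to the actual struct.
defp create_service_unavailable_error(operation) do
  %Error{
    error_type: :service_unavailable,
    message: "ConfigServer unavailable during #{operation} and no cached value exists",
    severity: :high
  }
end
```

Note that `try_with_fallback/2` leans on Elixir’s implicit `try`: the `rescue` clause catches raised exceptions, while `catch :exit` traps exits from a dead or timing-out GenServer, so every public function gets the same two-layer safety net.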
Phase 3: Update Application Supervision
Update Application Supervisor
```elixir
# lib/foundation/application.ex
@service_definitions %{
  # Change from Foundation.Services.ConfigServer to the GenServer
  config_server: %{
    module: Foundation.Services.ConfigServer.GenServer,  # Direct GenServer
    args: [namespace: :production],
    # ... rest unchanged
  }
}
```
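For readers unfamiliar with the map-driven supervisor, the entry above is equivalent to the plain child spec below. This is a sketch for clarity only; `use GenServer` in the internal module provides the `child_spec/1` that the tuple form relies on:

```elixir
# Equivalent hand-written child spec, assuming the default child_spec/1
# generated by `use GenServer` in Foundation.Services.ConfigServer.GenServer.
children = [
  {Foundation.Services.ConfigServer.GenServer, [namespace: :production]}
]

Supervisor.init(children, strategy: :one_for_one)
```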
Phase 4: Eliminate Separate Graceful Degradation
Remove Graceful Degradation Module
```bash
rm lib/foundation/graceful_degradation.ex
```
Update Tests
- Remove direct calls to `GracefulDegradation.get_with_fallback/1`
- Test resilience through the primary `Config.get/1` API
- Verify all `Configurable` functions have resilient behavior (see the test sketch below)
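A sketch of such a test, assuming the internal GenServer registers under its module name and that the supervisor may restart it at any point; the config path is illustrative:

```elixir
defmodule Foundation.ConfigResilienceTest do
  use ExUnit.Case, async: false

  test "Config.get/1 keeps working when the internal GenServer dies" do
    path = [:ai, :provider]  # illustrative config path

    # Warm the fallback cache with a successful read through the proxy.
    assert {:ok, value} = Foundation.Config.get(path)

    # Kill the internal GenServer mid-flight.
    pid = Process.whereis(Foundation.Services.ConfigServer.GenServer)
    ref = Process.monitor(pid)
    Process.exit(pid, :kill)
    assert_receive {:DOWN, ^ref, :process, ^pid, :killed}

    # The primary API still answers: from the fallback cache, or from the
    # restarted server if the supervisor has already brought it back.
    assert {:ok, ^value} = Foundation.Config.get(path)
  end
end
```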
Benefits of This Approach
1. Single Responsibility Restoration
- `ConfigServer` owns ALL configuration behavior: success AND failure paths
- `Config` remains a simple public API facade
- No hidden modules with critical logic
2. Improved Discoverability
- All configuration behavior discoverable in one module
- Fallback logic is explicit in each function implementation
- No need to hunt for separate resilience modules
3. Loose Coupling
- Internal `GenServer` becomes an implementation detail
- Public proxy interface shields callers from internal changes
- `GenServer` can be refactored freely without breaking external modules
4. Consistent Resilience
- ALL `Configurable` functions can have appropriate fallback strategies
- No confusion about which functions are resilient vs. brittle
- Uniform error handling and logging across all operations
5. Better Testing
- Single entry point to test both success and failure scenarios
- No need to test multiple modules for configuration functionality
- Simplified mocking and error injection
Migration Strategy
Step 1: Parallel Implementation (Safe)
- Create new `ConfigServer.GenServer` module alongside the existing one
- Implement new resilient `ConfigServer` proxy
- Run both systems in parallel during testing
Step 2: Gradual Migration (Controlled)
- Update application supervisor to use new GenServer
- Verify all tests pass with new implementation
- Update any direct GenServer references to use proxy
Step 3: Cleanup (Final)
- Remove old `GracefulDegradation` module
- Update documentation and examples
- Remove deprecated test scenarios
Implementation Notes
Error Handling Strategy
- Use `try/rescue/catch` for comprehensive error capture
- Log all fallback activations for observability (see the logging sketch below)
- Provide detailed error context for debugging
- Maintain backward compatibility with existing error formats
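One way to satisfy the logging and context points above, sketched as a small helper; the helper name is ours, not from the codebase:

```elixir
# Hypothetical helper; call it from the rescue/catch clauses of
# try_with_fallback/2 after binding the caught error, e.g.
# `rescue e -> log_fallback(:get, e); fallback_fn.()`.
defp log_fallback(operation, reason) do
  Logger.warning("ConfigServer fallback activated",
    operation: operation,
    reason: inspect(reason)
  )
end
```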
Cache Management
- Move ETS table initialization into proxy module (see the cache sketch below)
- Use same TTL and cleanup strategies as existing implementation
- Add telemetry for cache hit/miss rates
- Consider adding cache warming strategies
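A minimal sketch of what the relocated cache logic could look like, assuming the `@fallback_table` and `@cache_ttl` attributes from the proxy module and a timestamp-per-entry expiry; the real implementation should carry over the existing GracefulDegradation logic rather than rewrite it. This also fills in the `initialize_fallback_cache/0` referenced by `start_link/1`:

```elixir
defp initialize_fallback_cache do
  # Public table so the fallback path can read it from any calling process.
  :ets.new(@fallback_table, [:named_table, :public, :set, read_concurrency: true])
  :ok
rescue
  # Table already exists (e.g., the proxy restarted); that's fine.
  ArgumentError -> :ok
end

defp cache_successful_read(path, value) do
  :ets.insert(@fallback_table, {{:config, path}, value, System.monotonic_time(:second)})
end

defp get_from_cache(path) do
  case :ets.lookup(@fallback_table, {:config, path}) do
    [{_key, value, inserted_at}] ->
      if System.monotonic_time(:second) - inserted_at <= @cache_ttl do
        {:ok, value}
      else
        # Entry expired; evict and report unavailability.
        :ets.delete(@fallback_table, {:config, path})
        {:error, create_service_unavailable_error("get/1")}
      end

    [] ->
      {:error, create_service_unavailable_error("get/1")}
  end
end
```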
Testing Approach
- Test resilience by killing GenServer process during operations
- Verify cache behavior with various failure scenarios
- Add property-based tests for cache consistency (sketched below)
- Performance testing to ensure no regressions
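For the property-based tests, a sketch using the `stream_data` package; the generators here are illustrative and would need to be constrained to paths the configuration schema actually accepts:

```elixir
defmodule Foundation.ConfigCachePropertyTest do
  use ExUnit.Case, async: false
  use ExUnitProperties

  property "update/2 followed by get/1 round-trips through the public API" do
    check all path <- list_of(atom(:alphanumeric), min_length: 1, max_length: 3),
              value <- one_of([integer(), boolean(), string(:printable)]) do
      # Arbitrary generated paths will often be rejected by a validating
      # server; real generators should enumerate valid schema paths.
      case Foundation.Config.update(path, value) do
        :ok -> assert {:ok, ^value} = Foundation.Config.get(path)
        {:error, _} -> :ok  # path rejected by schema validation
      end
    end
  end
end
```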
Observability
- Add telemetry events for fallback activations (see the sketch below)
- Track cache hit ratios and fallback frequency
- Monitor service recovery times
- Alert on excessive fallback usage
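A sketch of the fallback telemetry using the standard `:telemetry` library; the event name and measurements are a suggestion, not an existing Foundation convention:

```elixir
defmodule Foundation.ConfigFallbackReporter do
  require Logger

  # Attach once at application start.
  def attach do
    :telemetry.attach(
      "config-fallback-alerts",
      [:foundation, :config, :fallback],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event([:foundation, :config, :fallback], %{count: count}, metadata, _config) do
    Logger.warning("Config fallback activated (count=#{count}): #{inspect(metadata)}")
  end
end

# Emitted from the fallback branches of try_with_fallback/2:
:telemetry.execute([:foundation, :config, :fallback], %{count: 1}, %{operation: :get, reason: :process_exit})
```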
Expected Outcomes
After implementing this fix:
- Developers will naturally discover resilient behavior when reading `ConfigServer` code
- Service changes won’t break external resilience assumptions due to proper encapsulation
- All configuration operations will have consistent error handling
- Cognitive load will decrease with single-module ownership of functionality
- API surface will be more trustworthy with uniform resilience patterns
This represents a fundamental architectural improvement that addresses the root cause of the graceful degradation “smell” while maintaining all existing functionality and improving the developer experience.