Technical Specification: Configurable Graceful Degradation Strategies
Document Version: 1.0
Author: AI Assistant
Status: PROPOSED
1. Overview
This document specifies a new, high-level module, Foundation.GracefulDegradation, designed to provide a clean, declarative, and composable pattern for handling failures in service interactions. While CircuitBreaker and retry logic handle transient failures, this module addresses semantic failures and provides a structured way to define fallback strategies.
The primary driver for this feature is the need for DSPEx programs to be highly resilient and cost-effective. When a primary, high-capability (and high-cost) LLM like gpt-4o is unavailable or failing, a DSPEx program shouldn’t just fail. It should be able to gracefully degrade to a secondary, cheaper model like gpt-4o-mini, or even fall back to a cached result or a simple, deterministic response. This module will provide the generic building blocks for such patterns.
The core of this proposal is a new function, GracefulDegradation.execute_with_strategy/2, which executes a list of functions in order, stopping at the first success.
2. Problem Statement & Use Case
A DSPEx application has a critical summarization feature. The ideal user experience is provided by gpt-4o.
The Failure Scenarios:
- The OpenAI API for gpt-4o is experiencing an outage. The circuit breaker for this service is open.
- The API is up, but it’s overloaded, and requests are timing out.
- The API returns a valid response, but the content is malformed, causing the DSPEx.Adapter to fail parsing.
In all these cases, the DSPEx program call would currently return an {:error, ...} tuple. The application needs a structured way to define what should happen next. The desired behavior is:
- Try: Use gpt-4o.
- If that fails, then try: Use the cheaper, possibly lower-quality gpt-4o-mini.
- If that also fails, then try: Look up the result in a long-term cache.
- If all else fails, then: Return a static, apologetic message.
Implementing this logic directly in the application code leads to deeply nested case statements, making the code hard to read, maintain, and test.
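To make the problem concrete, here is a minimal sketch of the nested control flow that this approach tends to produce. The module NestedFallbackSketch and its stub functions (call_gpt4o/1, call_gpt4o_mini/1, lookup_cache/1) are hypothetical stand-ins, not DSPEx APIs; they simulate failures so the shape of the code can be seen in isolation.

```elixir
defmodule NestedFallbackSketch do
  # Hypothetical stubs: each simulates one failing step of the fallback chain.
  def call_gpt4o(_text), do: {:error, :circuit_open}
  def call_gpt4o_mini(_text), do: {:error, :timeout}
  def lookup_cache(_text), do: {:error, :cache_miss}

  @static_message "We are unable to provide a summary at this time."

  # Each additional fallback adds another level of nesting.
  def summarize(text) do
    case call_gpt4o(text) do
      {:ok, summary} ->
        {:ok, summary}

      {:error, _primary_reason} ->
        case call_gpt4o_mini(text) do
          {:ok, summary} ->
            {:ok, summary}

          {:error, _secondary_reason} ->
            case lookup_cache(text) do
              {:ok, summary} -> {:ok, summary}
              {:error, _cache_reason} -> {:ok, @static_message}
            end
        end
    end
  end
end
```

With four steps the function is already three levels deep, and logging which step failed would have to be repeated in every error branch.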
3. Proposed API and Architecture
3.1. Foundation.GracefulDegradation.execute_with_strategy/2 (New Function)
This will be the sole public function in the new module. It embodies the “Chain of Responsibility” design pattern in a functional way.
File: lib/foundation/graceful_degradation.ex
New Function:
defmodule Foundation.GracefulDegradation do
  @moduledoc """
  Provides a structured way to execute operations with fallback strategies.
  """

  @type strategy_fun :: (-> {:ok, term()} | {:error, term()})
  @type strategy_list :: [strategy_fun()]

  @doc """
  Executes a list of strategy functions in order, returning the result
  of the first function that returns `{:ok, value}`.

  If all functions in the list return `{:error, reason}`, this function
  returns the error from the *last* attempted strategy. An empty list
  returns `{:error, :no_strategies_provided}`.

  ## Parameters

    - `strategies`: A list of 0-arity functions, each representing a
      step in the fallback chain.
    - `opts`: A keyword list of options.
      - `:on_fallback`: An optional `(error, next_strategy_index) -> :ok`
        callback that is triggered when a strategy fails and a fallback
        is about to be attempted. It is not called after the final
        strategy fails. Useful for logging or telemetry.

  ## Returns

    - `{:ok, value}` from the first successful strategy.
    - `{:error, reason}` from the last strategy if all fail.
  """
  @spec execute_with_strategy(strategies :: strategy_list(), opts :: keyword()) ::
          {:ok, term()} | {:error, term()}
  def execute_with_strategy(strategies, opts \\ []) do
    on_fallback = Keyword.get(opts, :on_fallback)
    do_execute(strategies, {:error, :no_strategies_provided}, on_fallback, 0)
  end

  # --- Private Implementation ---

  # No strategies left: return the most recent error.
  defp do_execute([], last_error, _on_fallback, _index), do: last_error

  defp do_execute([strategy | remaining], _last_error, on_fallback, index) do
    case strategy.() do
      {:ok, _} = success ->
        success

      {:error, _} = error ->
        # Notify the callback only when another strategy will actually run.
        if remaining != [] and is_function(on_fallback, 2) do
          on_fallback.(error, index + 1)
        end

        do_execute(remaining, error, on_fallback, index + 1)
    end
  end
end
Key Design Points:
- Simplicity: The API is a single function that takes a list of anonymous functions. This is highly composable and easy to understand.
- Statelessness: The module itself is stateless. All state is contained within the closures provided by the user.
- Observability: The :on_fallback callback provides a crucial hook for logging and telemetry, allowing developers to monitor when their systems are degrading to secondary or tertiary strategies.
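The first-success semantics can also be expressed without explicit recursion. The sketch below is an illustrative alternative, not the proposed implementation: FallbackSketch.run/1 is a hypothetical name, and it omits the :on_fallback option to stay short. It uses Enum.reduce_while/3 to halt at the first {:ok, _} result and otherwise carry the most recent error forward.

```elixir
defmodule FallbackSketch do
  # Hypothetical stand-in for the proposed function, without :on_fallback.
  def run(strategies) when is_list(strategies) do
    initial = {:error, :no_strategies_provided}

    Enum.reduce_while(strategies, initial, fn strategy, _last_error ->
      case strategy.() do
        {:ok, _value} = success -> {:halt, success}
        {:error, _reason} = error -> {:cont, error}
      end
    end)
  end
end

# The first strategy fails, the second succeeds, the third is never invoked.
result =
  FallbackSketch.run([
    fn -> {:error, :circuit_open} end,
    fn -> {:ok, "summary from fallback model"} end,
    fn -> raise "never reached" end
  ])

IO.inspect(result)
```

Because the strategies are plain closures, evaluation is lazy: a later strategy costs nothing unless every earlier one has failed.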
4. Example Usage in a DSPEx Module
Here’s how a DSPEx program would use this new utility to implement the desired fallback logic.
defmodule MyApp.Summarizer do
  @behaviour DSPEx.Program

  # Logger macros require the module to be required first.
  require Logger

  # Assume clients for both models are started and registered
  @gpt4o_client :my_gpt4o_client
  @gpt4o_mini_client :my_gpt4o_mini_client

  defstruct [:signature]

  def forward(program, inputs) do
    # Define the fallback strategies in order of preference.
    strategies = [
      # Strategy 1: Try the primary, high-quality model.
      fn ->
        predict_with_client(program, inputs, @gpt4o_client)
      end,
      # Strategy 2: If that fails, try the cheaper, faster model.
      fn ->
        predict_with_client(program, inputs, @gpt4o_mini_client)
      end,
      # Strategy 3: If that also fails, try to find a cached response.
      fn ->
        # This assumes a Cache module that can be queried.
        DSPEx.Cache.get_summary(inputs.article_text)
      end,
      # Strategy 4: The final, deterministic fallback.
      fn ->
        {:ok, %DSPEx.Prediction{summary: "We are unable to provide a summary at this time. Please try again later."}}
      end
    ]

    # Define a callback for observability. `index` is the zero-based
    # position of the strategy that is about to be attempted.
    on_fallback = fn {:error, reason}, index ->
      Logger.warning("Summarizer strategy failed, falling back to strategy ##{index}. Reason: #{inspect(reason)}")

      Telemetry.emit_counter([:dspex, :summarizer, :fallback], %{
        failed_strategy_index: index - 1
      })
    end

    # Execute the pipeline.
    Foundation.GracefulDegradation.execute_with_strategy(strategies, on_fallback: on_fallback)
  end

  # Helper function to reduce code duplication
  defp predict_with_client(program, inputs, client_name) do
    # This reuses the existing Predict module logic
    predictor = %DSPEx.Predict{signature: program.signature, client: client_name}
    DSPEx.Program.forward(predictor, inputs)
  end
end
Benefits of this approach:
- Declarative Strategy: The strategies list clearly and declaratively defines the fallback chain. The core business logic is not cluttered with nested case statements.
- Clean Separation: The predict_with_client/3 helper encapsulates the core DSPEx call, while the forward/2 function is responsible for orchestrating the resilience strategy.
- Testability: Each strategy function can be tested in isolation. You can easily test the fallback logic by providing mock functions that are designed to fail.
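The testability point can be illustrated with a short ExUnit script. This is a sketch under stated assumptions: FallbackUnderTest is a hypothetical stand-in that follows the first-success contract described in this document, included only so the tests run on their own, and the mock strategies are plain closures that return fixed results.

```elixir
ExUnit.start()

defmodule FallbackUnderTest do
  # Hypothetical stand-in: returns the first {:ok, _}, else the last error.
  def run(strategies, last_error \\ {:error, :no_strategies_provided})
  def run([], last_error), do: last_error

  def run([strategy | rest], _last_error) do
    case strategy.() do
      {:ok, _value} = success -> success
      {:error, _reason} = error -> run(rest, error)
    end
  end
end

defmodule FallbackUnderTestTest do
  use ExUnit.Case, async: true

  test "returns the first success and skips later strategies" do
    test_pid = self()

    strategies = [
      # Mock designed to fail, forcing a fallback.
      fn -> {:error, :primary_down} end,
      fn -> {:ok, :from_secondary} end,
      # Reports to the test process if it is ever invoked.
      fn ->
        send(test_pid, :third_called)
        {:ok, :from_third}
      end
    ]

    assert FallbackUnderTest.run(strategies) == {:ok, :from_secondary}
    refute_received :third_called
  end

  test "returns the last error when every strategy fails" do
    strategies = [
      fn -> {:error, :primary_down} end,
      fn -> {:error, :cache_miss} end
    ]

    assert FallbackUnderTest.run(strategies) == {:error, :cache_miss}
  end
end
```

No network access or running LLM client is needed, since the failure modes are injected as function return values.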
5. Telemetry and Events
The GracefulDegradation module itself is generic and should not emit application-specific telemetry. Instead, it provides the :on_fallback hook to empower the application to emit its own, context-rich telemetry.
As shown in the example, the DSPEx summarizer module would emit a metric like [:dspex, :summarizer, :fallback]. This allows operators to build dashboards and alerts that answer critical questions like:
- “How often is our primary gpt-4o model failing?”
- “Are we spending more on the cheaper gpt-4o-mini than expected due to fallbacks?”
- “Which errors are causing the fallbacks?” (by inspecting the reason in the callback).
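As an illustration of the context such a callback can attach, the sketch below builds a fallback event from the error and the index it receives. FallbackEvents, build/2, and the strategy labels are hypothetical; a real application would hand the resulting map to its own telemetry or logging layer rather than printing it.

```elixir
defmodule FallbackEvents do
  # Hypothetical labels for the four strategies in the summarizer example.
  @strategy_labels %{0 => "gpt-4o", 1 => "gpt-4o-mini", 2 => "cache", 3 => "static"}

  # `next_index` is the index of the strategy about to be attempted,
  # so the strategy that just failed is `next_index - 1`.
  def build({:error, reason}, next_index) when is_integer(next_index) and next_index > 0 do
    %{
      event: [:dspex, :summarizer, :fallback],
      failed_strategy: Map.get(@strategy_labels, next_index - 1, "unknown"),
      next_strategy: Map.get(@strategy_labels, next_index, "unknown"),
      reason: reason
    }
  end
end

# A callback with the (error, next_strategy_index) shape expected by :on_fallback.
on_fallback = fn error, next_index ->
  error |> FallbackEvents.build(next_index) |> IO.inspect(label: "fallback")
  :ok
end

on_fallback.({:error, :timeout}, 1)
```

Tagging each event with both the failed and the next strategy lets a dashboard separate primary-model failures from cache misses without parsing log text.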
6. Conclusion
The Foundation.GracefulDegradation module provides a simple yet powerful pattern for building resilient systems. It moves failure handling from a reactive, nested control-flow problem into a proactive, declarative strategy definition.
For DSPEx, this is a significant feature. It enables developers to build AI applications that are robust, cost-aware, and fault-tolerant by design, and it provides the tools to handle the inherent unreliability of external network services gracefully, which means a better user experience and more stable production systems. This enhancement is an important step toward making DSPEx a production-ready framework.