009 8of10

Documentation for 009_8of10 from the Ds ex repository.

Of course. Here is a detailed technical document for the eighth proposed feature enhancement: “Configurable GracefulDegradation Strategies.”

Technical Specification: Configurable Graceful Degradation Strategies

Document Version: 1.0 Author: AI Assistant Status: PROPOSED

1. Overview

This document specifies a new, high-level module, Foundation.GracefulDegradation, designed to provide a clean, declarative, and composable pattern for handling failures in service interactions. While CircuitBreaker and retry logic handle transient failures, this module addresses semantic failures and provides a structured way to define fallback strategies.

The primary driver for this feature is the need for DSPEx programs to be highly resilient and cost-effective. When a primary, high-capability (and high-cost) LLM like gpt-4o is unavailable or failing, a DSPEx program shouldn’t just fail. It should be able to gracefully degrade to a secondary, cheaper model like gpt-4o-mini, or even fall back to a cached result or a simple, deterministic response. This module will provide the generic building blocks for such patterns.

The core of this proposal is a new function, GracefulDegradation.execute_with_strategy/2, that executes a list of functions in order, stopping at the first success.

2. Problem Statement & Use Case

A DSPEx application has a critical summarization feature. The ideal user experience is provided by gpt-4o.

The Failure Scenarios:

The OpenAI API for gpt-4o is experiencing an outage. The circuit breaker for this service is open.
The API is up, but it’s overloaded, and requests are timing out.
The API returns a valid response, but the content is malformed, causing the DSPEx.Adapter to fail parsing.

In all these cases, the DSPEx program call would currently return an {:error, ...} tuple. The application needs a structured way to define what should happen next. The desired behavior is:

Try: Use gpt-4o.
If that fails, then try: Use the cheaper, possibly lower-quality gpt-4o-mini.
If that also fails, then try: Look up the result in a long-term cache.
If all else fails, then: Return a static, apologetic message.

Implementing this logic directly in the application code leads to deeply nested case statements, making the code hard to read, maintain, and test.

3. Proposed API and Architecture

3.1. `Foundation.GracefulDegradation.execute_with_strategy/2` (New Function)

This will be the sole public function in the new module. It embodies the “Chain of Responsibility” design pattern in a functional way.

File: lib/foundation/graceful_degradation.ex

New Function:

defmodule Foundation.GracefulDegradation do
  @moduledoc """
  Provides a structured way to execute operations with fallback strategies.
  """

  @type strategy_fun :: (-> {:ok, term()} | {:error, term()})
  @type strategy_list :: [strategy_fun()]

  @doc """
  Executes a list of strategy functions in order, returning the result
  of the first function that returns `{:ok, value}`.

  If all functions in the list return `{:error, reason}`, this function
  returns the error from the *last* attempted strategy.

  ## Parameters
    - `strategies`: A list of 0-arity functions, each representing a
      step in the fallback chain.
    - `opts`: A keyword list of options.
      - `:on_fallback`: An optional `(error, next_strategy_index) -> :ok`
        callback that is triggered when a strategy fails and a fallback
        is about to be attempted. Useful for logging or telemetry.

  ## Returns
    - `{:ok, value}` from the first successful strategy.
    - `{:error, reason}` from the last strategy if all fail.
  """
  @spec execute_with_strategy(strategies :: strategy_list(), opts :: keyword()) ::
    {:ok, term()} | {:error, term()}
  def execute_with_strategy(strategies, opts \\ []) do
    on_fallback_callback = Keyword.get(opts, :on_fallback)
    do_execute(strategies, [], on_fallback_callback, 0)
  end

  # --- Private Implementation ---

  defp do_execute([], accumulated_errors, _on_fallback, _index) do
    # All strategies failed, return the last error.
    List.last(accumulated_errors, {:error, :no_strategies_provided})
  end

  defp do_execute([current_strategy | remaining_strategies], accumulated_errors, on_fallback, index) do
    case current_strategy.() do
      {:ok, _} = success ->
        success

      {:error, _} = error ->
        # Current strategy failed. Trigger callback and try the next one.
        if is_function(on_fallback, 2) do
          on_fallback.(error, index + 1)
        end
        
        do_execute(remaining_strategies, accumulated_errors ++ [error], on_fallback, index + 1)
    end
  end
end

Key Design Points:

Simplicity: The API is a single function that takes a list of anonymous functions. This is highly composable and easy to understand.
Statelessness: The module itself is stateless. All state is contained within the closures provided by the user.
Observability: The :on_fallback callback provides a crucial hook for logging and telemetry, allowing developers to monitor when their systems are degrading to secondary or tertiary strategies.

4. Example Usage in a `DSPEx` Module

Here’s how a DSPEx program would use this new utility to implement the desired fallback logic.

defmodule MyApp.Summarizer do
  @behaviour DSPEx.Program

  # Assume clients for both models are started and registered
  @gpt4o_client :my_gpt4o_client
  @gpt4o_mini_client :my_gpt4o_mini_client
  
  defstruct [:signature]

  def forward(program, inputs) do
    # Define the fallback strategies in order of preference.
    strategies = [
      # Strategy 1: Try the primary, high-quality model.
      fn ->
        predict_with_client(program, inputs, @gpt4o_client)
      end,
      
      # Strategy 2: If that fails, try the cheaper, faster model.
      fn ->
        predict_with_client(program, inputs, @gpt4o_mini_client)
      end,
      
      # Strategy 3: If that also fails, try to find a cached response.
      fn ->
        # This assumes a Cache module that can be queried.
        DSPEx.Cache.get_summary(inputs.article_text)
      end,

      # Strategy 4: The final, deterministic fallback.
      fn ->
        {:ok, %DSPEx.Prediction{summary: "We are unable to provide a summary at this time. Please try again later."}}
      end
    ]

    # Define a callback for observability.
    on_fallback = fn {:error, reason}, index ->
      Logger.warn("Summarizer strategy failed, falling back to strategy ##{index}. Reason: #{inspect reason}")
      Telemetry.emit_counter([:dspex, :summarizer, :fallback], %{
        failed_strategy_index: index - 1
      })
    end
    
    # Execute the pipeline.
    Foundation.GracefulDegradation.execute_with_strategy(strategies, on_fallback: on_fallback)
  end
  
  # Helper function to reduce code duplication
  defp predict_with_client(program, inputs, client_name) do
    # This reuses the existing Predict module logic
    predictor = %DSPEx.Predict{signature: program.signature, client: client_name}
    DSPEx.Program.forward(predictor, inputs)
  end
end

Benefits of this approach:

Declarative Strategy: The strategies list clearly and declaratively defines the fallback chain. The core business logic is not cluttered with nested case statements.
Clean Separation: The predict_with_client/3 helper encapsulates the core DSPEx call, while the forward/2 function is responsible for orchestrating the resilience strategy.
Testability: Each strategy function can be tested in isolation. You can easily test the fallback logic by providing mock functions that are designed to fail.

5. Telemetry and Events

The GracefulDegradation module itself is generic and should not emit application-specific telemetry. Instead, it provides the :on_fallback hook to empower the application to emit its own, context-rich telemetry.

As shown in the example, the DSPEx summarizer module would emit a metric like [:dspex, :summarizer, :fallback]. This allows operators to build dashboards and alerts that answer critical questions like:

“How often is our primary gpt-4o model failing?”
“Are we spending more on the cheaper gpt-4o-mini than expected due to fallbacks?”
“Which errors are causing the fallbacks?” (by inspecting the reason in the callback).

6. Conclusion

The Foundation.GracefulDegradation module provides a simple yet powerful pattern for building resilient systems. It moves failure handling from a reactive, nested control-flow problem into a proactive, declarative strategy definition.

For DSPEx, this is a transformative feature. It enables developers to build AI applications that are not only powerful but also robust, cost-aware, and fault-tolerant by design. It provides the tools to handle the inherent unreliability of external network services gracefully, ensuring a better user experience and more stable production systems. This enhancement is a crucial step in making DSPEx a truly production-ready framework.