Excellent. Based on the coupling analysis and the goal of creating a testable, mockable system in DSPEx, here is a detailed plan for integrating the highly coupled components.
The core challenge is that components like `Teleprompter` depend on the emergent behavior of a full stack (`Program` -> `Predict` -> `Adapter` -> `Client`). Directly porting this would create a testing nightmare.
The solution is a Contract-First, Bottom-Up Integration Plan. We will define clear interfaces (`behaviours` in Elixir) for each component, allowing them to be mocked and tested in isolation before being assembled.
Guiding Principles
- Contracts are King: Every major component (`Client`, `Adapter`, `Program`, `Evaluator`, `Teleprompter`) will implement a formal `behaviour`. This is the foundation for mockability.
- Mock with `Mox`: We will leverage `Mox` to create mock implementations of these behaviours for fast, deterministic unit and integration tests.
- Isolate, then Integrate: We will test each component against mocked versions of its dependencies before integrating them into a larger whole.
- Bottom-Up Implementation: We will build from the foundational layers (`Client`, `Adapter`) up to the orchestration layers (`Teleprompter`).
The Mocking & Testing Harness
Before integration, we’ll establish a “harness” in our test environment (`test/support/` or similar).
1. Define Mockable Behaviours:
Create `behaviour` modules for each key component. These define the “contract” that both real and mock implementations must adhere to.
```elixir
# In lib/dspex/client.ex
defmodule DSPEx.Client do
  @callback request(messages :: list(), opts :: keyword()) ::
              {:ok, map()} | {:error, any()}
end

# In lib/dspex/adapter.ex
defmodule DSPEx.Adapter do
  @callback format(signature :: module(), inputs :: map(), demos :: list()) ::
              {:ok, list()} | {:error, any()}
  @callback parse(signature :: module(), response :: map()) ::
              {:ok, map()} | {:error, any()}
end

# In lib/dspex/program.ex
defmodule DSPEx.Program do
  @callback forward(program :: struct(), inputs :: map(), opts :: keyword()) ::
              {:ok, DSPEx.Prediction.t()} | {:error, any()}
end

# In lib/dspex/evaluate.ex
defmodule DSPEx.Evaluator do
  @callback run(program :: module(), dataset :: list(DSPEx.Example.t()), metric_fn :: fun(), opts :: keyword()) ::
              {:ok, float()} | {:error, any()}
end

# In lib/dspex/teleprompter.ex
defmodule DSPEx.Teleprompter do
  # compile/2 takes and returns a program struct, since the optimized
  # student carries its selected demos in a .demos field.
  @callback compile(student :: struct(), opts :: keyword()) ::
              {:ok, struct()} | {:error, any()}
end
```
2. Configure `Mox` for Mocking:
In `test/test_helper.exs`, we’ll define our mocks.
```elixir
# test/test_helper.exs
Mox.defmock(MockClient, for: DSPEx.Client)
Mox.defmock(MockAdapter, for: DSPEx.Adapter)
Mox.defmock(MockProgram, for: DSPEx.Program)
Mox.defmock(MockEvaluator, for: DSPEx.Evaluator)
Mox.defmock(MockTeleprompter, for: DSPEx.Teleprompter)

ExUnit.start()
```
This setup allows any test to call `Mox.expect` on these mocks, defining their behavior for a specific test case.
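As a quick smoke test of the harness, a test can set an expectation and let Mox verify it on exit; the response shape below is illustrative, not a committed format:

```elixir
defmodule DSPEx.HarnessSmokeTest do
  use ExUnit.Case, async: true
  import Mox

  # Fail the test if a declared expectation is never satisfied.
  setup :verify_on_exit!

  test "MockClient returns the canned response" do
    expect(MockClient, :request, fn _messages, _opts ->
      {:ok, %{"choices" => [%{"message" => %{"content" => "Answer: 42"}}]}}
    end)

    assert {:ok, %{"choices" => _}} = MockClient.request([], [])
  end
end
```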
Phased Integration Plan
Phase 1: Implement the Execution Core (`Client` and `Adapter`)
Goal: To make `DSPEx.Predict` work by completing a full request cycle against a mocked `Client`.
Steps:
1. Implement `DSPEx.Client.OpenAI` (and/or others):
- It must implement the `@behaviour DSPEx.Client`.
- The actual HTTP request logic (using `Req`) will live in this module.
- Crucial: the client a program uses must be configurable, defaulting to a globally configured LM (the counterpart of `dspy.settings.lm`), so that a mock can be injected during tests. A sketch follows.
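A minimal sketch of such a client, assuming a `Req` POST to the OpenAI chat completions endpoint; the option names (`:api_key`, `:model`) and the default model are illustrative, not a final API:

```elixir
defmodule DSPEx.Client.OpenAI do
  @behaviour DSPEx.Client

  @impl true
  def request(messages, opts) do
    # Assumed configuration knobs; real config would likely live in app env.
    api_key = Keyword.get(opts, :api_key) || System.fetch_env!("OPENAI_API_KEY")
    model = Keyword.get(opts, :model, "gpt-4o-mini")

    case Req.post("https://api.openai.com/v1/chat/completions",
           json: %{model: model, messages: messages},
           auth: {:bearer, api_key}
         ) do
      {:ok, %Req.Response{status: 200, body: body}} -> {:ok, body}
      {:ok, %Req.Response{status: status, body: body}} -> {:error, {:http_error, status, body}}
      {:error, reason} -> {:error, reason}
    end
  end
end
```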
2. Implement `DSPEx.Adapter.Chat`:
- It must implement the `@behaviour DSPEx.Adapter`.
- The `format/3` function will take a `Signature` and `Example`s and produce a list of chat messages.
- The `parse/2` function will take a raw LLM response map and extract the structured output fields. A sketch of both callbacks follows.
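This sketch assumes signature modules expose `input_fields/0` and `output_fields/0` returning lists of atoms, and that outputs are rendered as `field: value` lines; both conventions are assumptions for illustration:

```elixir
defmodule DSPEx.Adapter.Chat do
  @behaviour DSPEx.Adapter

  @impl true
  def format(signature, inputs, demos) do
    system = %{role: "system", content: "Reply with one `field: value` line per output field."}

    # Each demo becomes a user/assistant exchange before the real question.
    demo_messages =
      Enum.flat_map(demos, fn demo ->
        [
          %{role: "user", content: render(signature.input_fields(), demo)},
          %{role: "assistant", content: render(signature.output_fields(), demo)}
        ]
      end)

    question = %{role: "user", content: render(signature.input_fields(), inputs)}
    {:ok, [system | demo_messages] ++ [question]}
  end

  @impl true
  def parse(signature, response) do
    with %{"choices" => [%{"message" => %{"content" => content}} | _]} <- response do
      outputs =
        for field <- signature.output_fields(),
            # Skip fields the model did not emit; Regex.run/2 returns nil then.
            [_, value] <- [Regex.run(~r/#{field}:\s*(.+)/i, content)],
            into: %{},
            do: {field, String.trim(value)}

      {:ok, outputs}
    else
      _ -> {:error, :unexpected_response_shape}
    end
  end

  defp render(fields, values) do
    Enum.map_join(fields, "\n", fn f -> "#{f}: #{Map.get(values, f)}" end)
  end
end
```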
3. Test `DSPEx.Predict` in Isolation:
- The `DSPEx.Predict.forward/3` function orchestrates the `Adapter` and `Client`.
- Test Case:
  - Create a `test "Predict module orchestrates adapter and client correctly"` case.
  - Mock the Client: `Mox.expect(MockClient, :request, fn _messages, _opts -> {:ok, %{"choices" => [%{"message" => %{"content" => "Answer: 42"}}]}} end)`. (The callback is `request/2`, so the mock function must take two arguments.)
  - Mock the Adapter: use `Mox.expect` for `format` and `parse` to ensure they are called correctly.
  - Instantiate `DSPEx.Predict` with the mock client.
  - Call `forward/3` on the `Predict` module.
  - Verify that the final output is the correctly parsed map, e.g., `%{answer: "42"}`. The full test is sketched below.
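A sketch of that test; the `%DSPEx.Predict{}` fields and the `TestSignature` module are assumptions about the final API, but the orchestration under test (format -> request -> parse) is the point:

```elixir
defmodule DSPEx.PredictTest do
  use ExUnit.Case, async: true
  import Mox

  setup :verify_on_exit!

  test "Predict module orchestrates adapter and client correctly" do
    expect(MockAdapter, :format, fn _signature, %{question: q}, _demos ->
      {:ok, [%{role: "user", content: "question: #{q}"}]}
    end)

    expect(MockClient, :request, fn _messages, _opts ->
      {:ok, %{"choices" => [%{"message" => %{"content" => "Answer: 42"}}]}}
    end)

    expect(MockAdapter, :parse, fn _signature, _response ->
      {:ok, %{answer: "42"}}
    end)

    # Assumed struct shape: the program carries its collaborators.
    program = %DSPEx.Predict{signature: TestSignature, adapter: MockAdapter, client: MockClient}

    assert {:ok, %DSPEx.Prediction{outputs: %{answer: "42"}}} =
             DSPEx.Predict.forward(program, %{question: "What is 6 x 7?"}, [])
  end
end
```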
Outcome of Phase 1: We have a working `Predict` module whose external dependencies (`Client`) and internal logic (`Adapter`) can be fully mocked and verified.
Phase 2: Implement the Evaluation Loop (`Evaluate`)
Goal: To create an `Evaluator` that can score a `Program`’s performance on a dataset, testable without running a real program.
Steps:
1. Implement `DSPEx.Evaluate`:
- It must implement the `@behaviour DSPEx.Evaluator`.
- The `run/4` function will take a `Program`, a `dataset`, and a `metric_fn`.
- It will use `Task.async_stream` to call the program’s `forward/3` concurrently for each example in the dataset.
- It will aggregate the scores from the `metric_fn`. A sketch follows.
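A sketch of `run/4`, assuming the program is passed as a module (which is how a Mox mock arrives) and that examples carry an `inputs` map; failed predictions simply score 0.0 here:

```elixir
defmodule DSPEx.Evaluate do
  @behaviour DSPEx.Evaluator

  @impl true
  def run(program, dataset, metric_fn, opts) do
    scores =
      dataset
      |> Task.async_stream(
        fn example ->
          # Module dispatch; a struct-based program would need a dispatch helper.
          case program.forward(program, example.inputs, opts) do
            {:ok, prediction} -> metric_fn.(example, prediction)
            {:error, _reason} -> 0.0
          end
        end,
        max_concurrency: Keyword.get(opts, :max_concurrency, System.schedulers_online()),
        timeout: Keyword.get(opts, :timeout, 30_000)
      )
      |> Enum.map(fn {:ok, score} -> score end)

    # Average the per-example scores; an empty dataset scores 0.0.
    {:ok, Enum.sum(scores) / max(length(scores), 1)}
  end
end
```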
2. Test `DSPEx.Evaluate` with a Mock Program:
- Test Case:
  - Use `Mox.stub_with(MockProgram, DSPEx.Program.Stub)` if needed, or just use `Mox.expect`.
  - Define the behavior of the mock program, e.g. `Mox.expect(MockProgram, :forward, 2, fn _program, _inputs, _opts -> {:ok, %DSPEx.Prediction{outputs: %{answer: "correct"}}} end)` (two calls to `forward`; the metric will score one as correct and one as wrong).
  - Define a simple `metric_fn` like `fn example, prediction -> if example.answer == prediction.outputs.answer, do: 1.0, else: 0.0 end`.
  - Call `DSPEx.Evaluate.run/4` with the `MockProgram` and a two-example dataset.
  - Assert that the final score is `0.5`. A full sketch follows.
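A sketch of this test. Because `run/4` calls the program inside `Task.async_stream` tasks (separate processes), the mock is set to global mode; the `DSPEx.Example` fields are assumptions:

```elixir
defmodule DSPEx.EvaluateTest do
  # Global Mox mode requires non-async tests.
  use ExUnit.Case, async: false
  import Mox

  setup :set_mox_global
  setup :verify_on_exit!

  test "Evaluate averages metric scores across the dataset" do
    # Both calls return "correct"; only the first example expects it.
    expect(MockProgram, :forward, 2, fn _program, _inputs, _opts ->
      {:ok, %DSPEx.Prediction{outputs: %{answer: "correct"}}}
    end)

    dataset = [
      %DSPEx.Example{inputs: %{question: "q1"}, answer: "correct"},
      %DSPEx.Example{inputs: %{question: "q2"}, answer: "wrong"}
    ]

    metric_fn = fn example, prediction ->
      if example.answer == prediction.outputs.answer, do: 1.0, else: 0.0
    end

    assert {:ok, 0.5} = DSPEx.Evaluate.run(MockProgram, dataset, metric_fn, [])
  end
end
```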
Outcome of Phase 2: We have a working, concurrent `Evaluator` whose core logic (concurrency, aggregation) can be tested independently of the program it’s evaluating.
Phase 3: Implement the Optimization Engine (`Teleprompter`)
Goal: To implement a `Teleprompter` (e.g., `BootstrapFewShot`) whose complex orchestration logic can be tested without any real LLM calls or slow evaluations.
Steps:
1. Implement `DSPEx.Teleprompter.BootstrapFewShot`:
- It must implement the `@behaviour DSPEx.Teleprompter`.
- The `compile/2` function will:
  - Take a `student` and a `teacher` `Program`.
  - Run the `teacher` over a `trainset` to generate potential demonstrations.
  - Use a mocked or real `Evaluator` to score these demonstrations.
  - Select the best demonstrations based on the metric score.
  - Return a new `student` program with these demos attached. A sketch follows.
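A sketch of `compile/2`, assuming the `teacher`, `trainset`, `metric_fn`, and `evaluator` arrive via `opts` (keeping the `compile/2` arity), that demos are plain `%{inputs: ..., outputs: ...}` maps, and that candidate selection is deliberately naive (growing prefixes of the bootstrapped list, not the real BootstrapFewShot policy):

```elixir
defmodule DSPEx.Teleprompter.BootstrapFewShot do
  @behaviour DSPEx.Teleprompter

  @impl true
  def compile(student, opts) do
    teacher = Keyword.fetch!(opts, :teacher)
    trainset = Keyword.fetch!(opts, :trainset)
    metric_fn = Keyword.fetch!(opts, :metric_fn)
    evaluator = Keyword.get(opts, :evaluator, DSPEx.Evaluate)

    # 1. Run the teacher over the trainset to bootstrap candidate demos.
    #    Module dispatch again: a Mox mock teacher arrives as a module.
    candidates =
      Enum.flat_map(trainset, fn example ->
        case teacher.forward(teacher, example.inputs, opts) do
          {:ok, prediction} -> [%{inputs: example.inputs, outputs: prediction.outputs}]
          {:error, _reason} -> []
        end
      end)

    # 2. Score each candidate demo set with the (mockable) evaluator.
    scored =
      for demos <- candidate_sets(candidates, Keyword.get(opts, :max_demos, 4)) do
        candidate = %{student | demos: demos}
        {:ok, score} = evaluator.run(candidate, trainset, metric_fn, opts)
        {score, candidate}
      end

    # 3. Return the student variant with the best-scoring demos attached.
    {_score, best} = Enum.max_by(scored, fn {score, _candidate} -> score end)
    {:ok, best}
  end

  # Naive candidate generation: growing prefixes of the demo list.
  defp candidate_sets(candidates, max_demos) do
    for n <- 1..min(max_demos, max(length(candidates), 1)), do: Enum.take(candidates, n)
  end
end
```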
2. Test the `Teleprompter` with a Fully Mocked Stack:
- This is the most critical test, demonstrating the power of the contract-first approach.
- Test Case:
  - Mock the `teacher` program: define its `forward` behavior to produce predictable “good” (high-score) and “bad” (low-score) outputs for different inputs.
  - Mock the `Evaluator`: instead of running a real evaluation, define its `run` behavior to return pre-defined scores, which makes the test instantaneous, e.g. `Mox.expect(MockEvaluator, :run, fn program, _dataset, _metric, _opts -> if program_is_good?(program), do: {:ok, 0.9}, else: {:ok, 0.1} end)` (a high score for the “good” program, a low one for the “bad”).
  - Instantiate the `BootstrapFewShot` teleprompter, passing it the mocked evaluator.
  - Call `compile/2` with the `student` and the mocked teacher.
  - Inspect the returned `optimized_student`: assert that its `.demos` field contains only the demonstrations generated by the “good” outputs from the mocked teacher. A full sketch follows.
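A sketch of this test. The plan’s `program_is_good?/1` helper is hypothetical; here the mocked evaluator keys off the candidate’s demos instead, and the struct fields remain assumptions:

```elixir
defmodule DSPEx.BootstrapFewShotTest do
  use ExUnit.Case, async: true
  import Mox

  setup :verify_on_exit!

  test "compile keeps only demos bootstrapped from good teacher outputs" do
    # Teacher produces a "good" output for q1 and a "bad" one for q2.
    stub(MockProgram, :forward, fn _teacher, %{question: q}, _opts ->
      answer = if q == "q1", do: "good", else: "bad"
      {:ok, %DSPEx.Prediction{outputs: %{answer: answer}}}
    end)

    # The mocked evaluator scores candidates by their demos; no real
    # evaluation or LLM call ever runs, so the test is instantaneous.
    stub(MockEvaluator, :run, fn candidate, _dataset, _metric, _opts ->
      if Enum.any?(candidate.demos, &(&1.outputs.answer == "bad")),
        do: {:ok, 0.1},
        else: {:ok, 0.9}
    end)

    trainset = [
      %DSPEx.Example{inputs: %{question: "q1"}, answer: "good"},
      %DSPEx.Example{inputs: %{question: "q2"}, answer: "good"}
    ]

    metric_fn = fn example, prediction ->
      if example.answer == prediction.outputs.answer, do: 1.0, else: 0.0
    end

    {:ok, optimized} =
      DSPEx.Teleprompter.BootstrapFewShot.compile(
        %DSPEx.Predict{signature: TestSignature, demos: []},
        teacher: MockProgram,
        trainset: trainset,
        metric_fn: metric_fn,
        evaluator: MockEvaluator
      )

    assert Enum.all?(optimized.demos, &(&1.outputs.answer == "good"))
  end
end
```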
Outcome of Phase 3: We have a fully testable optimization engine. We can verify its complex selection and compilation logic without waiting for LLM calls or evaluations, leading to rapid development and debugging.
Benefits of this Approach
- Testability and Mockability: Each component can be unit-tested in isolation, drastically reducing test complexity and execution time.
- Developer Productivity: Developers can work on different components (`Adapter`, `Teleprompter`) in parallel, relying on the established `behaviour` contracts.
- Robustness: The formal contracts make the system’s internal API clear and reduce integration errors.
- Performance: The core evaluation loop in `DSPEx.Evaluate` is built for concurrency from day one, leveraging the BEAM’s strengths.
- Clarity: The separation of concerns is enforced by the contracts, making the codebase easier to understand and maintain.