Excellent. Based on the coupling analysis and the goal of creating a testable, mockable system in DSPEx, here is a detailed plan for integrating the highly coupled components.
The core challenge is that components like `Teleprompter` depend on the emergent behavior of a full stack (`Program` -> `Predict` -> `Adapter` -> `Client`). Directly porting this would create a testing nightmare.
The solution is a Contract-First, Bottom-Up Integration Plan. We will define clear interfaces (`behaviours` in Elixir) for each component, allowing them to be mocked and tested in isolation before being assembled.
Guiding Principles
- Contracts are King: Every major component (`Client`, `Adapter`, `Program`, `Evaluator`, `Teleprompter`) will implement a formal `behaviour`. This is the foundation for mockability.
- Mock with `Mox`: We will leverage `Mox` to create mock implementations of these behaviours for fast, deterministic unit and integration tests.
- Isolate, then Integrate: We will test each component against mocked versions of its dependencies before integrating them into a larger whole.
- Bottom-Up Implementation: We will build from the foundational layers (`Client`, `Adapter`) up to the orchestration layers (`Teleprompter`).
The Mocking & Testing Harness
Before integration, we’ll establish a “harness” in our test environment (`test/support/` or similar).
1. Define Mockable Behaviours:
Create `behaviour` modules for each key component. These define the “contract” that both real and mock implementations must adhere to.
```elixir
# In lib/dspex/client.ex
defmodule DSPEx.Client do
  @callback request(messages :: list(), opts :: keyword()) ::
              {:ok, map()} | {:error, any()}
end

# In lib/dspex/adapter.ex
defmodule DSPEx.Adapter do
  @callback format(signature :: module(), inputs :: map(), demos :: list()) ::
              {:ok, list()} | {:error, any()}
  @callback parse(signature :: module(), response :: map()) ::
              {:ok, map()} | {:error, any()}
end

# In lib/dspex/program.ex
defmodule DSPEx.Program do
  @callback forward(program :: struct(), inputs :: map(), opts :: keyword()) ::
              {:ok, DSPEx.Prediction.t()} | {:error, any()}
end

# In lib/dspex/evaluate.ex
defmodule DSPEx.Evaluator do
  @callback run(program :: module(), dataset :: list(DSPEx.Example.t()), metric_fn :: fun(), opts :: keyword()) ::
              {:ok, float()} | {:error, any()}
end

# In lib/dspex/teleprompter.ex
defmodule DSPEx.Teleprompter do
  # compile/2 takes and returns a program struct, since the optimized
  # student carries its selected demos in a .demos field.
  @callback compile(student :: struct(), opts :: keyword()) ::
              {:ok, struct()} | {:error, any()}
end
```
2. Configure `Mox` for Mocking:
In `test/test_helper.exs`, we’ll define our mocks.
```elixir
# test/test_helper.exs
Mox.defmock(MockClient, for: DSPEx.Client)
Mox.defmock(MockAdapter, for: DSPEx.Adapter)
Mox.defmock(MockProgram, for: DSPEx.Program)
Mox.defmock(MockEvaluator, for: DSPEx.Evaluator)
Mox.defmock(MockTeleprompter, for: DSPEx.Teleprompter)

ExUnit.start()
```
This setup allows any test to call `Mox.expect` on these mocks, defining their behavior for a specific test case.
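As a quick smoke test of the harness, a test can set an expectation and let Mox verify it on exit; the response shape below is illustrative, not a committed format:

```elixir
defmodule DSPEx.HarnessSmokeTest do
  use ExUnit.Case, async: true
  import Mox

  # Fail the test if a declared expectation is never satisfied.
  setup :verify_on_exit!

  test "MockClient returns the canned response" do
    expect(MockClient, :request, fn _messages, _opts ->
      {:ok, %{"choices" => [%{"message" => %{"content" => "Answer: 42"}}]}}
    end)

    assert {:ok, %{"choices" => _}} = MockClient.request([], [])
  end
end
```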
Phased Integration Plan
Phase 1: Implement the Execution Core (`Client` and `Adapter`)
Goal: To make `DSPEx.Predict` work by completing a full request cycle against a mocked `Client`.
Steps:
1. Implement `DSPEx.Client.OpenAI` (and/or others):
- It must implement the `@behaviour DSPEx.Client`.
- The actual HTTP request logic (using `Req`) will live in this module.
- Crucial: the client a program uses must be configurable, defaulting to a globally configured LM (the counterpart of `dspy.settings.lm`), so that a mock can be injected during tests. A sketch follows.
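A minimal sketch of such a client, assuming a `Req` POST to the OpenAI chat completions endpoint; the option names (`:api_key`, `:model`) and the default model are illustrative, not a final API:

```elixir
defmodule DSPEx.Client.OpenAI do
  @behaviour DSPEx.Client

  @impl true
  def request(messages, opts) do
    # Assumed configuration knobs; real config would likely live in app env.
    api_key = Keyword.get(opts, :api_key) || System.fetch_env!("OPENAI_API_KEY")
    model = Keyword.get(opts, :model, "gpt-4o-mini")

    case Req.post("https://api.openai.com/v1/chat/completions",
           json: %{model: model, messages: messages},
           auth: {:bearer, api_key}
         ) do
      {:ok, %Req.Response{status: 200, body: body}} -> {:ok, body}
      {:ok, %Req.Response{status: status, body: body}} -> {:error, {:http_error, status, body}}
      {:error, reason} -> {:error, reason}
    end
  end
end
```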
2. Implement `DSPEx.Adapter.Chat`:
- It must implement the `@behaviour DSPEx.Adapter`.
- The `format/3` function will take a `Signature` and `Example`s and produce a list of chat messages.
- The `parse/2` function will take a raw LLM response map and extract the structured output fields. A sketch of both callbacks follows.
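This sketch assumes signature modules expose `input_fields/0` and `output_fields/0` returning lists of atoms, and that outputs are rendered as `field: value` lines; both conventions are assumptions for illustration:

```elixir
defmodule DSPEx.Adapter.Chat do
  @behaviour DSPEx.Adapter

  @impl true
  def format(signature, inputs, demos) do
    system = %{role: "system", content: "Reply with one `field: value` line per output field."}

    # Each demo becomes a user/assistant exchange before the real question.
    demo_messages =
      Enum.flat_map(demos, fn demo ->
        [
          %{role: "user", content: render(signature.input_fields(), demo)},
          %{role: "assistant", content: render(signature.output_fields(), demo)}
        ]
      end)

    question = %{role: "user", content: render(signature.input_fields(), inputs)}
    {:ok, [system | demo_messages] ++ [question]}
  end

  @impl true
  def parse(signature, response) do
    with %{"choices" => [%{"message" => %{"content" => content}} | _]} <- response do
      outputs =
        for field <- signature.output_fields(),
            # Skip fields the model did not emit; Regex.run/2 returns nil then.
            [_, value] <- [Regex.run(~r/#{field}:\s*(.+)/i, content)],
            into: %{},
            do: {field, String.trim(value)}

      {:ok, outputs}
    else
      _ -> {:error, :unexpected_response_shape}
    end
  end

  defp render(fields, values) do
    Enum.map_join(fields, "\n", fn f -> "#{f}: #{Map.get(values, f)}" end)
  end
end
```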
3. Test `DSPEx.Predict` in Isolation:
- The `DSPEx.Predict.forward/3` function orchestrates the `Adapter` and `Client`.
- Test Case:
  - Create a `test "Predict module orchestrates adapter and client correctly"` case.
  - Mock the Client: `Mox.expect(MockClient, :request, fn _messages, _opts -> {:ok, %{"choices" => [%{"message" => %{"content" => "Answer: 42"}}]}} end)`. (The callback is `request/2`, so the mock function must take two arguments.)
  - Mock the Adapter: use `Mox.expect` for `format` and `parse` to ensure they are called correctly.
  - Instantiate `DSPEx.Predict` with the mock client.
  - Call `forward/3` on the `Predict` module.
  - Verify that the final output is the correctly parsed map, e.g., `%{answer: "42"}`. The full test is sketched below.
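A sketch of that test; the `%DSPEx.Predict{}` fields and the `TestSignature` module are assumptions about the final API, but the orchestration under test (format -> request -> parse) is the point:

```elixir
defmodule DSPEx.PredictTest do
  use ExUnit.Case, async: true
  import Mox

  setup :verify_on_exit!

  test "Predict module orchestrates adapter and client correctly" do
    expect(MockAdapter, :format, fn _signature, %{question: q}, _demos ->
      {:ok, [%{role: "user", content: "question: #{q}"}]}
    end)

    expect(MockClient, :request, fn _messages, _opts ->
      {:ok, %{"choices" => [%{"message" => %{"content" => "Answer: 42"}}]}}
    end)

    expect(MockAdapter, :parse, fn _signature, _response ->
      {:ok, %{answer: "42"}}
    end)

    # Assumed struct shape: the program carries its collaborators.
    program = %DSPEx.Predict{signature: TestSignature, adapter: MockAdapter, client: MockClient}

    assert {:ok, %DSPEx.Prediction{outputs: %{answer: "42"}}} =
             DSPEx.Predict.forward(program, %{question: "What is 6 x 7?"}, [])
  end
end
```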
Outcome of Phase 1: We have a working `Predict` module whose external dependencies (`Client`) and internal logic (`Adapter`) can be fully mocked and verified.
Phase 2: Implement the Evaluation Loop (`Evaluate`)
Goal: To create an `Evaluator` that can score a `Program`’s performance on a dataset, testable without running a real program.
Steps:
1. Implement `DSPEx.Evaluate`:
- It must implement the `@behaviour DSPEx.Evaluator`.
- The `run/4` function will take a `Program`, a `dataset`, and a `metric_fn`.
- It will use `Task.async_stream` to call the program’s `forward/3` concurrently for each example in the dataset.
- It will aggregate the scores from the `metric_fn`. A sketch follows.
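A sketch of `run/4`, assuming the program is passed as a module (which is how a Mox mock arrives) and that examples carry an `inputs` map; failed predictions simply score 0.0 here:

```elixir
defmodule DSPEx.Evaluate do
  @behaviour DSPEx.Evaluator

  @impl true
  def run(program, dataset, metric_fn, opts) do
    scores =
      dataset
      |> Task.async_stream(
        fn example ->
          # Module dispatch; a struct-based program would need a dispatch helper.
          case program.forward(program, example.inputs, opts) do
            {:ok, prediction} -> metric_fn.(example, prediction)
            {:error, _reason} -> 0.0
          end
        end,
        max_concurrency: Keyword.get(opts, :max_concurrency, System.schedulers_online()),
        timeout: Keyword.get(opts, :timeout, 30_000)
      )
      |> Enum.map(fn {:ok, score} -> score end)

    # Average the per-example scores; an empty dataset scores 0.0.
    {:ok, Enum.sum(scores) / max(length(scores), 1)}
  end
end
```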
2. Test `DSPEx.Evaluate` with a Mock Program:
- Test Case:
  - Use `Mox.stub_with(MockProgram, DSPEx.Program.Stub)` if needed, or just use `Mox.expect`.
  - Define the behavior of the mock program, e.g. `Mox.expect(MockProgram, :forward, 2, fn _program, _inputs, _opts -> {:ok, %DSPEx.Prediction{outputs: %{answer: "correct"}}} end)` (two calls to `forward`; the metric will score one as correct and one as wrong).
  - Define a simple `metric_fn` like `fn example, prediction -> if example.answer == prediction.outputs.answer, do: 1.0, else: 0.0 end`.
  - Call `DSPEx.Evaluate.run/4` with the `MockProgram` and a two-example dataset.
  - Assert that the final score is `0.5`. A full sketch follows.
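A sketch of this test. Because `run/4` calls the program inside `Task.async_stream` tasks (separate processes), the mock is set to global mode; the `DSPEx.Example` fields are assumptions:

```elixir
defmodule DSPEx.EvaluateTest do
  # Global Mox mode requires non-async tests.
  use ExUnit.Case, async: false
  import Mox

  setup :set_mox_global
  setup :verify_on_exit!

  test "Evaluate averages metric scores across the dataset" do
    # Both calls return "correct"; only the first example expects it.
    expect(MockProgram, :forward, 2, fn _program, _inputs, _opts ->
      {:ok, %DSPEx.Prediction{outputs: %{answer: "correct"}}}
    end)

    dataset = [
      %DSPEx.Example{inputs: %{question: "q1"}, answer: "correct"},
      %DSPEx.Example{inputs: %{question: "q2"}, answer: "wrong"}
    ]

    metric_fn = fn example, prediction ->
      if example.answer == prediction.outputs.answer, do: 1.0, else: 0.0
    end

    assert {:ok, 0.5} = DSPEx.Evaluate.run(MockProgram, dataset, metric_fn, [])
  end
end
```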
Outcome of Phase 2: We have a working, concurrent `Evaluator` whose core logic (concurrency, aggregation) can be tested independently of the program it’s evaluating.
Phase 3: Implement the Optimization Engine (`Teleprompter`)
Goal: To implement a `Teleprompter` (e.g., `BootstrapFewShot`) whose complex orchestration logic can be tested without any real LLM calls or slow evaluations.
Steps:
1. Implement `DSPEx.Teleprompter.BootstrapFewShot`:
- It must implement the `@behaviour DSPEx.Teleprompter`.
- The `compile/2` function will:
  - Take a `student` and a `teacher` `Program`.
  - Run the `teacher` over a `trainset` to generate potential demonstrations.
  - Use a mocked or real `Evaluator` to score these demonstrations.
  - Select the best demonstrations based on the metric score.
  - Return a new `student` program with these demos attached. A sketch follows.
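A sketch of `compile/2`, assuming the `teacher`, `trainset`, `metric_fn`, and `evaluator` arrive via `opts` (keeping the `compile/2` arity), that demos are plain `%{inputs: ..., outputs: ...}` maps, and that candidate selection is deliberately naive (growing prefixes of the bootstrapped list, not the real BootstrapFewShot policy):

```elixir
defmodule DSPEx.Teleprompter.BootstrapFewShot do
  @behaviour DSPEx.Teleprompter

  @impl true
  def compile(student, opts) do
    teacher = Keyword.fetch!(opts, :teacher)
    trainset = Keyword.fetch!(opts, :trainset)
    metric_fn = Keyword.fetch!(opts, :metric_fn)
    evaluator = Keyword.get(opts, :evaluator, DSPEx.Evaluate)

    # 1. Run the teacher over the trainset to bootstrap candidate demos.
    #    Module dispatch again: a Mox mock teacher arrives as a module.
    candidates =
      Enum.flat_map(trainset, fn example ->
        case teacher.forward(teacher, example.inputs, opts) do
          {:ok, prediction} -> [%{inputs: example.inputs, outputs: prediction.outputs}]
          {:error, _reason} -> []
        end
      end)

    # 2. Score each candidate demo set with the (mockable) evaluator.
    scored =
      for demos <- candidate_sets(candidates, Keyword.get(opts, :max_demos, 4)) do
        candidate = %{student | demos: demos}
        {:ok, score} = evaluator.run(candidate, trainset, metric_fn, opts)
        {score, candidate}
      end

    # 3. Return the student variant with the best-scoring demos attached.
    {_score, best} = Enum.max_by(scored, fn {score, _candidate} -> score end)
    {:ok, best}
  end

  # Naive candidate generation: growing prefixes of the demo list.
  defp candidate_sets(candidates, max_demos) do
    for n <- 1..min(max_demos, max(length(candidates), 1)), do: Enum.take(candidates, n)
  end
end
```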
2. Test the `Teleprompter` with a Fully Mocked Stack:
- This is the most critical test, demonstrating the power of the contract-first approach.
- Test Case:
  - Mock the `teacher` program: define its `forward` behavior to produce predictable “good” (high-score) and “bad” (low-score) outputs for different inputs.
  - Mock the `Evaluator`: instead of running a real evaluation, define its `run` behavior to return pre-defined scores, which makes the test instantaneous, e.g. `Mox.expect(MockEvaluator, :run, fn program, _dataset, _metric, _opts -> if program_is_good?(program), do: {:ok, 0.9}, else: {:ok, 0.1} end)` (a high score for the “good” program, a low one for the “bad”).
  - Instantiate the `BootstrapFewShot` teleprompter, passing it the mocked evaluator.
  - Call `compile/2` with the `student` and the mocked teacher.
  - Inspect the returned `optimized_student`: assert that its `.demos` field contains only the demonstrations generated by the “good” outputs from the mocked teacher. A full sketch follows.
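A sketch of this test. The plan’s `program_is_good?/1` helper is hypothetical; here the mocked evaluator keys off the candidate’s demos instead, and the struct fields remain assumptions:

```elixir
defmodule DSPEx.BootstrapFewShotTest do
  use ExUnit.Case, async: true
  import Mox

  setup :verify_on_exit!

  test "compile keeps only demos bootstrapped from good teacher outputs" do
    # Teacher produces a "good" output for q1 and a "bad" one for q2.
    stub(MockProgram, :forward, fn _teacher, %{question: q}, _opts ->
      answer = if q == "q1", do: "good", else: "bad"
      {:ok, %DSPEx.Prediction{outputs: %{answer: answer}}}
    end)

    # The mocked evaluator scores candidates by their demos; no real
    # evaluation or LLM call ever runs, so the test is instantaneous.
    stub(MockEvaluator, :run, fn candidate, _dataset, _metric, _opts ->
      if Enum.any?(candidate.demos, &(&1.outputs.answer == "bad")),
        do: {:ok, 0.1},
        else: {:ok, 0.9}
    end)

    trainset = [
      %DSPEx.Example{inputs: %{question: "q1"}, answer: "good"},
      %DSPEx.Example{inputs: %{question: "q2"}, answer: "good"}
    ]

    metric_fn = fn example, prediction ->
      if example.answer == prediction.outputs.answer, do: 1.0, else: 0.0
    end

    {:ok, optimized} =
      DSPEx.Teleprompter.BootstrapFewShot.compile(
        %DSPEx.Predict{signature: TestSignature, demos: []},
        teacher: MockProgram,
        trainset: trainset,
        metric_fn: metric_fn,
        evaluator: MockEvaluator
      )

    assert Enum.all?(optimized.demos, &(&1.outputs.answer == "good"))
  end
end
```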
Outcome of Phase 3: We have a fully testable optimization engine. We can verify its complex selection and compilation logic without waiting for LLM calls or evaluations, leading to rapid development and debugging.
Benefits of this Approach
- Testability and Mockability: Each component can be unit-tested in isolation, drastically reducing test complexity and execution time.
- Developer Productivity: Developers can work on different components (`Adapter`, `Teleprompter`) in parallel, relying on the established `behaviour` contracts.
- Robustness: The formal contracts make the system’s internal API clear and reduce integration errors.
- Performance: The core evaluation loop in `DSPEx.Evaluate` is built for concurrency from day one, leveraging the BEAM’s strengths.
- Clarity: The separation of concerns is enforced by the contracts, making the codebase easier to understand and maintain.