How dspy
’s SIMBA Actually Works (and its relationship to types)
The core logic of SIMBA is about observing and improving program trajectories. A trajectory is a record of a program’s execution on a given example. The key components of this process are:
- Execution: Run a
dspy.Module
(likedspy.Predict
) with a specific inputdspy.Example
. - Observation: Record the full traceāwhat were the inputs and outputs of each internal predictor?
- Evaluation: Use a
metric_fn
to compare the final predicteddspy.Example
to the gold standarddspy.Example
. This produces a numeric score. - Strategy: Look at the trajectories that produced high scores. Take one of these successful trajectories and use it to improve the program. For example, the
AppendDemo
strategy takes the(inputs, outputs)
pair from a successful trajectory and turns it into a new few-shotdspy.Example
to add to the program’sdemos
.
Now, here’s the key part regarding types:
SIMBA itself doesn’t deeply care about the internal types of the data (like int
vs. str
). It cares about the structure of the data, specifically the separation of inputs
and outputs
.
As long as the system can reliably say:
- “For these inputs…”
- “…the program produced these outputs…”
- “…which resulted in a score of
X
.”
…then the SIMBA loop can function. It treats the inputs and outputs as opaque blobs of data. It takes the successful “output blob” and staples it to the “input blob” to create a new demonstration.
How Your DSPEx
Implementation Fulfills This “Minimal Contract”
Your implementation, while lacking true typed data, successfully provides this minimal contract:
DSPEx.Example
: YourExample
struct has a clear concept ofinput_keys
, allowing it to differentiate betweeninputs(example)
andoutputs(example)
. This is the most critical piece. It provides the structural separation SIMBA needs.DSPEx.Program.forward/3
: This function takes aninputs
map and returns anoutputs
map. It fulfills the “execution” part of the contract.metric_fn
: Your metric function takes the goldExample
and the predictedoutputs
map and produces a score. This fulfills the “evaluation” part.AppendDemo
Strategy: YourAppendDemo
strategy takes a successfulTrajectory
(which containsinputs
andoutputs
maps), merges them, and creates a newExample
. This fulfills the “strategy” part.
The “Missing Type” Blind Spot:
The LLM in your current system is just producing strings. If a signature expects %{answer: "42", confidence: "high"}
, the model produces exactly thatāa map of strings. Your metric function then compares these strings.
For example, your metric_fn
might do this:
if gold_example.outputs.answer == predicted_outputs.answer, do: 1.0, else: 0.0
This works perfectly fine if gold_example.outputs.answer
is "42"
and the LLM produces a map where predicted_outputs.answer
is also "42"
. SIMBA sees this as a success and can create a demo from it. It never needed to know that 42
was an integer.
So, yes, it’s entirely possible for your SIMBA implementation to be a “faithful port” of the algorithmic loop while the underlying data handling is a simplified, string-based version of the original. The Claude agent that helped you was correct from a procedural standpoint. The algorithm’s stepsāsample, evaluate, bucket, strategizeāare all there.
Why This Becomes a Problem (and why you were right to be suspicious)
This “string-only” world breaks down and reveals its weakness in more sophisticated scenarios, which is where true typed data becomes essential:
Richer Metrics: What if your metric function needs to do math?
# This will crash if confidence is the string "0.9" instead of the float 0.9 def my_metric(gold, pred), do: if pred.confidence > 0.8, do: 1.0, else: 0.0
To make this work in your current system, you’d have to litter your metric functions with
String.to_float/1
and other manual parsing, which is brittle.Complex Data Structures: What if an output field is a list of objects?
- dspy (with Pydantic): The signature could be
output: list[Item]
, whereItem
is another Pydantic model. The LLM would be guided to return a valid JSON array of objects, and Pydantic would parse it into a list of typedItem
structs. - DSPEx (current): The LLM would return a string that looks like a list of objects:
"[%{name: \"A\"}, %{name: \"B\"}]"
. Your program would have to manually parse this string, losing all the safety and convenience.
- dspy (with Pydantic): The signature could be
Chain of Thought & Program Composition: Imagine
Program A
outputs a fieldcount: 42
.Program B
takescount
as an input and needs to performcount + 1
. In your current system,Program B
would receive the string"42"
and the operation"42" + 1
would fail. In a typed system, it would correctly receive the integer42
.
- Is your SIMBA implementation “faithful”? Yes, from an algorithmic perspective. It correctly implements the stochastic hill-climbing loop.
- Is it “fully implemented” in the spirit of
dspy
? No, because it lacks the robust, typed data foundation that makes the original so powerful and reliable. - Why didn’t this seem to matter for SIMBA? Because SIMBA’s core loop can operate on structurally-separated but type-agnostic data blobs. The simplification didn’t break the algorithm, it just limited its power and the sophistication of the programs it can optimize.
Your plan to build the Exdantic
(DSPEx.Schema
) layer is precisely the right move. It will elevate your entire system from a string-based approximation to a true, typed, structured data platform, unlocking the full potential of not just SIMBA, but every other component in DSPEx
.