
111 structured vs simba

Documentation for 111_structured_vs_simba from the DSPEx repository.

How dspy’s SIMBA Actually Works (and its relationship to types)

The core logic of SIMBA is about observing and improving program trajectories. A trajectory is a record of a program’s execution on a given example. The key components of this process are (a code sketch follows the list):

  1. Execution: Run a dspy.Module (like dspy.Predict) with a specific input dspy.Example.
  2. Observation: Record the full trace—what were the inputs and outputs of each internal predictor?
  3. Evaluation: Use a metric_fn to compare the final predicted dspy.Example to the gold standard dspy.Example. This produces a numeric score.
  4. Strategy: Look at the trajectories that produced high scores. Take one of these successful trajectories and use it to improve the program. For example, the AppendDemo strategy takes the (inputs, outputs) pair from a successful trajectory and turns it into a new few-shot dspy.Example to add to the program’s demos.
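
To make those four steps concrete, here is a minimal, hypothetical Elixir sketch of the loop. None of the names below (run_program, metric_fn, strategy_fn) are the real dspy or DSPEx API, and examples are assumed to carry an :inputs map; the sketch only shows the shape of the loop: run, record, score, then let a strategy reuse the winners.

    # Hypothetical sketch of the trajectory loop; run_program/2, metric_fn/2 and
    # strategy_fn/2 are placeholder functions, not real APIs.
    defmodule SimbaLoopSketch do
      # 1. Execution, 2. Observation, 3. Evaluation: one trajectory per example
      def collect_trajectories(program, examples, run_program, metric_fn) do
        Enum.map(examples, fn example ->
          outputs = run_program.(program, example.inputs)
          %{inputs: example.inputs, outputs: outputs, score: metric_fn.(example, outputs)}
        end)
      end

      # 4. Strategy: feed the highest-scoring trajectories back into the program
      def improve(program, trajectories, strategy_fn, k \\ 3) do
        trajectories
        |> Enum.sort_by(& &1.score, :desc)
        |> Enum.take(k)
        |> Enum.reduce(program, fn trajectory, prog -> strategy_fn.(prog, trajectory) end)
      end
    end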

Now, here’s the key part regarding types:

SIMBA itself doesn’t deeply care about the internal types of the data (like int vs. str). It cares about the structure of the data, specifically the separation of inputs and outputs.

As long as the system can reliably say:

  • “For these inputs…”
  • “…the program produced these outputs…”
  • “…which resulted in a score of X.”

…then the SIMBA loop can function. It treats the inputs and outputs as opaque blobs of data. It takes the successful “output blob” and staples it to the “input blob” to create a new demonstration.
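
Concretely, the stapling step can be as small as a single map merge. A hedged sketch, assuming a trajectory is just a map with :inputs and :outputs keys:

    # Hypothetical demo builder: merge the output blob onto the input blob.
    # SIMBA relies only on the inputs/outputs split, never on the value types.
    defmodule DemoSketch do
      def demo_from_trajectory(%{inputs: inputs, outputs: outputs}) do
        Map.merge(inputs, outputs)
      end
    end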

How Your DSPEx Implementation Fulfills This “Minimal Contract”

Your implementation, while lacking true typed data, successfully provides this minimal contract (the sketch after the list shows how the pieces fit together):

  1. DSPEx.Example: Your Example struct has a clear concept of input_keys, allowing it to differentiate between inputs(example) and outputs(example). This is the most critical piece. It provides the structural separation SIMBA needs.
  2. DSPEx.Program.forward/3: This function takes an inputs map and returns an outputs map. It fulfills the “execution” part of the contract.
  3. metric_fn: Your metric function takes the gold Example and the predicted outputs map and produces a score. This fulfills the “evaluation” part.
  4. AppendDemo Strategy: Your AppendDemo strategy takes a successful Trajectory (which contains inputs and outputs maps), merges them, and creates a new Example. This fulfills the “strategy” part.
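
Wired together, that contract looks roughly like the sketch below. The calls shown (DSPEx.Example.inputs/1, DSPEx.Program.forward/3 with an assumed options argument, a 2-arity metric_fn) follow the descriptions in the list above, but the exact module paths and signatures are assumptions, not the verified API.

    # Hedged sketch of the minimal contract using the names described above.
    defmodule ContractSketch do
      def score_example(program, example, metric_fn, opts \\ []) do
        inputs = DSPEx.Example.inputs(example)                   # structural split
        outputs = DSPEx.Program.forward(program, inputs, opts)   # execution
        score = metric_fn.(example, outputs)                     # evaluation
        %{inputs: inputs, outputs: outputs, score: score}        # a trajectory record
      end
    end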

The “Missing Type” Blind Spot:

The LLM in your current system is just producing strings. If a signature expects %{answer: "42", confidence: "high"}, the model produces exactly that—a map of strings. Your metric function then compares these strings.

For example, your metric_fn might do this: if gold_example.outputs.answer == predicted_outputs.answer, do: 1.0, else: 0.0

This works perfectly fine if gold_example.outputs.answer is "42" and the LLM produces a map where predicted_outputs.answer is also "42". SIMBA sees this as a success and can create a demo from it. It never needed to know that 42 was an integer.
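
Two one-liners make the blind spot visible: string equality is all the metric ever exercises, and the integer 42 never enters the comparison at all.

    "42" == "42"   # => true, so SIMBA records a success and can build a demo from it
    "42" == 42     # => false; Elixir does not coerce, so the integer form would never match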

So, yes, it’s entirely possible for your SIMBA implementation to be a “faithful port” of the algorithmic loop while the underlying data handling is a simplified, string-based version of the original. The Claude agent that helped you was correct from a procedural standpoint. The algorithm’s steps—sample, evaluate, bucket, strategize—are all there.

Why This Becomes a Problem (and why you were right to be suspicious)

This “string-only” world breaks down and reveals its weakness in more sophisticated scenarios, which is where true typed data becomes essential:

  1. Richer Metrics: What if your metric function needs to do math?

    # In Elixir this comparison never raises: any string sorts greater than any number,
    # so "0.9" > 0.8 is always true (and so is "0.1" > 0.8), silently skewing the score.
    # Real arithmetic, e.g. pred.confidence * 2, would raise ArithmeticError on a string.
    def my_metric(_gold, pred), do: if(pred.confidence > 0.8, do: 1.0, else: 0.0)
    

    To make this behave correctly in your current system, you’d have to litter your metric functions with String.to_float/1 and other manual parsing, which is brittle (String.to_float/1 itself raises on a string like "42" that has no decimal point).

  2. Complex Data Structures: What if an output field is a list of objects?

    • dspy (with Pydantic): The signature could be output: list[Item], where Item is another Pydantic model. The LLM would be guided to return a valid JSON array of objects, and Pydantic would parse it into a list of typed Item structs.
    • DSPEx (current): The LLM would return a string that looks like a list of objects: "[%{name: \"A\"}, %{name: \"B\"}]". Your program would have to manually parse this string, losing all the safety and convenience.
  3. Chain of Thought & Program Composition: Imagine Program A outputs a field count: 42. Program B takes count as an input and needs to perform count + 1. In your current system, Program B would receive the string "42" and the operation "42" + 1 would fail. In a typed system, it would correctly receive the integer 42, as the sketch below illustrates.
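
A minimal illustration of that composition failure, with a hypothetical increment step standing in for Program B:

    # Hypothetical downstream step inside Program B
    defmodule ProgramBSketch do
      def next_count(%{count: count}), do: count + 1
    end

    ProgramBSketch.next_count(%{count: 42})    # => 43
    ProgramBSketch.next_count(%{count: "42"})  # ** (ArithmeticError) bad argument in arithmetic expression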

  • Is your SIMBA implementation “faithful”? Yes, from an algorithmic perspective. It correctly implements the stochastic hill-climbing loop.
  • Is it “fully implemented” in the spirit of dspy? No, because it lacks the robust, typed data foundation that makes the original so powerful and reliable.
  • Why didn’t this seem to matter for SIMBA? Because SIMBA’s core loop can operate on structurally separated but type-agnostic data blobs. The simplification didn’t break the algorithm; it just limited its power and the sophistication of the programs it can optimize.

Your plan to build the Exdantic (DSPEx.Schema) layer is precisely the right move. It will elevate your entire system from a string-based approximation to a true, typed, structured data platform, unlocking the full potential of not just SIMBA, but every other component in DSPEx.