
01 new

Documentation for 01_new from the Dspex repository.

Having synthesized the provided documents on the dspy and dspex architectures, the visionary goals for “Cognitive Control Planes,” and the detailed specifications for a state-of-the-art “Scientific Evaluation Framework,” we can now propose a new, comprehensive evaluation system for dspex.

This is not merely an upgrade to dspy.Evaluate; it is a fundamental pivot to a system designed to support the unique, ambitious goals of dspex—a platform for rigorous, scientific inquiry into the behavior of complex, adaptive AI systems.


A New Evaluation System for DSPex: From Engineering Validation to Scientific Inquiry

The existing dspy.Evaluate module is a capable tool for engineering validation: it efficiently computes a score for a program against a dataset. However, the vision laid out for dspex—with its Cognitive Variables, multi-agent MABEAM architectures, and real-time cognitive orchestration—demands a system built not just for validation, but for scientific discovery.

The ad-hoc nature of running an evaluation, getting a score, and manually iterating is insufficient for a platform designed to optimize and understand emergent cognitive architectures. Therefore, we propose a new evaluation system for dspex that directly implements the principles from the “Scientific Evaluation Framework” documents. This new system moves beyond simple scoring to become a comprehensive platform for managing experiments, testing hypotheses, and ensuring reproducible results.

Core Architectural Pillars

The new dspex evaluation framework will be built on three core pillars, implemented as robust, stateful Elixir GenServers that orchestrate both native Elixir and Python-side execution.
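
As an architectural sketch (module names are assumed, not final), the three pillars could run as siblings under a single supervisor in the dspex application tree:

defmodule DSPex.Evaluation.Supervisor do
  use Supervisor

  def start_link(opts \\ []), do: Supervisor.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    children = [
      DSPex.Evaluation.Harness,
      DSPex.Evaluation.ExperimentJournal,
      DSPex.Evaluation.ReproducibilityManager
    ]

    # One pillar crashing (e.g., a failed Python-side call inside the Harness) should
    # not take down the journal's record of in-flight experiments.
    Supervisor.init(children, strategy: :one_for_one)
  end
end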

Pillar 1: The ExperimentJournal - The Scientist’s Notebook

This is the central innovation and the highest-level abstraction. Instead of just “running an evaluation,” scientists and developers will now “conduct an experiment.” The ExperimentJournal formalizes the scientific method.

  • Functionality: It manages the entire lifecycle of an AI experiment, from hypothesis to conclusion.
  • Key Features:
    • Hypothesis Management: Allows formal registration of a research hypothesis (e.g., “Using a ReAct module will improve accuracy on multi-hop questions over ChainOfThought, at the cost of increased latency”).
    • Experimental Design: Provides a DSL to define independent, dependent, and controlled variables, ensuring sound experimental design.
    • Automated Execution: Orchestrates the EvaluationHarness to run the designed experiment.
    • Reproducibility Packages: At the conclusion of an experiment, it automatically generates a complete, verifiable package containing the code, data manifest, system configuration, and results needed for full reproducibility.

defmodule DSPex.Evaluation.ExperimentJournal do
  @moduledoc "Manages the full lifecycle of a scientific AI experiment."

  # --- Public API (function heads only in this sketch; each call is delegated
  # to the underlying GenServer process) ---
  def start_link(opts \\ [])
  def register_hypothesis(journal, hypothesis_spec)
  def design_experiment(journal, hypothesis_id, design_spec)
  def execute_experiment(journal, experiment_id)
  def get_results(journal, experiment_id)
  def generate_reproducibility_package(journal, experiment_id)
end

# --- Example Usage ---
# Instead of: evaluate(program, dataset, metric)
# The new workflow is:
{:ok, journal} = DSPex.Evaluation.ExperimentJournal.start_link()

hypothesis = %{
  research_question: "Does a higher `conservatism` variable improve factuality?",
  independent_variables: [:conservatism_level],
  dependent_variables: [:factuality_score, :latency_ms]
}
{:ok, hypo_id} = DSPex.Evaluation.ExperimentJournal.register_hypothesis(journal, hypothesis)

design = %{
  controlled_variables: [:model, :dataset],
  randomization_strategy: :bootstrap_sampling,
  sample_size: 500,
  statistical_tests: [:paired_t_test, :cohens_d]
}
{:ok, exp_id} = DSPex.Evaluation.ExperimentJournal.design_experiment(journal, hypo_id, design)

{:ok, results} = DSPex.Evaluation.ExperimentJournal.execute_experiment(journal, exp_id)
# results contain not just a score, but statistical analysis, insights, and a reproducibility package.
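
Internally, the journal is a stateful GenServer. A minimal sketch of that shape, assuming the state is simply two maps keyed by generated ids (the keys and return values below are illustrative, not final):

defmodule DSPex.Evaluation.ExperimentJournal.Sketch do
  use GenServer

  # --- Client API ---
  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts)
  def register_hypothesis(journal, spec), do: GenServer.call(journal, {:register_hypothesis, spec})

  # --- Server callbacks ---
  @impl true
  def init(_opts), do: {:ok, %{hypotheses: %{}, experiments: %{}}}

  @impl true
  def handle_call({:register_hypothesis, spec}, _from, state) do
    # Assign an opaque id and record the hypothesis spec in the journal state.
    id = System.unique_integer([:positive])
    {:reply, {:ok, id}, put_in(state, [:hypotheses, id], spec)}
  end
end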

Pillar 2: The EvaluationHarness - The Lab Equipment

This is the powerful execution engine that performs the actual measurements. It is a stateful service that manages benchmark datasets, model configurations, and the execution of evaluation runs.

  • Functionality: Runs a battery of tests against a given AI program, collecting a rich set of metrics far beyond a single score.
  • Key Features:
    • Multi-Modal Evaluation: Natively supports evaluating text, code, multi-agent coordination, and complex reasoning patterns.
    • Comprehensive Metrics: Calculates not just accuracy, but also statistical profiles (mean, stddev, confidence intervals), cost analytics (token usage, financial cost), robustness metrics (e.g., performance under adversarial attack), and fairness metrics.
    • Statistical Rigor: Implements cross-validation, bootstrap sampling, and various statistical tests (e.g., t-tests, ANOVA) to determine whether differences between models are statistically significant (see the comparison sketch after the API below).
    • Adaptive Testing: Intelligently adjusts the difficulty of test cases to efficiently find a model’s “capability boundary”—the point at which it begins to fail.

defmodule DSPex.Evaluation.Harness do
  @moduledoc "The core execution engine for running evaluations."

  # --- Public API (function heads only in this sketch; the harness itself is a
  # stateful GenServer) ---
  def start_link(opts \\ [])
  def run_evaluation(harness, program, dataset, metrics)
  def compare_models(harness, %{model_a: prog_a, model_b: prog_b}, dataset, metrics)
  def find_capability_boundary(harness, program, dataset)
end
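
As a concrete illustration of the Statistical Rigor feature, the sketch below shows the kind of paired comparison compare_models could perform internally: a bootstrap confidence interval over per-example score differences, plus Cohen’s d for effect size. The module and function names are illustrative, not part of the proposed public API; p-values from a full paired t-test would come from the statistics library adopted in Phase 1.

defmodule DSPex.Evaluation.Stats.Sketch do
  @moduledoc false

  @doc "95% bootstrap confidence interval for the mean of the paired score differences (b - a)."
  def bootstrap_diff_ci(scores_a, scores_b, resamples \\ 1_000) do
    diffs = Enum.zip_with(scores_a, scores_b, fn a, b -> b - a end)
    n = length(diffs)

    # Resample the per-example differences with replacement and record each mean.
    raw_means =
      for _ <- 1..resamples do
        resample = for _ <- 1..n, do: Enum.at(diffs, :rand.uniform(n) - 1)
        Enum.sum(resample) / n
      end

    means = Enum.sort(raw_means)
    lower = Enum.at(means, round(0.025 * resamples))
    upper = Enum.at(means, round(0.975 * resamples) - 1)
    %{mean_diff: Enum.sum(diffs) / n, ci_95: {lower, upper}}
  end

  @doc "Cohen's d for paired samples: mean difference over the std dev of differences (n >= 2)."
  def cohens_d(scores_a, scores_b) do
    diffs = Enum.zip_with(scores_a, scores_b, fn a, b -> b - a end)
    n = length(diffs)
    mean = Enum.sum(diffs) / n
    variance = Enum.sum(Enum.map(diffs, fn d -> (d - mean) * (d - mean) end)) / (n - 1)
    mean / :math.sqrt(variance)
  end
end

For example, bootstrap_diff_ci([0.81, 0.79, 0.85], [0.84, 0.80, 0.88]) would report the mean improvement of the second score list together with its 95% interval, which is the shape of answer the Harness needs to back a claim of statistical significance.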

Pillar 3: The ReproducibilityManager - The Publisher

To meet the standards of scientific work, every experiment must be reproducible. This component makes reproducibility a first-class deliverable.

  • Functionality: Captures the complete state of an experiment and packages it for verification and sharing.
  • Key Features:
    • System State Capture: Snapshots the exact versions of all code, libraries, and dependencies.
    • Data Manifest: Creates a manifest of all datasets used, with checksums to ensure data integrity (see the sketch after this list).
    • Configuration Archiving: Stores the exact CognitiveConfiguration (all variable settings) used for the run.
    • Execution Instructions: Generates a script or instructions to perfectly replicate the experiment.
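
The Data Manifest feature can be quite small. The sketch below (module and function names are illustrative, not the proposed API) builds a per-file SHA-256 manifest with Erlang’s :crypto module so a later run can verify it is evaluating against byte-identical data.

defmodule DSPex.Evaluation.Reproducibility.ManifestSketch do
  @moduledoc false

  @doc "Returns a map of dataset path => hex-encoded SHA-256 checksum."
  def data_manifest(dataset_paths) do
    Map.new(dataset_paths, fn path ->
      checksum =
        path
        |> File.read!()
        |> then(&:crypto.hash(:sha256, &1))
        |> Base.encode16(case: :lower)

      {path, checksum}
    end)
  end
end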

Key Innovations and How They Address dspex’s Needs

This new framework is not just an improvement; it is a necessary pivot to support the core vision of dspex:

  1. Support for Cognitive Variables & Optimizers:

    • Problem: Optimizers like SIMBA-C don’t produce one “best” program; they explore a vast landscape of program configurations. A single evaluation score is meaningless.
    • Solution: The ExperimentJournal allows an optimizer to be treated as a scientific process. The “hypothesis” is that the optimizer can find a better configuration. The EvaluationHarness can then perform statistically sound comparisons between the optimizer’s proposals (e.g., Program A with temperature=0.7 vs. Program B with temperature=0.9), providing the rich feedback the optimizer needs.
  2. Evaluating Emergent & Complex Behavior:

    • Problem: How do you evaluate a multi-agent MABEAM system? Simple accuracy is insufficient. You need to measure coordination effectiveness, communication overhead, and individual agent contribution.
    • Solution: The EvaluationHarness has specialized evaluation protocols for complex systems, including MultiAgentEvaluation and ReasoningEvaluation. It can analyze the interaction logs from MABEAM to assess coordination patterns and logical consistency in a ChainOfThought, providing far deeper insight than a final answer check.
  3. Rigor for Scientists:

    • Problem: Scientists need more than a number; they need confidence intervals, p-values, and effect sizes to make credible claims.
    • Solution: The framework integrates statistical testing directly into the evaluation process. The output of compare_models is not just “Model A scored 85% and Model B scored 87%,” but “Model B’s two-percentage-point improvement over Model A is statistically significant (p < 0.05) with a small effect size (Cohen’s d = 0.23).”
  4. Adaptive Testing for Capability Analysis:

    • Problem: Benchmarks can saturate, and a 99% score on an easy dataset tells you little. Scientists need to understand a model’s limits.
    • Solution: The adaptive evaluation feature (exposed as find_capability_boundary in the Harness API) acts like a Socratic questioner, progressively increasing the difficulty of tasks to pinpoint exactly where a model’s capabilities break down; a sketch of this search follows this list. This is far more informative than a static benchmark score.
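
A minimal sketch of the search behind this adaptive evaluation, assuming accuracy decreases monotonically with difficulty on a 0.0..1.0 scale. The eval_at_difficulty function is a caller-supplied callback introduced here for illustration; it is assumed to evaluate the program on a slice of the dataset at the given difficulty and return an accuracy.

defmodule DSPex.Evaluation.Adaptive.Sketch do
  @threshold 0.5   # accuracy below this counts as failing
  @tolerance 0.05  # stop when the difficulty interval is this narrow

  @doc "Bisects the difficulty scale to find where accuracy drops below the threshold."
  def capability_boundary(eval_at_difficulty, low \\ 0.0, high \\ 1.0) do
    mid = (low + high) / 2

    cond do
      high - low < @tolerance ->
        mid

      eval_at_difficulty.(mid) >= @threshold ->
        # The program still succeeds here; the boundary lies at a higher difficulty.
        capability_boundary(eval_at_difficulty, mid, high)

      true ->
        # The program already fails here; the boundary lies at a lower difficulty.
        capability_boundary(eval_at_difficulty, low, mid)
    end
  end
end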

High-Level Implementation Plan

  1. Phase 1: Foundational Harness & Statistical Metrics (Months 1-2)

    • Implement the core DSPex.Evaluation.Harness GenServer.
    • Integrate a robust Elixir statistics library (e.g., Statistex).
    • Implement the core evaluation loop with comprehensive metric collection (mean, stddev, confidence intervals).
    • Provide compare_models with basic paired t-tests.
    • This phase replaces the functionality of dspy.Evaluate with a statistically rigorous version.
  2. Phase 2: The Scientific Layer (Months 3-4)

    • Implement the ExperimentJournal GenServer.
    • Develop the DSL for defining hypotheses and experimental designs.
    • Implement the ReproducibilityManager to package experiment artifacts.
    • This phase introduces the core scientific workflow.
  3. Phase 3: Advanced & Cognitive Evaluation (Months 5-6)

    • Implement the specialized evaluation protocols: ReasoningEvaluation, MultiAgentEvaluation, and AdaptiveEvaluation.
    • Integrate the evaluation framework deeply with the CognitiveVariable system, allowing the ExperimentJournal to directly test hypotheses about variable impacts.

Conclusion

Pivoting dspex’s evaluation system from a simple engineering tool to a comprehensive scientific framework is essential to realizing its full potential. The existing dspy.Evaluate is a functional starting point, but it cannot support the sophisticated analysis required for a platform built around optimizable cognitive architectures and multi-agent systems.

The proposed Scientific Evaluation Framework, centered on the ExperimentJournal, provides the necessary rigor, reproducibility, and deep insight for scientists and researchers. It transforms evaluation from a final, one-off step into an integral, continuous part of the AI development lifecycle, making dspex a true platform for the science of AI.