
20250713 p3of4

Documentation for 20250713_p3of4 from the Dspex repository.

Excellent. With the native execution core in place from Phase 2, Phase 3 focuses on building out the production-grade MLOps capabilities and deepening the integration with the broader Elixir ecosystem. This is where DSPex evolves from a powerful library into a complete, manageable platform.


5. Detailed Roadmap: Phase 3 - Native Optimization & MLOps

Objective: To provide a comprehensive, native Elixir toolkit for the entire lifecycle of an LLM program: from data handling and advanced optimization to evaluation and monitoring. This phase solidifies DSPex as a self-contained, production-ready platform.

Month 5: Data, Retrieval, and Advanced Evaluation

Each week below lists its epic, key tasks and implementation details, and success criteria.
Week 17 (3.1) Native Dataset & Retrieval Foundation

Key tasks and implementation details (based on the Retrieval System gap analysis):
• DSPex.Dataset Resource: An Ash resource for managing datasets, including metadata, schema, and statistics.
• DSPex.Retrieve Behaviour: A native Elixir behaviour defining the interface for all retrieval models (RMs).
• PGVector Adapter: Implement the first native RM adapter for pgvector, using Ecto to perform <=> vector similarity searches directly against a PostgreSQL database (a sketch follows below).
• PythonRetriever Adapter: Create a hybrid adapter that can delegate retrieval calls to any dspy RM via the Python bridge.

Success criteria:
✅ Can create a Dataset in Ash.
✅ A native DSPex program can use the PGVector adapter to retrieve documents for a RAG pipeline.
✅ Can successfully use a Python-based retriever (e.g., dspy.ColBERTv2) from an Elixir program.
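To make the retrieval pieces concrete, here is a minimal sketch of what the DSPex.Retrieve behaviour and the PGVector adapter could look like. The callback shape, the "documents" source and its columns, the DSPex.LM.embed/1 call, and DSPex.Repo are illustrative assumptions; it also assumes the pgvector Hex package's Ecto/Postgrex types are configured for the Repo.

```elixir
defmodule DSPex.Retrieve do
  @moduledoc "Hypothetical behaviour that all retrieval models (RMs) implement."

  @callback retrieve(query :: String.t(), opts :: keyword()) ::
              {:ok, [%{text: String.t(), score: float()}]} | {:error, term()}
end

defmodule DSPex.Retrieve.PGVector do
  @moduledoc "Sketch of a native RM adapter backed by pgvector."
  @behaviour DSPex.Retrieve

  import Ecto.Query

  @impl true
  def retrieve(query, opts \\ []) do
    k = Keyword.get(opts, :k, 5)

    # Hypothetical call that embeds the query via the configured LM client.
    {:ok, embedding} = DSPex.LM.embed(query)
    vector = Pgvector.new(embedding)

    results =
      from(d in "documents",
        # `<=>` is pgvector's distance operator; smaller means more similar.
        order_by: fragment("embedding <=> ?", ^vector),
        limit: ^k,
        select: %{text: d.body, score: fragment("embedding <=> ?", ^vector)}
      )
      |> DSPex.Repo.all()

    {:ok, results}
  end
end
```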
Week 18 (3.2) Scientific Evaluation Framework

Key tasks and implementation details (based on 1203_SCIENTIFIC_EVALUATION_FRAMEWORK.md):
• DSPex.Evaluate.Harness: A module to run a program over a dataset and apply one or more metric functions (a sketch follows below).
• DSPex.Evaluate.Metrics: Implement native versions of common dspy metrics: answer_exact_match, answer_passage_match, f1_score.
• Multi-threaded Evaluation: The harness must use Task.async_stream to evaluate the dataset in parallel.
• Result Aggregation: The harness should return aggregate scores (average, pass rate) and a list of detailed results for each example.

Success criteria:
✅ DSPex.Evaluate.Harness.run(program, dev_set, [metric_fn]) executes and returns a detailed evaluation report with an aggregate score.
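A minimal sketch of how the harness could drive Task.async_stream and aggregate scores. DSPex.Program.execute/2, the example shape (example.inputs), and the report structure are assumptions, not the final API.

```elixir
defmodule DSPex.Evaluate.Harness do
  @moduledoc "Sketch of the parallel evaluation harness; names are illustrative."

  @doc "Run `program` over `examples`, applying each metric in `metric_fns`."
  def run(program, examples, metric_fns, opts \\ []) do
    concurrency = Keyword.get(opts, :max_concurrency, System.schedulers_online())

    results =
      examples
      |> Task.async_stream(
        fn example ->
          prediction = DSPex.Program.execute(program, example.inputs)
          scores = Enum.map(metric_fns, fn metric -> metric.(example, prediction) end)
          %{example: example, prediction: prediction, scores: scores}
        end,
        max_concurrency: concurrency,
        timeout: :timer.minutes(2)
      )
      |> Enum.map(fn {:ok, result} -> result end)

    %{results: results, aggregate: aggregate(results, metric_fns)}
  end

  # One average per metric, in the same order as `metric_fns`
  # (boolean metrics count as 1.0 / 0.0).
  defp aggregate(results, metric_fns) do
    metric_fns
    |> Enum.with_index()
    |> Enum.map(fn {_metric, i} ->
      scores = Enum.map(results, fn r -> to_score(Enum.at(r.scores, i)) end)
      Enum.sum(scores) / max(length(scores), 1)
    end)
  end

  defp to_score(true), do: 1.0
  defp to_score(false), do: 0.0
  defp to_score(score) when is_number(score), do: score * 1.0
end
```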
Week 19 (3.3) Native SIMBA Optimizer

Key tasks and implementation details (based on the Teleprompters gap analysis):
• Implement DSPex.Optimizers.SIMBA as a native Elixir teleprompter (a skeleton follows below).
• Core Loop: The optimizer iteratively selects a module, generates a simple change to its signature’s instructions, and evaluates the change.
• Integration: SIMBA will use the native Evaluation Harness to score each modification.
• Parameter Tracking: It will update the Program’s Ash resource with the best-found instructions after the optimization run is complete.

Success criteria:
✅ DSPex.Optimizers.SIMBA.compile(program, trainset) successfully runs and measurably improves the program’s accuracy on a dev set by modifying its prompts.
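An illustrative skeleton of the core loop described above. The module-selection and instruction-mutation steps are placeholders, the DSPex.Program helpers (modules/1, update_instructions/3, save_instructions/1) are hypothetical, and scoring reuses the harness sketch from Week 18.

```elixir
defmodule DSPex.Optimizers.SIMBA do
  @moduledoc "Skeleton of the SIMBA core loop; mutation and selection logic are placeholders."

  def compile(program, trainset, opts \\ []) do
    steps = Keyword.get(opts, :max_steps, 10)
    metric = Keyword.fetch!(opts, :metric)

    baseline = score(program, trainset, metric)

    {optimized, _best_score} =
      Enum.reduce(1..steps, {program, baseline}, fn _step, {best, best_score} ->
        candidate =
          best
          |> pick_module()
          |> propose_instruction_change(best)

        case score(candidate, trainset, metric) do
          better when better > best_score -> {candidate, better}
          _worse_or_equal -> {best, best_score}
        end
      end)

    # Persist the winning instructions back onto the Program's Ash resource.
    :ok = DSPex.Program.save_instructions(optimized)
    optimized
  end

  defp score(program, dataset, metric) do
    %{aggregate: [avg]} = DSPex.Evaluate.Harness.run(program, dataset, [metric])
    avg
  end

  # Placeholder: pick one predictor module inside the program to mutate.
  defp pick_module(program), do: Enum.random(DSPex.Program.modules(program))

  # Placeholder: in the real optimizer an LM proposes the rewrite; here we just
  # append a fixed hint to show where the mutation happens.
  defp propose_instruction_change(module, program) do
    DSPex.Program.update_instructions(program, module, fn instructions ->
      instructions <> "\nBe concise and ground the answer in the retrieved context."
    end)
  end
end
```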
Week 20 (3.4) Experiment Management

Key tasks and implementation details (based on 1203_SCIENTIFIC_EVALUATION_FRAMEWORK.md):
• Experiment Ash Resource: Create a resource to track an entire optimization experiment, linking a Program, Optimizer, Dataset, and the final OptimizationResult (a sketch follows below).
• Hypothesis Management: Add fields for hypothesis, independent_variables (e.g., different prompts), and dependent_variables (e.g., accuracy, latency).
• Result Analysis: The resource should have calculations or actions to compare the baseline program’s performance against the optimized version.

Success criteria:
✅ An Experiment can be created, linking all components.
✅ After an optimization run, the Experiment record contains the full results, including the performance delta between the original and optimized program.
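A sketch of what the Experiment resource could look like, assuming Ash 3-style domains and AshPostgres. The attribute names follow the fields listed above; the related resources (DSPex.Program, DSPex.Dataset, DSPex.OptimizationResult) and the domain/repo modules are assumed to exist.

```elixir
defmodule DSPex.Experiment do
  use Ash.Resource,
    domain: DSPex.Domain,
    data_layer: AshPostgres.DataLayer

  postgres do
    table "experiments"
    repo DSPex.Repo
  end

  attributes do
    uuid_primary_key :id

    attribute :hypothesis, :string
    attribute :optimizer, :string
    attribute :independent_variables, :map
    attribute :dependent_variables, :map
    attribute :baseline_score, :float
    attribute :optimized_score, :float
  end

  relationships do
    belongs_to :program, DSPex.Program
    belongs_to :dataset, DSPex.Dataset
    belongs_to :optimization_result, DSPex.OptimizationResult
  end

  calculations do
    # Result analysis: how much the optimized program improved over the baseline.
    calculate :performance_delta, :float, expr(optimized_score - baseline_score)
  end
end
```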

Month 6: Production Monitoring & Tooling

As above, each week lists its epic, key tasks and implementation details, and success criteria.
Week 21 (3.5) Advanced Observability

Key tasks and implementation details (based on the Tools/Utilities gap analysis):
• DSPex.Telemetry: Define and emit a comprehensive set of Telemetry events for all key operations (program execution, optimization, tool calls, RAG queries); a sketch follows below.
• Phoenix.LiveDashboard Integration: Create a custom dashboard page for DSPex that shows:
  - Live-updating stats on active programs.
  - A chart of prediction latency over time.
  - A counter for total tokens used and estimated cost.
• Instrument the native LM Client to emit detailed metrics on provider latency and error rates.

Success criteria:
✅ The custom LiveDashboard page displays real-time metrics from a running DSPex application.
✅ Can trace a single DSPex.Program.execute call through Telemetry events.
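For example, program execution could be wrapped in :telemetry.span/3, and any consumer (LiveDashboard metrics, a logger, StatsD) can subscribe to the resulting events. The event names, metadata keys, and module layout below are illustrative, not the final DSPex.Telemetry contract.

```elixir
defmodule DSPex.Telemetry.Example do
  @moduledoc "Illustrative: how DSPex could emit and consume execution events."
  require Logger

  # Emitting side: :telemetry.span/3 fires [:dspex, :program, :execute, :start],
  # [..., :stop], and [..., :exception] events with duration measurements.
  def execute(program, inputs) do
    :telemetry.span([:dspex, :program, :execute], %{program_id: program.id}, fn ->
      result = DSPex.Program.execute(program, inputs)
      {result, %{program_id: program.id}}
    end)
  end

  # Consuming side: attach a handler that logs per-call latency. A LiveDashboard
  # page would subscribe to the same events via Telemetry.Metrics.
  def attach_logger do
    :telemetry.attach(
      "dspex-latency-logger",
      [:dspex, :program, :execute, :stop],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event(_event, %{duration: duration}, metadata, _config) do
    ms = System.convert_time_unit(duration, :native, :millisecond)
    Logger.info("DSPex program #{metadata.program_id} finished in #{ms} ms")
  end
end
```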
Week 22 (3.6) Advanced Prediction Modules

Key tasks and implementation details (based on the Predict Modules gap analysis):
• DSPex.Native.Retry: A module that wraps another predictor and retries it on failure, using the assertion framework from Phase 2.
• DSPex.Native.MultiChainComparison: A module that runs the same input through multiple ChainOfThought instances (perhaps with different prompts or temperatures) and then uses a final LLM call to vote on the best answer.
• DSPex.Native.Parallel: An interface that takes a list of inputs and runs them through a module concurrently using Task.async_stream, providing a simple dspy.Parallel-like API (sketches of Retry and Parallel follow below).

Success criteria:
✅ Can wrap a fallible predictor with Retry to improve its success rate.
✅ Can use MultiChainComparison to generate more robust answers than a single CoT.
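Minimal sketches of Retry and Parallel, assuming a hypothetical DSPex.Module.call/2 entry point that returns {:ok, result} or {:error, reason}.

```elixir
defmodule DSPex.Native.Retry do
  @moduledoc "Sketch: re-run a wrapped predictor until it succeeds or attempts run out."

  def call(predictor, input, max_attempts \\ 3) do
    Enum.reduce_while(1..max_attempts, {:error, :not_attempted}, fn _attempt, _last ->
      case DSPex.Module.call(predictor, input) do
        {:ok, _} = ok -> {:halt, ok}
        {:error, _} = error -> {:cont, error}
      end
    end)
  end
end

defmodule DSPex.Native.Parallel do
  @moduledoc "Sketch: dspy.Parallel-like fan-out over Task.async_stream."

  def run(module, inputs, opts \\ []) do
    inputs
    |> Task.async_stream(
      fn input -> DSPex.Module.call(module, input) end,
      max_concurrency: Keyword.get(opts, :max_concurrency, 8),
      timeout: Keyword.get(opts, :timeout, :timer.seconds(60)),
      on_timeout: :kill_task
    )
    |> Enum.map(fn
      # The inner result is already an {:ok, _} | {:error, _} tuple.
      {:ok, result} -> result
      {:exit, reason} -> {:error, {:exit, reason}}
    end)
  end
end
```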
Week 23 (3.7) Advanced Caching & State Management

Key tasks and implementation details:
• DSPex.Cache: Implement a more sophisticated, pluggable caching backend. Allow ETS for local caching and provide an adapter for Redis for distributed caching (a sketch follows below).
• Saving/Loading (dspy.save/dspy.load): Implement DSPex.Program.save/2 and DSPex.Program.load/1. This is simplified by Ash: save just means persisting the Ash record, and load involves reading the record and re-hydrating any necessary processes or state (like re-registering tools).

Success criteria:
✅ Can configure the system to use a Redis cache for LLM calls.
✅ Can stop the application, and upon restart, DSPex.Program.load(id) successfully restores a program to a runnable state.
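One way to make the cache pluggable is a small behaviour plus an ETS-backed default; a Redis adapter (for example, built on Redix) would implement the same callbacks. The :dspex / :cache_backend config key and function names below are assumptions.

```elixir
defmodule DSPex.Cache do
  @moduledoc "Sketch of a pluggable cache behaviour with a swappable backend."

  @callback get(key :: term()) :: {:ok, term()} | :miss
  @callback put(key :: term(), value :: term(), ttl_ms :: non_neg_integer() | :infinity) :: :ok

  # Resolve the configured backend (defaults to the ETS adapter below).
  def backend, do: Application.get_env(:dspex, :cache_backend, DSPex.Cache.ETS)

  @doc "Return the cached value for `key`, or compute it with `fun` and cache it."
  def fetch(key, fun) do
    case backend().get(key) do
      {:ok, value} ->
        value

      :miss ->
        value = fun.()
        backend().put(key, value, :infinity)
        value
    end
  end
end

defmodule DSPex.Cache.ETS do
  @moduledoc "Local, in-memory cache adapter (ignores TTL for simplicity)."
  @behaviour DSPex.Cache

  @table :dspex_cache

  defp ensure_table do
    if :ets.whereis(@table) == :undefined do
      :ets.new(@table, [:named_table, :public, :set, read_concurrency: true])
    end

    :ok
  end

  @impl true
  def get(key) do
    ensure_table()

    case :ets.lookup(@table, key) do
      [{^key, value}] -> {:ok, value}
      [] -> :miss
    end
  end

  @impl true
  def put(key, value, _ttl_ms) do
    ensure_table()
    :ets.insert(@table, {key, value})
    :ok
  end
end
```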
Week 24 (3.8) Documentation & Developer Experience

Key tasks and implementation details:
• Write Comprehensive Guides: Create tutorials for common patterns: building a RAG app, creating a custom optimizer, integrating with LiveView.
• API Reference: Ensure all public modules and functions have complete @moduledoc and @doc coverage.
• Cookbook Recipes: Add a “cookbook” section to the documentation with recipes for solving common problems.

Success criteria:
✅ A new developer can successfully build a functioning RAG application by following only the official DSPex documentation.

Phase 3 Deliverables & Outcome

By the end of Phase 3, DSPex will have matured into a powerful, self-contained platform. It will not only be capable of running complex AI workflows but also optimizing and evaluating them with a level of rigor and performance that is uniquely enabled by the Elixir ecosystem.

  • Key Deliverables:

    1. A native RAG system with at least one vector DB adapter.
    2. A scientific evaluation harness for rigorous model and program assessment.
    3. Native implementations of key optimizers (SIMBA) and advanced predictors (Retry, MultiChain).
    4. A rich set of Telemetry events and a pre-built LiveDashboard for production monitoring.
    5. A robust caching layer with support for distributed caches like Redis.
  • Resulting State of the System:

    • Self-Sufficient: The platform can now handle the entire program lifecycle—from data loading to optimization to evaluation—natively in Elixir for many common use cases.
    • Production-Ready: With deep observability and robust MLOps tooling (Experiment resource, Evaluation Harness), the system is ready for serious production workloads.
    • Hybrid Power: The system intelligently combines the best of both worlds: a high-performance native core for orchestration, evaluation, and common prediction patterns, with a strategic bridge to Python for specialized RMs and optimizers.

This phase moves the project beyond just being a “port” and establishes it as a powerful, standalone MLOps platform with a unique value proposition.

Shall we proceed to detail the final phase, Phase 4?