Excellent. With the native execution core in place from Phase 2, Phase 3 focuses on building out production-grade MLOps capabilities and deepening integration with the broader Elixir ecosystem. This is where DSPex evolves from a powerful library into a complete, manageable platform.
5. Detailed Roadmap: Phase 3 - Native Optimization & MLOps
Objective: To provide a comprehensive, native Elixir toolkit for the entire lifecycle of an LLM program: from data handling and advanced optimization to evaluation and monitoring. This phase solidifies DSPex as a self-contained, production-ready platform.
Month 5: Data, Retrieval, and Advanced Evaluation
| Week | Epic | Key Tasks & Implementation Details | Success Criteria |
|---|---|---|---|
| Week 17 | (3.1) Native Dataset & Retrieval Foundation | Based on the Retrieval System gap analysis:<br>• `DSPex.Dataset` resource: an Ash resource for managing datasets, including metadata, schema, and statistics.<br>• `DSPex.Retrieve` behaviour: a native Elixir behaviour defining the interface for all retrieval models (RMs); see the retrieval sketch after this table.<br>• PGVector adapter: the first native RM adapter, for `pgvector`, using Ecto to run `<=>` vector-similarity searches directly against a PostgreSQL database.<br>• `PythonRetriever` adapter: a hybrid adapter that can delegate retrieval calls to any `dspy` RM via the Python bridge. | ✅ Can create a `Dataset` in Ash.<br>✅ A native DSPex program can use the PGVector adapter to retrieve documents for a RAG pipeline.<br>✅ Can successfully use a Python-based retriever (e.g., `dspy.ColBERTv2`) from an Elixir program. |
| Week 18 | (3.2) Scientific Evaluation Framework | Based on `1203_SCIENTIFIC_EVALUATION_FRAMEWORK.md`:<br>• `DSPex.Evaluate.Harness`: a module to run a program over a dataset and apply one or more metric functions; see the harness sketch after this table.<br>• `DSPex.Evaluate.Metrics`: native versions of common dspy metrics: `answer_exact_match`, `answer_passage_match`, `f1_score`.<br>• Multi-threaded evaluation: the harness must use `Task.async_stream` to evaluate the dataset in parallel.<br>• Result aggregation: the harness returns aggregate scores (average, pass rate) and a list of detailed results for each example. | ✅ `DSPex.Evaluate.Harness.run(program, dev_set, [metric_fn])` executes and returns a detailed evaluation report with an aggregate score. |
| Week 19 | (3.3) Native SIMBA Optimizer | Based on the Teleprompters gap analysis:<br>• Implement `DSPex.Optimizers.SIMBA` as a native Elixir teleprompter.<br>• Core loop: the optimizer iteratively selects a module, generates a simple change to its signature's instructions, and evaluates the change; see the SIMBA sketch after this table.<br>• Integration: SIMBA uses the native evaluation harness to score each modification.<br>• Parameter tracking: it updates the `Program`'s Ash resource with the best-found instructions after the optimization run completes. | ✅ `DSPex.Optimizers.SIMBA.compile(program, trainset)` runs successfully and measurably improves the program's accuracy on a dev set by modifying its prompts. |
| Week 20 | (3.4) Experiment Management | Based on `1203_SCIENTIFIC_EVALUATION_FRAMEWORK.md`:<br>• `Experiment` Ash resource: tracks an entire optimization experiment, linking a `Program`, `Optimizer`, `Dataset`, and the final `OptimizationResult`; see the resource sketch after this table.<br>• Hypothesis management: fields for `hypothesis`, `independent_variables` (e.g., different prompts), and `dependent_variables` (e.g., accuracy, latency).<br>• Result analysis: calculations or actions that compare the baseline program's performance against the optimized version. | ✅ An `Experiment` can be created, linking all components.<br>✅ After an optimization run, the `Experiment` record contains the full results, including the performance delta between the original and optimized program. |
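To make these epics concrete, the sketches below show possible shapes for the Week 17-20 components. First, the retrieval behaviour and pgvector adapter: `MyApp.Repo`, `MyApp.Embedder`, and the `documents` table are hypothetical stand-ins, and the query assumes the `pgvector` Hex package (with its Postgrex extension) is configured. None of this is final API.

```elixir
defmodule DSPex.Retrieve do
  @moduledoc "Behaviour implemented by every retrieval model (RM) adapter."

  @callback retrieve(query :: String.t(), opts :: keyword()) ::
              {:ok, [%{text: String.t(), score: float()}]} | {:error, term()}
end

defmodule DSPex.Retrieve.PGVector do
  @moduledoc "Native RM adapter: cosine-distance search against pgvector."
  @behaviour DSPex.Retrieve

  import Ecto.Query

  @impl true
  def retrieve(query, opts \\ []) do
    k = Keyword.get(opts, :k, 5)

    # MyApp.Embedder is a hypothetical embedding client.
    with {:ok, embedding} <- MyApp.Embedder.embed(query) do
      vector = Pgvector.new(embedding)

      docs =
        from(d in "documents",
          # `<=>` is pgvector's cosine-distance operator.
          order_by: fragment("embedding <=> ?", ^vector),
          limit: ^k,
          select: %{
            text: d.text,
            score: fragment("1 - (embedding <=> ?)", ^vector)
          }
        )
        |> MyApp.Repo.all()

      {:ok, docs}
    end
  end
end
```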
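Next, a minimal sketch of the Week 18 evaluation harness. It assumes `DSPex.Program.execute/2` returns `{:ok, prediction}`, that examples carry an `inputs` field, and that each metric function takes `(example, prediction)` and returns a score between 0.0 and 1.0; the exact signatures are assumptions.

```elixir
defmodule DSPex.Evaluate.Harness do
  @moduledoc """
  Evaluates a program over a dataset in parallel, applying each metric
  function to every prediction and aggregating the scores.
  """

  def run(program, examples, metric_fns, opts \\ []) do
    results =
      examples
      |> Task.async_stream(
        fn example ->
          # Assumed contract: execute/2 returns {:ok, prediction}.
          {:ok, prediction} = DSPex.Program.execute(program, example.inputs)
          scores = Enum.map(metric_fns, fn metric -> metric.(example, prediction) end)
          %{example: example, prediction: prediction, scores: scores}
        end,
        max_concurrency: Keyword.get(opts, :max_concurrency, System.schedulers_online()),
        timeout: Keyword.get(opts, :timeout, 60_000)
      )
      |> Enum.map(fn {:ok, result} -> result end)

    # Each metric is assumed to return a score in 0.0..1.0.
    all_scores = Enum.flat_map(results, & &1.scores)

    %{
      results: results,
      average: Enum.sum(all_scores) / max(length(all_scores), 1),
      pass_rate: Enum.count(all_scores, &(&1 >= 1.0)) / max(length(all_scores), 1)
    }
  end
end
```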
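The Week 19 SIMBA core loop can then build on that harness. `DSPex.Program.modules/1` and `DSPex.Program.update_instructions/3` are assumed accessors, and `mutate/1` is a placeholder for the LM-driven instruction rewrite.

```elixir
defmodule DSPex.Optimizers.SIMBA do
  @moduledoc """
  Core-loop sketch: each step mutates one module's instructions, re-scores
  the program with the evaluation harness, and keeps the change only if
  the score improves.
  """

  def compile(program, trainset, metric_fn, opts \\ []) do
    steps = Keyword.get(opts, :steps, 10)
    baseline = score(program, trainset, metric_fn)

    {best, _best_score} =
      Enum.reduce(1..steps, {program, baseline}, fn _step, {best, best_score} ->
        # Pick one module and propose a simple change to its instructions.
        module = Enum.random(DSPex.Program.modules(best))
        candidate = DSPex.Program.update_instructions(best, module, mutate(module.instructions))

        case score(candidate, trainset, metric_fn) do
          better when better > best_score -> {candidate, better}
          _worse -> {best, best_score}
        end
      end)

    best
  end

  defp score(program, trainset, metric_fn) do
    DSPex.Evaluate.Harness.run(program, trainset, [metric_fn]).average
  end

  # The real optimizer would ask an LM to propose a rewrite; a fixed
  # suffix stands in for that here.
  defp mutate(instructions), do: instructions <> " Think step by step."
end
```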
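Finally for this month, an illustrative Ash resource for Week 20's experiment tracking. This is an Ash 2-style sketch using the ETS data layer so it stays self-contained; a real deployment would use AshPostgres, and the module and field names are assumptions.

```elixir
defmodule DSPex.MLOps.Experiment do
  # ETS data layer keeps the sketch self-contained; swap for
  # AshPostgres.DataLayer in a real deployment.
  use Ash.Resource, data_layer: Ash.DataLayer.Ets

  attributes do
    uuid_primary_key :id
    attribute :hypothesis, :string
    attribute :independent_variables, :map, default: %{}
    attribute :dependent_variables, :map, default: %{}
    attribute :baseline_score, :float
    attribute :optimized_score, :float
  end

  relationships do
    belongs_to :program, DSPex.Program
    belongs_to :dataset, DSPex.Dataset
  end

  calculations do
    # Performance delta between the optimized and baseline program.
    calculate :performance_delta, :float, expr(optimized_score - baseline_score)
  end

  actions do
    defaults [:create, :read, :update]
  end
end
```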
Month 6: Production Monitoring & Tooling
| Week | Epic | Key Tasks & Implementation Details | Success Criteria |
|---|---|---|---|
| Week 21 | (3.5) Advanced Observability | Based on the Tools/Utilities gap analysis:<br>• `DSPex.Telemetry`: define and emit a comprehensive set of Telemetry events for all key operations (program execution, optimization, tool calls, RAG queries); see the Telemetry sketch after this table.<br>• `Phoenix.LiveDashboard` integration: a custom dashboard page for DSPex showing live-updating stats on active programs, a chart of prediction latency over time, and a counter for total tokens used and estimated cost.<br>• Instrument the native LM client to emit detailed metrics on provider latency and error rates. | ✅ The custom LiveDashboard page displays real-time metrics from a running DSPex application.<br>✅ Can trace a single `DSPex.Program.execute` call through Telemetry events. |
| Week 22 | (3.6) Advanced Prediction Modules | Based on the Predict Modules gap analysis:<br>• `DSPex.Native.Retry`: a module that wraps another predictor and retries it on failure, using the assertion framework from Phase 2; see the sketches after this table.<br>• `DSPex.Native.MultiChainComparison`: runs the same input through multiple `ChainOfThought` instances (perhaps with different prompts or temperatures), then uses a final LLM call to vote on the best answer.<br>• `DSPex.Native.Parallel`: takes a list of inputs and runs them through a module concurrently via `Task.async_stream`, providing a simple `dspy.Parallel`-like API. | ✅ Can wrap a fallible predictor with `Retry` to improve its success rate.<br>✅ Can use `MultiChainComparison` to generate more robust answers than a single CoT. |
| Week 23 | (3.7) Advanced Caching & State Management | • `DSPex.Cache`: a more sophisticated, pluggable caching backend, with ETS for local caching and a Redis adapter for distributed caching; see the cache sketch after this table.<br>• Saving/loading (`dspy.save`/`dspy.load`): implement `DSPex.Program.save/2` and `DSPex.Program.load/1`. Ash simplifies this: `save` just persists the Ash record, while `load` reads the record and re-hydrates any necessary processes or state (such as re-registering tools). | ✅ Can configure the system to use a Redis cache for LLM calls.<br>✅ Can stop the application and, upon restart, `DSPex.Program.load(id)` successfully restores a program to a runnable state. |
| Week 24 | (3.8) Documentation & Developer Experience | • Comprehensive guides: tutorials for common patterns, such as building a RAG app, creating a custom optimizer, and integrating with LiveView.<br>• API reference: complete `@moduledoc` and `@doc` coverage for all public modules and functions.<br>• Cookbook recipes: a "cookbook" section in the documentation with recipes for solving common problems. | ✅ A new developer can successfully build a functioning RAG application by following only the official DSPex documentation. |
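As with Month 5, a few sketches make these epics concrete. For Week 21, the Telemetry instrumentation might look like the following; the `[:dspex, :program, :execute]` event names and `DSPex.Program.execute/2` are assumptions, not final API.

```elixir
# Emitting a span around program execution.
defmodule DSPex.Telemetry.Example do
  def execute_with_telemetry(program, inputs) do
    metadata = %{program_id: program.id}

    # Emits [:dspex, :program, :execute, :start | :stop | :exception]
    # events with duration measurements.
    :telemetry.span([:dspex, :program, :execute], metadata, fn ->
      result = DSPex.Program.execute(program, inputs)
      {result, metadata}
    end)
  end
end

# Attaching a handler (e.g. in Application.start/2) to log latency.
:telemetry.attach(
  "log-dspex-latency",
  [:dspex, :program, :execute, :stop],
  fn _event, %{duration: duration}, meta, _config ->
    ms = System.convert_time_unit(duration, :native, :millisecond)
    IO.puts("program #{meta.program_id} finished in #{ms}ms")
  end,
  nil
)
```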
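For Week 22, minimal versions of `Retry` and `Parallel`, again assuming the `DSPex.Program.execute/2` contract used above:

```elixir
defmodule DSPex.Native.Retry do
  @moduledoc "Wraps another predictor and retries it on failure."

  def execute(predictor, inputs, opts \\ []) do
    attempt(predictor, inputs, Keyword.get(opts, :max_retries, 3))
  end

  defp attempt(predictor, inputs, retries_left) do
    case DSPex.Program.execute(predictor, inputs) do
      {:ok, prediction} ->
        {:ok, prediction}

      {:error, _reason} when retries_left > 0 ->
        # A fuller implementation would feed assertion feedback from the
        # Phase 2 framework back into the retry prompt.
        attempt(predictor, inputs, retries_left - 1)

      {:error, reason} ->
        {:error, reason}
    end
  end
end

defmodule DSPex.Native.Parallel do
  @moduledoc "Runs many inputs through one module concurrently (dspy.Parallel-style)."

  def execute(module, inputs_list, opts \\ []) do
    inputs_list
    |> Task.async_stream(&DSPex.Program.execute(module, &1),
      max_concurrency: Keyword.get(opts, :max_concurrency, 10),
      timeout: Keyword.get(opts, :timeout, 60_000)
    )
    # Each stream element is {:ok, result}; unwrap to the underlying
    # {:ok, prediction} | {:error, reason} tuples.
    |> Enum.map(fn {:ok, result} -> result end)
  end
end
```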
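And for Week 23, a pluggable cache behaviour with an ETS backend. A Redis backend would implement the same callbacks on top of a client like Redix; the module and config names here are illustrative.

```elixir
defmodule DSPex.Cache do
  @moduledoc "Pluggable cache; the backend (ETS, Redis, ...) is chosen via config."

  @callback get(key :: term()) :: {:ok, term()} | :miss
  @callback put(key :: term(), value :: term(), ttl_ms :: pos_integer()) :: :ok

  # e.g. `config :dspex, :cache_backend, DSPex.Cache.ETS`
  defp backend, do: Application.get_env(:dspex, :cache_backend, DSPex.Cache.ETS)

  @doc "Returns the cached value for `key`, computing and storing it on a miss."
  def fetch(key, ttl_ms, fun) do
    case backend().get(key) do
      {:ok, value} ->
        value

      :miss ->
        value = fun.()
        :ok = backend().put(key, value, ttl_ms)
        value
    end
  end
end

defmodule DSPex.Cache.ETS do
  @moduledoc "Local cache backend backed by a public ETS table."
  @behaviour DSPex.Cache

  @table :dspex_cache

  # Call once at application start, e.g. from the supervision tree.
  def init, do: :ets.new(@table, [:named_table, :public, read_concurrency: true])

  @impl true
  def get(key) do
    now = System.monotonic_time(:millisecond)

    case :ets.lookup(@table, key) do
      [{^key, value, expires_at}] when expires_at > now -> {:ok, value}
      _expired_or_missing -> :miss
    end
  end

  @impl true
  def put(key, value, ttl_ms) do
    :ets.insert(@table, {key, value, System.monotonic_time(:millisecond) + ttl_ms})
    :ok
  end
end
```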
Phase 3 Deliverables & Outcome
By the end of Phase 3, DSPex will have matured into a powerful, self-contained platform. It will not only be capable of running complex AI workflows but also of optimizing and evaluating them with a level of rigor and performance that is uniquely enabled by the Elixir ecosystem.
Key Deliverables:
- A native RAG system with at least one vector DB adapter.
- A scientific evaluation harness for rigorous model and program assessment.
- Native implementations of key optimizers (`SIMBA`) and advanced predictors (`Retry`, `MultiChainComparison`).
- A rich set of Telemetry events and a pre-built LiveDashboard for production monitoring.
- A robust caching layer with support for distributed caches like Redis.
Resulting State of the System:
- Self-Sufficient: The platform can now handle the entire program lifecycle—from data loading to optimization to evaluation—natively in Elixir for many common use cases.
- Production-Ready: With deep observability and robust MLOps tooling (the `Experiment` resource and the Evaluation Harness), the system is ready for serious production workloads.
- Hybrid Power: The system intelligently combines the best of both worlds: a high-performance native core for orchestration, evaluation, and common prediction patterns, with a strategic bridge to Python for specialized RMs and optimizers.
This phase moves the project beyond just being a “port” and establishes it as a powerful, standalone MLOps platform with a unique value proposition.
Shall we proceed to detail the final phase, Phase 4?