← Back to Docs

DSPEX NUMERICAL AND VECTOR 01

Documentation for DSPEX_NUMERICAL_AND_VECTOR_01 from the Ds ex repository.

Excellent question. You are absolutely correct: we are not using numerical libraries like Nx (the Elixir equivalent of numpy) or data frame libraries like Explorer (the equivalent of pandas) in the DSPEx codebase thus far.

The provided analysis of the Python DSPy codebase is accurate, but our Elixir implementation has intentionally taken a different, more idiomatic approach. We have addressed the needs that pandas and numpy fulfill by using standard Elixir/OTP data structures and concurrency primitives.

Here is a detailed breakdown of how DSPEx handles each of the identified use cases without these external numerical libraries.


How DSPEx Handles pandas Use Cases (Data Handling & Presentation)

The core philosophy here is to have the library return standard, structured Elixir data types (maps, structs, lists) and leave the final presentation to the consumer of the library.

1. Evaluation and Results Display

  • Python DSPy Approach: Uses pandas.DataFrame to create and display rich, formatted tables of evaluation results, especially in Jupyter notebooks.
  • Elixir DSPEx Approach:
    • The DSPEx.Evaluate module does not handle presentation. Instead, its run/4 function returns a well-structured map (evaluation_result) containing all the raw statistical data.
    • Code Evidence (dspex/evaluate.ex):
      @type evaluation_result :: %{
        score: float(),
        stats: %{
          total_examples: non_neg_integer(),
          successful: non_neg_integer(),
          failed: non_neg_integer(),
          duration_ms: non_neg_integer(),
          success_rate: float(),
          throughput: float(),
          errors: [term()]
        }
      }
      
    • Why this way?: This is idiomatic Elixir. The library’s responsibility is to perform the computation and return structured data. The calling application (e.g., a Phoenix LiveView dashboard, a CLI tool, or an IEx session) is responsible for formatting that data for display. This makes the core library more reusable and less dependent on specific presentation environments.

2. Data Loading (from CSV)

  • Python DSPy Approach: Uses pandas.read_csv for robust data loading.
  • Elixir DSPEx Approach: This is a current gap.
    • The DSPEx codebase does not yet contain a dedicated data loader module equivalent to dspy.datasets.dataloader.
    • Future Plan: The idiomatic Elixir solution would be to use a dedicated CSV parsing library like NimbleCSV. It would parse a CSV file into a stream of maps, which would then be used to instantiate DSPEx.Example structs. The core logic would not depend on a “DataFrame” abstraction.

3. Interoperability (e.g., Snowflake)

  • Python DSPy Approach: Relies on external libraries like snowpark which provide a .to_pandas() method for data exchange.
  • Elixir DSPEx Approach: This is also a gap (as it’s a specific integration).
    • Future Plan: An Elixir database driver (e.g., Ecto or a specific Snowflake driver) would be used. These libraries return data as standard Elixir lists of maps or structs. DSPEx would consume these directly, without needing an intermediate DataFrame-like object. The principle is the same: rely on the standard data interchange format of the ecosystem.

How DSPEx Handles numpy Use Cases (Numerical & Vector Computation)

This is the most significant difference. The current DSPEx implementation does not yet contain the vector-based retrieval logic that necessitates numpy in DSPy. The components that are implemented use standard Elixir modules.

1. Vector Mathematics for Retrieval (Embeddings, Similarity, Ranking)

  • Python DSPy Approach: This is the critical use case for numpy, using np.array for storage, np.dot for similarity, and np.argsort for ranking.
  • Elixir DSPEx Approach: This functionality is not yet implemented.
    • The codebase has no equivalent to dspy/retrievers/embeddings.py. The “retrieval” demonstrated in the WeakStudentProgram test is a simple String.jaro_distance comparison, which is a string-based similarity metric, not a vector-based one.
    • Future Plan: This is where Nx (Numerical Elixir) would be introduced. The implementation would be a near-direct translation of the numpy logic:
      • np.ndarrayNx.Tensor: To store and manage embedding vectors.
      • np.dotNx.dot/2: For highly optimized dot-product similarity calculations.
      • np.linalg.normNx.LinAlg.norm/2: For vector normalization.
      • np.argsortNx.argsort/2: For efficiently finding the top-K results.
    • By deferring the introduction of Nx, the core application logic of DSPEx was built and stabilized first, with the performance-critical numerical component to be added later as a distinct feature.

2. Numerical & Statistical Operations in Optimizers

  • Python DSPy Approach: Uses np.exp, np.random, np.percentile, and np.average for advanced stochastic optimizers.
  • Elixir DSPEx Approach: The current optimizer, BootstrapFewShot, uses a simpler, more deterministic algorithm that doesn’t require these advanced statistical functions. It achieves its goals using standard Elixir Enum and Task modules.
    • Ranking: Instead of np.argsort, dspex/teleprompter/bootstrap_fewshot.ex uses Enum.sort_by/3 to rank demonstrations by their quality score.
    • Averaging: dspex/evaluate.ex calculates the average score using Enum.sum(scores) / length(scores), which is the direct Elixir equivalent of np.average.
    • Randomness: If random numbers were needed, Elixir’s built-in :rand module would be used.
    • Concurrency: Where DSPy might use numpy for batch operations, DSPEx uses Task.async_stream to achieve high performance through concurrency on multi-core systems, a different but equally powerful optimization strategy.

Summary Table: Python vs. Elixir Approaches

Use CasePython DSPy ApproachElixir DSPEx Approach (or Future Plan)
Evaluation Displaypandas.DataFrame.to_html()Returns a standard Elixir map. Presentation is the consumer’s job.
CSV Data Loadingpandas.read_csv()(Gap) Future: Use NimbleCSV to produce a stream of maps.
Vector Storagenumpy.ndarray(Gap) Future: Use Nx.Tensor.
Vector Similaritynumpy.dot or numpy.einsum(Gap) Future: Use Nx.dot/2.
Vector Rankingnumpy.argsort(Gap) Future: Use Nx.argsort/2.
Statistical Averagenumpy.averageEnum.sum/1 and length/1.
General Rankingnumpy.argsortEnum.sort_by/3.
Stochastic Opsnumpy.randomCurrent optimizers are deterministic. Future: use :rand module.