200 DSPy Dependency Review

Documentation for 200_dspy_dependency_review from the ds_ex repository.

Looking through the DSPy codebase, here are the major libraries used at key points (excluding concurrency/threading and Optuna):

Core Dependencies

1. LiteLLM (Critical - Core LM Interface)

  • Primary LM Backend: All language model calls go through LiteLLM
  • Provider Abstraction: Handles OpenAI, Anthropic, Google, local models, etc.
  • Response Formatting: Standardizes responses across different LM providers
  • Caching Integration: Built-in caching capabilities
  • Usage: clients/lm.py, throughout adapters, streaming
import litellm
response = litellm.completion(model=self.model, messages=messages, **kwargs)
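
Because provider abstraction happens at the model string, swapping providers is a one-line change. A minimal standalone sketch (the model names are illustrative, not DSPy defaults):
import litellm
messages = [{"role": "user", "content": "Hello"}]
# Same call shape for any provider; only the model string changes.
openai_resp = litellm.completion(model="openai/gpt-4o-mini", messages=messages)
claude_resp = litellm.completion(model="anthropic/claude-3-5-sonnet-20240620", messages=messages)
print(openai_resp.choices[0].message.content)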

2. Requests (HTTP Client)

  • Web Search: Used extensively in retrieve modules
  • API Calls: Custom integrations with various services
  • File Downloads: Downloading models, data
  • Usage: Most retriever modules, Databricks integration, web search
import requests
response = requests.get(url, headers=headers, timeout=10)

3. NumPy (Numerical Computing)

  • Embeddings: Vector operations for retrieval
  • Similarity Calculations: Cosine similarity, distance metrics
  • Array Operations: Data manipulation in retrievers
  • Usage: Embeddings, FAISS integration, various retrievers
import numpy as np
scores = np.dot(query_embeddings, doc_embeddings.T)
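
For cosine similarity in particular, vectors are normalized first so the dot product becomes the cosine score. A minimal sketch (the shapes are illustrative assumptions):
import numpy as np
query = np.random.rand(768).astype("float32")       # one query embedding
docs = np.random.rand(100, 768).astype("float32")   # 100 document embeddings
query /= np.linalg.norm(query)                      # unit-normalize
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
scores = docs @ query                               # cosine similarities, shape (100,)
top_k = np.argsort(scores)[::-1][:5]                # indices of the 5 best matches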

Data Processing & Serialization

4. JSON Libraries

  • ujson: Fast JSON serialization for performance-critical paths
  • json-repair: Robust JSON parsing for LM outputs
  • Usage: Throughout adapters, caching, data persistence
import ujson
import json_repair
data = json_repair.loads(completion)  # Handles malformed JSON from LMs
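
To make the robustness concrete, a small sketch (the broken string is a made-up stand-in for a truncated LM completion):
import json_repair
completion = '{"answer": "Paris", "confidence": 0.9'   # missing closing brace
data = json_repair.loads(completion)
print(data)  # {'answer': 'Paris', 'confidence': 0.9}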

5. PyYAML

  • Configuration Files: Loading YAML configs
  • Dataset Definitions: Dataset configuration files
  • Usage: alfworld integration, configuration management
import yaml
with open("config.yaml") as f:
    config = yaml.safe_load(f)

Machine Learning & AI

6. Transformers/HuggingFace (Optional but Common)

  • Local Models: Running local language models
  • Embeddings: Sentence transformers for retrieval
  • Tokenization: Token counting and processing
  • Usage: Local model providers, embedding generation
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained(model_name)

7. OpenAI SDK

  • Direct API Calls: Fine-tuning, embeddings
  • Structured Outputs: OpenAI-specific features
  • File Management: Uploading training data
  • Usage: OpenAI provider, fine-tuning, embeddings
import openai
response = openai.chat.completions.create(...)
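
For the embeddings use case, a hedged sketch with the v1-style client (the model name is an assumption, not DSPy's configured default):
from openai import OpenAI
client = OpenAI()  # reads OPENAI_API_KEY from the environment
emb = client.embeddings.create(model="text-embedding-3-small", input=["hello world"])
vector = emb.data[0].embedding  # list of floats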

8. Torch/PyTorch (Optional)

  • Local Model Inference: Running PyTorch models
  • GPU Operations: CUDA operations for local models
  • Tensor Operations: Model computations
  • Usage: Local model providers, custom model integration
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

9. FAISS (Optional - Vector Search)

  • Vector Search: Fast similarity search for embeddings
  • Indexing: Building and querying vector indices
  • Usage: Embeddings retriever, KNN operations
import faiss
index = faiss.IndexFlatL2(dimension)
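
A self-contained add/search round trip (dimension and corpus size are illustrative assumptions):
import faiss
import numpy as np
dimension = 768
index = faiss.IndexFlatL2(dimension)
index.add(np.random.rand(1000, dimension).astype("float32"))  # index 1,000 vectors
query = np.random.rand(1, dimension).astype("float32")
distances, indices = index.search(query, 5)                   # 5 nearest neighbors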

10. ChromaDB, Pinecone, Weaviate, etc.

  • Vector Database Clients: Each has its own integration
  • Document Storage: Persistent vector storage
  • Retrieval: Semantic search capabilities
  • Usage: Specific retriever modules for each DB
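
As one representative example, a minimal ChromaDB round trip (the collection name and documents are placeholders, not DSPy specifics):
import chromadb
client = chromadb.Client()                     # in-memory client
collection = client.create_collection("docs")
collection.add(
    ids=["1", "2"],
    documents=["DSPy compiles prompts.", "FAISS does vector search."],
)
results = collection.query(query_texts=["prompt compilation"], n_results=1)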

Data Validation & Type Safety

11. Pydantic (Already covered extensively)

  • Core Type System: Signatures, custom types, validation
  • JSON Schema: Structured outputs, API validation
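
A hedged illustration of the pattern (the Answer model is hypothetical, not a DSPy class):
from pydantic import BaseModel, Field

class Answer(BaseModel):
    text: str = Field(description="The final answer")
    confidence: float

Answer.model_validate({"text": "Paris", "confidence": 0.9})  # runtime validation
schema = Answer.model_json_schema()                          # JSON Schema for structured outputs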

12. Typing Extensions

  • Advanced Types: Union, Literal, get_origin, get_args
  • Type Introspection: Runtime type checking
  • Usage: Throughout signature system, type validation
from typing import List, Literal, Union, get_args, get_origin
annotation = List[str]
origin = get_origin(annotation)  # list
args = get_args(annotation)      # (str,)

Web Framework Integration

13. FastAPI/Starlette (Implied for web deployments)

  • API Endpoints: Web service deployment
  • Request Validation: Pydantic integration
  • Async Support: Async endpoint handling
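
A minimal deployment sketch under that assumption (FastAPI is not a declared DSPy dependency; the endpoint and request model here are hypothetical):
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    question: str

@app.post("/predict")
async def predict(query: Query):
    # A real deployment would call a compiled DSPy program here.
    return {"answer": f"echo: {query.question}"}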

Development & Testing

14. TQDM

  • Progress Bars: Long-running operations
  • User Feedback: Visual progress indication
  • Usage: Evaluation, training, data processing
from tqdm import tqdm
for item in tqdm(dataset, desc="Processing"):
    ...

15. Pandas (Data Analysis)

  • Data Manipulation: Dataset processing
  • Result Analysis: Evaluation results
  • CSV Handling: Data import/export
  • Usage: Evaluation framework, data processing
import pandas as pd
df = pd.DataFrame(results)
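
A small sketch of result analysis (the records are fabricated for illustration):
import pandas as pd
results = [{"example": 1, "correct": True}, {"example": 2, "correct": False}]
df = pd.DataFrame(results)
print(df["correct"].mean())            # accuracy over the evaluation set
df.to_csv("results.csv", index=False)  # CSV export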

Utilities & Helpers

16. Hashlib/UUID

  • Unique Identifiers: Example IDs, cache keys
  • Hashing: Content hashing for caching
  • Usage: Throughout for ID generation
import hashlib
import uuid
unique_id = str(uuid.uuid4())
cache_key = hashlib.sha256(content.encode()).hexdigest()  # content hashing for cache keys

17. Collections

  • defaultdict: Default dictionaries
  • Counter: Counting operations
  • deque: Queue operations
  • Usage: Throughout for data structures
from collections import defaultdict, Counter, deque
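
For instance (a generic sketch, not code from the repository):
from collections import Counter, defaultdict, deque
votes = Counter(["yes", "no", "yes"])   # Counter({'yes': 2, 'no': 1})
groups = defaultdict(list)
groups["train"].append("example-1")     # no KeyError on first access
recent = deque(maxlen=100)              # bounded history of recent items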

18. Inspect

  • Code Introspection: Getting function signatures
  • Source Code: Extracting source for optimization
  • Usage: Teleprompters, signature analysis
import inspect
signature = inspect.signature(func)
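
A self-contained sketch of both uses (the metric function is hypothetical):
import inspect

def metric(gold, pred):
    return gold == pred

print(inspect.signature(metric))  # (gold, pred)
print(inspect.getsource(metric))  # full source text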

19. Pathlib

  • File Operations: Modern path handling
  • Cross-platform: Path manipulation
  • Usage: File I/O, model saving/loading
from pathlib import Path
path = Path(directory) / filename

Specialized Integrations

20. CloudPickle

  • Serialization: Advanced object serialization
  • Model Persistence: Saving complex models
  • Usage: Model saving/loading
import cloudpickle
with open("model.pkl", "wb") as f:
    cloudpickle.dump(model, f)

21. Asyncer

  • Async Utilities: Converting sync to async
  • Thread Pool: Async thread execution
  • Usage: utils/asyncify.py
import asyncer
async_func = asyncer.asyncify(sync_func)
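
A runnable sketch of the conversion (the slow_sync function is hypothetical):
import asyncio
import asyncer

def slow_sync(x):
    return x * 2

async def main():
    result = await asyncer.asyncify(slow_sync)(21)  # runs in a worker thread
    print(result)  # 42

asyncio.run(main())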

22. DiskCache

  • Disk Caching: Persistent caching
  • Performance: Fast disk-based cache
  • Usage: Caching system
import diskcache
cache = diskcache.FanoutCache(directory)

23. CacheTools

  • Memory Caching: LRU and other cache strategies
  • Performance: In-memory caching
  • Usage: Caching system
import cachetools
cache = cachetools.LRUCache(maxsize=1000)
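
The decorator form is the typical usage (a generic sketch assuming an expensive pure function):
import cachetools

cache = cachetools.LRUCache(maxsize=1000)

@cachetools.cached(cache)
def embed(text: str):
    return hash(text)  # stand-in for an expensive call (e.g., an embedding request)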

Key Observations

  1. LiteLLM is absolutely critical - it’s the universal interface to all language models
  2. Pydantic forms the type system backbone - used everywhere for validation
  3. NumPy is essential for embeddings/retrieval - all vector operations depend on it
  4. JSON handling is crucial - both ujson and json-repair for performance and robustness
  5. Requests is the HTTP workhorse - used throughout for external API calls
  6. Various vector DB clients - each retrieval system has its own dependencies

The codebase is designed to be modular with optional dependencies - you can use DSPy without installing every vector database client, but LiteLLM, Pydantic, and the core Python libraries are essential.