Looking through the DSPy codebase, here are the major libraries used at key points (excluding concurrency/threading and Optuna):
Core Dependencies
1. LiteLLM (Critical - Core LM Interface)
- Primary LM Backend: All language model calls go through LiteLLM
- Provider Abstraction: Handles OpenAI, Anthropic, Google, local models, etc.
- Response Formatting: Standardizes responses across different LM providers
- Caching Integration: Built-in caching capabilities
- Usage: clients/lm.py, throughout adapters, streaming
import litellm
response = litellm.completion(model=self.model, messages=messages, **kwargs)
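To illustrate the provider abstraction, a minimal sketch (model strings follow LiteLLM's provider/model naming; running it requires the corresponding API keys, and the prompt is illustrative):
import litellm
# The same call shape works across providers; only the model string changes.
for model in ["openai/gpt-4o-mini", "anthropic/claude-3-haiku-20240307"]:
    response = litellm.completion(model=model, messages=[{"role": "user", "content": "Hello"}])
    print(response.choices[0].message.content)  # OpenAI-compatible response shape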
2. Requests (HTTP Client)
- Web Search: Used extensively in retrieve modules
- API Calls: Custom integrations with various services
- File Downloads: Downloading models, data
- Usage: Most retriever modules, Databricks integration, web search
response = requests.get(url, headers=headers, timeout=10)
3. NumPy (Numerical Computing)
- Embeddings: Vector operations for retrieval
- Similarity Calculations: Cosine similarity, distance metrics
- Array Operations: Data manipulation in retrievers
- Usage: Embeddings, FAISS integration, various retrievers
import numpy as np
scores = np.dot(query_embeddings, doc_embeddings.T)
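A raw dot product only equals cosine similarity for unit vectors; a minimal sketch of the normalized version retrievers typically compute (the function name is illustrative):
import numpy as np

def cosine_similarity(queries: np.ndarray, docs: np.ndarray) -> np.ndarray:
    # Normalize rows so the dot product becomes cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return q @ d.T  # shape: (num_queries, num_docs)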
Data Processing & Serialization
4. JSON Libraries
- ujson: Fast JSON serialization for performance-critical paths
- json-repair: Robust JSON parsing for LM outputs
- Usage: Throughout adapters, caching, data persistence
import ujson
import json_repair
data = json_repair.loads(completion) # Handles malformed JSON from LMs
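For example, LMs often emit trailing commas or other near-JSON that json.loads would reject; json-repair recovers a valid object:
import json_repair
raw = '{"answer": "42", "confidence": 0.9,}'  # trailing comma, a typical LM slip
data = json_repair.loads(raw)  # -> {'answer': '42', 'confidence': 0.9}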
5. PyYAML
- Configuration Files: Loading YAML configs
- Dataset Definitions: Dataset configuration files
- Usage: alfworld integration, configuration management
import yaml
with open("config.yaml") as f:  # illustrative path
    config = yaml.safe_load(f)
Machine Learning & AI
6. Transformers/HuggingFace (Optional but Common)
- Local Models: Running local language models
- Embeddings: Sentence transformers for retrieval
- Tokenization: Token counting and processing
- Usage: Local model providers, embedding generation
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained(model_name)
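A minimal embedding sketch with mean pooling (the model name is illustrative, and DSPy's actual providers may pool differently):
from transformers import AutoModel, AutoTokenizer
import torch

name = "sentence-transformers/all-MiniLM-L6-v2"  # illustrative embedding model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
inputs = tokenizer(["What is DSPy?"], padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
embedding = outputs.last_hidden_state.mean(dim=1)  # mean-pool token states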
7. OpenAI SDK
- Direct API Calls: Fine-tuning, embeddings
- Structured Outputs: OpenAI-specific features
- File Management: Uploading training data
- Usage: OpenAI provider, fine-tuning, embeddings
import openai
response = openai.chat.completions.create(...)
8. Torch/PyTorch (Optional)
- Local Model Inference: Running PyTorch models
- GPU Operations: CUDA operations for local models
- Tensor Operations: Model computations
- Usage: Local model providers, custom model integration
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
Vector Databases & Search
9. FAISS (Facebook AI Similarity Search)
- Vector Search: Fast similarity search for embeddings
- Indexing: Building and querying vector indices
- Usage: Embeddings retriever, KNN operations
import faiss
index = faiss.IndexFlatL2(dimension)
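Building and querying end to end, a minimal sketch with random vectors standing in for real embeddings:
import numpy as np
import faiss

dim = 384                                             # illustrative embedding size
index = faiss.IndexFlatL2(dim)                        # exact L2 search
vectors = np.random.rand(1000, dim).astype("float32")
index.add(vectors)                                    # index the corpus
query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 5)               # top-5 nearest neighbors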
10. ChromaDB, Pinecone, Weaviate, etc.
- Vector Database Clients: Each has its own integration
- Document Storage: Persistent vector storage
- Retrieval: Semantic search capabilities
- Usage: Specific retriever modules for each DB
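As a representative example, a minimal ChromaDB round trip (in-memory client; the documents and IDs are illustrative):
import chromadb

client = chromadb.Client()  # ephemeral, in-memory instance
collection = client.create_collection("docs")
collection.add(
    documents=["DSPy compiles prompts.", "FAISS does fast vector search."],
    ids=["doc1", "doc2"],
)
results = collection.query(query_texts=["prompt optimization"], n_results=1)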
Data Validation & Type Safety
11. Pydantic (Already covered extensively)
- Core Type System: Signatures, custom types, validation
- JSON Schema: Structured outputs, API validation
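A minimal sketch of the validation-plus-schema pattern (the model is illustrative, not a DSPy class):
from pydantic import BaseModel, Field

class Answer(BaseModel):
    text: str = Field(description="The final answer")
    confidence: float = Field(ge=0.0, le=1.0)

Answer.model_validate({"text": "42", "confidence": 0.9})  # raises on bad data
schema = Answer.model_json_schema()  # feeds structured outputs / API validation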
12. Typing Extensions
- Advanced Types: Union, Literal, get_origin, get_args
- Type Introspection: Runtime type checking
- Usage: Throughout signature system, type validation
from typing import get_origin, get_args, Union, Literal
annotation = list[str]           # e.g., a signature field annotation
origin = get_origin(annotation)  # -> list
args = get_args(annotation)      # -> (str,)
Web Framework Integration
13. FastAPI/Starlette (Implied for web deployments)
- API Endpoints: Web service deployment
- Request Validation: Pydantic integration
- Async Support: Async endpoint handling
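A minimal sketch of serving a program behind an endpoint (the names and route are illustrative, not a DSPy API):
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    question: str

@app.post("/predict")
async def predict(query: Query):
    # A real deployment would invoke a compiled DSPy program here.
    return {"answer": f"echo: {query.question}"}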
Development & Testing
14. TQDM
- Progress Bars: Long-running operations
- User Feedback: Visual progress indication
- Usage: Evaluation, training, data processing
from tqdm import tqdm
for item in tqdm(dataset, desc="Processing"):
    ...  # per-item work (evaluation, data prep, etc.)
15. Pandas (Data Analysis)
- Data Manipulation: Dataset processing
- Result Analysis: Evaluation results
- CSV Handling: Data import/export
- Usage: Evaluation framework, data processing
import pandas as pd
df = pd.DataFrame(results)
Utilities & Helpers
16. Hashlib/UUID
- Unique Identifiers: Example IDs, cache keys
- Hashing: Content hashing for caching
- Usage: Throughout for ID generation
import uuid
import hashlib
unique_id = str(uuid.uuid4())
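For content-addressed cache keys, a common pattern is to serialize deterministically and hash; a sketch (the helper name is hypothetical, not DSPy's actual function):
import hashlib
import ujson

def cache_key(request: dict) -> str:  # hypothetical helper
    # Deterministic serialization (sorted keys) keeps the hash stable.
    blob = ujson.dumps(request, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()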
17. Collections
- defaultdict: Default dictionaries
- Counter: Counting operations
- deque: Queue operations
- Usage: Throughout for data structures
from collections import defaultdict, Counter, deque
18. Inspect
- Code Introspection: Getting function signatures
- Source Code: Extracting source for optimization
- Usage: Teleprompters, signature analysis
import inspect
def metric(pred, gold): return pred == gold  # illustrative function
signature = inspect.signature(metric)  # parameter names and defaults
source = inspect.getsource(metric)     # source text, used during optimization
19. Pathlib
- File Operations: Modern path handling
- Cross-platform: Path manipulation
- Usage: File I/O, model saving/loading
from pathlib import Path
path = Path(directory) / filename
Specialized Integrations
20. CloudPickle
- Serialization: Advanced object serialization
- Model Persistence: Saving complex models
- Usage: Model saving/loading
import cloudpickle
with open("program.pkl", "wb") as f:  # binary mode required
    cloudpickle.dump(model, f)
21. Asyncer
- Async Utilities: Converting sync to async
- Thread Pool: Async thread execution
- Usage: utils/asyncify.py
import asyncer
async_func = asyncer.asyncify(sync_func)
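Usage sketch: the wrapped function runs in a worker thread and can be awaited (the function is illustrative):
import asyncio
import asyncer

def slow_sync(x: int) -> int:
    return x * 2  # stand-in for blocking work

async def main():
    result = await asyncer.asyncify(slow_sync)(21)  # runs in a thread pool
    print(result)  # 42

asyncio.run(main())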
22. DiskCache
- Disk Caching: Persistent caching
- Performance: Fast disk-based cache
- Usage: Caching system
import diskcache
cache = diskcache.FanoutCache(directory)
23. CacheTools
- Memory Caching: LRU and other cache strategies
- Performance: In-memory caching
- Usage: Caching system
import cachetools
cache = cachetools.LRUCache(maxsize=1000)
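Since both caches appear in the caching system, here is a sketch of the common layered pattern, memory in front of disk (the general pattern, not necessarily DSPy's exact implementation):
import cachetools
import diskcache

memory = cachetools.LRUCache(maxsize=1000)
disk = diskcache.FanoutCache("/tmp/dspy_cache")  # illustrative directory

def cached_get(key):
    # Fast path: in-memory LRU; fall back to the persistent disk cache.
    if key in memory:
        return memory[key]
    value = disk.get(key)
    if value is not None:
        memory[key] = value
    return value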
Key Observations
- LiteLLM is the most critical dependency: it is the universal interface to all language models
- Pydantic forms the type-system backbone: used everywhere for validation
- NumPy is essential for embeddings and retrieval: all vector operations depend on it
- JSON handling is crucial: ujson for performance, json-repair for robustness on LM outputs
- Requests is the HTTP workhorse: used throughout for external API calls
- Vector DB clients vary by integration: each retrieval system brings its own dependencies
The codebase is designed to be modular with optional dependencies: you can use DSPy without installing every vector database client, but LiteLLM, Pydantic, and the core Python libraries are essential.