DSPy 3.x: Compile and Optimize LLM Pipelines Automatically
If you've spent hours rewriting the same prompt to squeeze out a few more correct answers, DSPy offers a different deal: describe what your LLM pipeline should do, then let an optimizer figure out the best prompts automatically.
DSPy (Declarative Self-improving Python) comes from Stanford NLP. The core idea is that prompts are not strings you tune by hand — they are parameters of a program, and those parameters should be compiled to maximize a measurable metric. Version 3.0 shipped in August 2025 and the framework reached 3.2.0 by early 2026, adding GEPA (accepted at ICLR 2026 as an oral presentation), native reasoning model support, and MLflow observability integration.
This guide explains how DSPy works, walks through its optimizer hierarchy, and shows what Effloow Lab confirmed after installing DSPy 3.2.0 with pip3 install dspy.
Why Prompt Engineering Breaks at Scale
Hand-written prompts have a short shelf life. When you swap the underlying model, upgrade to a new API version, or add a retrieval step, the prompt you spent three days tuning often degrades or breaks. The same problem appears when you need to run the same pipeline across multiple task types or customer segments — you end up with a folder of near-identical .txt files that nobody owns.
Three failure modes make this worse in production:
- Brittleness: small wording changes in a prompt can silently drop accuracy by 10-20%.
- Portability: a prompt tuned for GPT-4 rarely transfers cleanly to Claude or Llama.
- Observability: there is no systematic way to track why accuracy improved after a manual edit.
DSPy addresses all three by treating prompts as compiled artifacts rather than handwritten strings. You declare what you want; the optimizer decides how to ask for it.
Core Concepts in DSPy
Signatures: Typed Declarations of I/O
A DSPy Signature is a Python class that declares the input and output fields of one LLM call. You annotate fields with types and descriptions; DSPy expands them into structured prompts automatically.
```python
import dspy

class SentimentClassifier(dspy.Signature):
    """Classify the sentiment of a text as positive, negative, or neutral."""

    text: str = dspy.InputField(desc="The text to classify")
    sentiment: str = dspy.OutputField(desc="positive, negative, or neutral")
```
When Effloow Lab ran this on DSPy 3.2.0, the resulting signature object showed:
```
SentimentClassifier(text -> sentiment
    instructions='Classify the sentiment of a text as positive, negative, or neutral.'
    text = Field(annotation=str required=True prefix='Text:')
    sentiment = Field(annotation=str required=True prefix='Sentiment:')
)
```
DSPy handles field prefixes, type enforcement, and output parsing. You write the declaration; DSPy writes the prompt.
Modules: Strategies for Calling the LLM
A Module wraps a Signature with a calling strategy. The most common are:
- dspy.Predict: direct call, single output
- dspy.ChainOfThought: injects a reasoning field before the output
- dspy.ReAct: interleaved reasoning and tool calls
- dspy.ProgramOfThought: generates and executes Python code
- dspy.BestOfN: samples N completions, returns the best by metric
- dspy.Reasoning: captures native chain-of-thought from reasoning models like Claude Opus 4.7
When you wrap SentimentClassifier in ChainOfThought, DSPy automatically expands the signature to include the reasoning step:
```python
cot = dspy.ChainOfThought(SentimentClassifier)
# Prompt now includes: Text: → Reasoning: → Sentiment:
```
This auto-injection is why ChainOfThought typically outperforms bare Predict on reasoning-heavy tasks, without any extra prompt writing.
Optimizers: The Compilation Step
An optimizer takes your program, a labeled dataset, and a metric function, then searches for the best combination of prompts and few-shot examples. This is the core value proposition of DSPy. Three optimizers matter most in 2026:
BootstrapFewShot is the entry point. It runs your program across the training set, collects traces where the metric passes, and injects those traces as few-shot demonstrations. According to the DSPy paper published on OpenReview, GPT-3.5 using BootstrapFewShot outperformed standard few-shot by over 25% on complex reasoning tasks; llama2-13b-chat saw over 65% improvement.
MIPROv2 (Multi-prompt Instruction PRoposal Optimizer v2) is the production choice. It jointly optimizes both the instruction text and the few-shot examples using Bayesian Optimization. In each trial it proposes a new combination of instructions and demos, evaluates on a mini-batch, and updates a surrogate model. The DSPy docs report 5-46% improvement over expert-crafted demonstrations on GPT-family models.
```python
from dspy.teleprompt import MIPROv2

teleprompter = MIPROv2(
    metric=my_accuracy_metric,
    auto="medium",  # "light", "medium", or "heavy" search budget
)

optimized_program = teleprompter.compile(
    dspy.ChainOfThought(SentimentClassifier),
    trainset=train_data,
    valset=val_data,
)

optimized_program.save("sentiment_optimized.json")
```
GEPA (Reflective Prompt Evolution) is the newest addition and was accepted as an Oral paper at ICLR 2026. It maintains a Pareto frontier of candidate prompts and uses textual feedback to guide mutation. The paper (arXiv:2507.19457) reports GEPA outperforms MIPROv2 by over 10% on average and beats GRPO by 6% on average while using up to 35x fewer rollouts. On AIME-2025, GPT-4.1 Mini went from 46.6% to 56.6% accuracy after GEPA optimization.
```python
import dspy

optimized = dspy.GEPA(
    metric=accuracy_metric,
    num_threads=8,
).compile(program, trainset=train_data)
```
The Optimizer Hierarchy
Choosing the right optimizer depends on how much compute you can spend:
| Optimizer | What It Optimizes | Compute Cost | Typical Lift |
|---|---|---|---|
| LabeledFewShot | Examples only (from labels) | Zero | Baseline |
| BootstrapFewShot | Examples via trace collection | Low (1x training) | +25–65% |
| MIPROv2 | Instructions + examples (Bayesian) | Medium (N trials) | +5–46% over expert |
| GEPA | Reflective evolution (Pareto) | High | +10% over MIPROv2 |
For most teams, MIPROv2 is the practical choice: it covers both prompt instruction and few-shot examples, uses Bayesian search to be more efficient than brute force, and serializes to a portable JSON file you can check into Git. GEPA is worth the compute budget when you are chasing the last few points of accuracy on a hard benchmark or you have explicit textual feedback to feed back into the optimization loop.
Practical Application: Sentiment Classifier with MIPROv2
Here is the minimal working pattern for optimizing a classifier from scratch. You need a configured LM (any provider via LiteLLM syntax) and a labeled dataset.
```python
import dspy
from dspy.teleprompt import MIPROv2

# 1. Configure LM (low-cost options: gpt-4o-mini, claude-haiku-4-5, or ollama/llama3)
lm = dspy.LM("openai/gpt-4o-mini", api_key="...")
dspy.configure(lm=lm)

# 2. Define typed signature
class ReviewClassifier(dspy.Signature):
    """Classify a product review as positive, negative, or neutral."""

    review: str = dspy.InputField()
    label: str = dspy.OutputField(desc="positive, negative, or neutral")

# 3. Define metric
def accuracy_metric(example, pred, trace=None):
    return example.label == pred.label

# 4. Create program and dataset
program = dspy.ChainOfThought(ReviewClassifier)
trainset = [
    dspy.Example(review="Love this product!", label="positive").with_inputs("review"),
    dspy.Example(review="Terrible quality.", label="negative").with_inputs("review"),
    # ... 20-50 labeled examples recommended
]

# 5. Compile with MIPROv2
optimizer = MIPROv2(metric=accuracy_metric, auto="medium")
optimized = optimizer.compile(program, trainset=trainset)

# 6. Save and reload (load() restores state into a fresh instance in place)
optimized.save("review_classifier.json")
reloaded = dspy.ChainOfThought(ReviewClassifier)
reloaded.load("review_classifier.json")
```
The output JSON contains the optimized instruction text and few-shot demos for every predictor in the pipeline. When you reload it, you get the same accuracy without re-running the optimizer.
Working with Multi-Stage Pipelines
DSPy's real advantage shows in pipelines with multiple LLM calls. Each stage is a separate Module; each Module has its own optimizer parameters. MIPROv2 tunes every stage simultaneously.
```python
class RAGPipeline(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=3)
        self.answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.answer(context=context, question=question)
```
This short class is fully optimizable. Pass it to MIPROv2 and it will tune every LLM-calling stage in the pipeline independently, here the answer generation (add a query-rewriting predictor and it gets tuned too), without you writing or modifying a single prompt string. The DSPy docs show a StackExchange RAG example going from 53% to 61% on a standard QA metric using this pattern.
Common Mistakes
Mistake 1: Skipping the metric. Without a well-defined metric, optimizers have nothing to search toward. Exact match is fine to start; semantic similarity metrics like BERTScore work when exact match is too strict.
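Two illustrative metric sketches, neither a DSPy built-in: strict exact match, and a normalized variant that tolerates case and whitespace noise in string outputs:

```python
def exact_match(example, pred, trace=None):
    """Strict string equality between the gold label and the prediction."""
    return example.label == pred.label

def normalized_match(example, pred, trace=None):
    """Case- and whitespace-insensitive comparison for noisy string outputs."""
    def norm(s):
        return " ".join(s.lower().split())
    return norm(example.label) == norm(pred.label)
```

Both follow the (example, pred, trace=None) signature that DSPy optimizers expect; returning a float instead of a bool also works when you want graded credit.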
Mistake 2: Too small a training set. BootstrapFewShot needs at least 20 examples to collect enough high-quality traces. MIPROv2 works better with 50-100. With fewer examples, the optimizer overfits to noise.
Mistake 3: Running MIPROv2 with auto="heavy" on an expensive model. Heavy mode runs many Bayesian trials, each evaluating against the full validation set. On GPT-4o, this can cost $10-30 per optimization run. Start with auto="light" to verify the pipeline works, then scale up.
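Before committing to a heavy run, it helps to do the arithmetic. The sketch below is a rough upper bound with made-up defaults; substitute your own trial count, validation-set size, and per-token price:

```python
def estimate_optimization_cost(num_trials, evals_per_trial,
                               tokens_per_call=1500, usd_per_1k_tokens=0.005):
    """Rough upper bound on optimizer spend: trials x evals x token price."""
    total_calls = num_trials * evals_per_trial
    return total_calls * tokens_per_call / 1000 * usd_per_1k_tokens

# A hypothetical heavy run: 100 trials, each scored on 50 validation examples
print(f"${estimate_optimization_cost(100, 50):.2f}")  # → $37.50
```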
Mistake 4: Not serializing. An optimized DSPy program that only lives in memory is wasted. Always call .save("program.json") and commit the file alongside your code. This is your compiled artifact — treat it like a build output.
Mistake 5: Assuming optimization transfers across models. Prompts optimized for gpt-4o-mini may not transfer to claude-haiku-4-5. Re-run the optimizer when you switch providers; the JSON file format is portable, but the content should be re-derived.
Observability and Production Integration
DSPy 3.x added native MLflow integration as a first-class concern. Every optimizer run logs trials, metrics, and the winning program to an MLflow experiment:
```python
import mlflow

mlflow.dspy.autolog()

with mlflow.start_run():
    optimized = MIPROv2(metric=my_metric).compile(program, trainset=train)
```
This gives you a full audit trail of what was tried, what metric score each trial achieved, and which configuration won. Teams building production AI pipelines can track optimization history across model versions the same way they track software test runs.
For cost management, DSPy's caching layer (dspy.configure_cache) deduplicates identical LLM calls during optimization — essential when running multi-trial optimizers against large training sets.
DSPy vs LangChain in 2026
The comparison that developers ask most often: DSPy has lower runtime overhead (~3.5ms framework latency vs ~10ms for LangChain, per Qdrant's 2025 benchmark), but the real distinction is philosophical. LangChain assumes you will write and maintain prompt strings. DSPy assumes you will declare types and let the optimizer write them.
For applications with a clear measurable metric — classification accuracy, answer correctness, retrieval precision — DSPy's compilation model wins on both maintainability and peak performance. For applications where the behavior is hard to quantify or where you need rapid prototyping without training data, LangChain's explicit prompt approach remains practical.
The two are not mutually exclusive. Teams sometimes use LangChain for plumbing (integrations, callbacks, memory) while wrapping the core LLM calls in DSPy Signatures for optimization.
Q: Do I need fine-tuned models to use DSPy?
No. DSPy optimizes prompts and few-shot examples, not model weights. BootstrapFewShot, MIPROv2, and GEPA all operate at inference time against any provider. BootstrapFinetune and BetterTogether can extend into fine-tuning if you want that path, but the prompt-only route is where most teams start.
Q: Can DSPy work with local models?
Yes. DSPy uses LiteLLM under the hood, so any model served through Ollama (ollama/llama3, ollama/qwen3) or any local OpenAI-compatible server works:
```python
lm = dspy.LM("ollama/llama3", api_base="http://localhost:11434")
dspy.configure(lm=lm)
```
This makes it viable for air-gapped environments and cost-sensitive experimentation.
Q: How many labeled examples do I need?
For BootstrapFewShot: 20-50 is a reasonable minimum. For MIPROv2 auto="medium": 50-100. The optimizer can work with less, but quality drops as the training set shrinks. For GEPA, which uses explicit textual feedback, you can sometimes get useful signal with fewer examples if your feedback strings are detailed.
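For carving a valset out of whatever labeled pool you have, a seeded shuffle-and-slice is enough. This is a plain-Python sketch; a list of dspy.Example objects splits the same way:

```python
import random

def train_val_split(examples, val_fraction=0.2, seed=42):
    """Deterministically shuffle, then carve off a validation slice."""
    pool = list(examples)
    random.Random(seed).shuffle(pool)
    n_val = max(1, int(len(pool) * val_fraction))
    return pool[n_val:], pool[:n_val]

trainset, valset = train_val_split(range(100))  # 80 train, 20 val
```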
Q: What is the difference between MIPROv2 light, medium, and heavy?
The auto parameter controls the number of Bayesian trials and the mini-batch size per trial. light is a quick sanity check (a few dozen LLM calls), medium is the default production setting (hundreds of calls), and heavy runs an exhaustive search suitable for high-stakes pipelines where compute is not a constraint.
Key Takeaways
DSPy 3.x offers a systematic alternative to manual prompt engineering. The framework is installable in a single pip install dspy and confirmed working at version 3.2.0 in Effloow Lab's environment. The optimizer hierarchy — BootstrapFewShot for a quick baseline, MIPROv2 for production joint optimization, GEPA for peak accuracy — maps cleanly to development stages.
The benchmark numbers are real but come from the Stanford NLP paper and official documentation, not from Effloow-run trials: GPT-3.5 gaining 25%+ over standard few-shot, MIPROv2 beating expert-crafted prompts by 5-46%, and GEPA improving on MIPROv2 by another 10%+ (ICLR 2026 Oral). The pattern that unlocks these gains is: define typed signatures, pick a module strategy, measure with a metric, compile with an optimizer, serialize the result.
DSPy is the right choice when you have a measurable metric and want prompts that are maintainable, portable, and optimized systematically. The install is trivial, the optimizer API is approachable after the first pipeline, and the compiled JSON artifact treats prompts as a build output rather than tribal knowledge embedded in comments. If you are still editing prompt strings by hand for production pipelines, DSPy 3.x is worth an afternoon of experimentation.